BradRuderman / pyhs2

MIT License

cur.fetch() can be extremely slow #36

Open BAM-BAM-BAM opened 9 years ago

BAM-BAM-BAM commented 9 years ago

I haven't quite figured out when this is the case, but some queries can cause cursor.fetch() to take several hours to complete, even though running the same query from the Hive CLI and piping it to a file takes only about 60 seconds.

Is there anything I can do to help you diagnose this problem?

BradRuderman commented 9 years ago

How large is the data set you are retrieving? Is this the case for all queries or just a few?

BAM-BAM-BAM commented 9 years ago

Hive 0.13.0.2.1.4.0-632

Correction: piping to a file from the CLI takes about 5-8 minutes, not 1 minute.

The data is stored in a managed table.

The returned data size is 4.7M rows * 13 columns, or about 306MB (Hive 'describe extended tablename' reports that the size of the table is 878MB).

All queries seem to run slowly using cursor.fetch().

My "feeling" is that something in the python code is inefficiently allocating & filling memory, or maybe just grabbing one or a few rows at a time.

When I watch the memory usage of the python process it climbs slowly but steadily.

BAM-BAM-BAM commented 9 years ago

Using pyhs2 cursor.fetch() on our cluster, reading a 66-column table (a single 10-character date column, an integer column, and 64 float columns):

fetched 3,000 rows in 10.7 secs
fetched 30,000 rows in 104.2 secs

So it’s reading about 20,000 elements a second (30,000 rows x 66 columns ~= 2,000,000 elements). Even a slow hard drive reads at around 50MB/s, so that rate would only make sense if each element were about 2,500 bytes. But each element is a float (call it double precision, 8 bytes), so the effective read speed is roughly 300x slower than it should be.

From this, I’d guess that 300,000 rows would take about 17 minutes and 3,000,000 rows about three hours.
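
For reference, here is the same arithmetic as a quick script; the 50 MB/s disk figure and the 8-byte element size are just the assumptions used above:

```python
# Back-of-the-envelope check of the numbers above.
rows, cols, secs = 30000, 66, 104.2

elements = rows * cols              # ~2.0 million elements
elements_per_sec = elements / secs  # ~19,000 elements/s

# A slow disk reads ~50 MB/s; the observed rate would imply ~2,600 bytes
# per element, but a double is only 8 bytes -- roughly 300x off.
implied_bytes = 50e6 / elements_per_sec
print("slowdown vs raw I/O: ~%.0fx" % (implied_bytes / 8))

# Time scales linearly with row count, so:
print("300,000 rows: ~%.0f minutes" % (secs * 10 / 60))
print("3,000,000 rows: ~%.1f hours" % (secs * 100 / 3600))
```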

I'm running cProfile now...
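
For anyone who wants to reproduce this kind of profile, here is a minimal sketch; the connection settings and table name are placeholders, and the with-statement usage follows the pattern in the project README:

```python
import cProfile
import pstats

import pyhs2

# Placeholder connection settings and table name -- adjust for your cluster.
with pyhs2.connect(host="hiveserver2.example.com",
                   port=10000,
                   authMechanism="PLAIN",
                   user="hive",
                   password="",
                   database="default") as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT * FROM some_table LIMIT 30000")
        profiler = cProfile.Profile()
        profiler.enable()          # profile only the fetch step
        rows = cur.fetch()
        profiler.disable()

pstats.Stats(profiler).sort_stats("tottime").print_stats(30)
```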

BAM-BAM-BAM commented 9 years ago

cProfile stats:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
11753141/11753119   85.004    0.000  279.991    0.000 TTransport.py:54(readAll)
 11753118   63.297    0.000  162.265    0.000 thrift_sasl.py:148(read)
      643   59.518    0.093   59.518    0.093 {method 'recv' of '_socket.socket' objects}
  1980000   49.948    0.000  434.618    0.000 ttypes.py:1773(read)
35261749/35261748   49.587    0.000   49.587    0.000 {len}
  7248226   40.663    0.000  189.522    0.000 TBinaryProtocol.py:195(readByte)
  7218080   34.901    0.000  307.126    0.000 TBinaryProtocol.py:154(readFieldBegin)
    30000   28.844    0.001  478.409    0.016 ttypes.py:1894(read)
 11753128   22.575    0.000   22.575    0.000 {method 'read' of 'cStringIO.StringI' objects}
 11722981   20.213    0.000   20.213    0.000 {_struct.unpack}
  3227507   17.853    0.000   83.466    0.000 TBinaryProtocol.py:200(readI16)
   930000   15.269    0.000  121.194    0.000 ttypes.py:1320(read)
  1020000   11.462    0.000   72.414    0.000 ttypes.py:1440(read)
        4    7.290    1.822  496.383  124.096 cursor.py:219(_fetch)
  1980000    7.114    0.000    7.114    0.000 ttypes.py:1764(__init__)
  4020468    6.236    0.000    6.236    0.000 {method 'append' of 'list' objects}
  3990573    6.000    0.000    6.000    0.000 TBinaryProtocol.py:148(readStructBegin)
  3990573    5.984    0.000    5.984    0.000 TBinaryProtocol.py:151(readStructEnd)
  1980000    5.471    0.000    5.471    0.000 cursor.py:24(get_value)
   932268    5.276    0.000   83.881    0.000 TBinaryProtocol.py:205(readI32)
  3227507    4.872    0.000    4.872    0.000 TBinaryProtocol.py:161(readFieldEnd)
        4    2.353    0.588  498.736  124.684 cursor.py:71(fetchSet)
  1020000    1.922    0.000    1.922    0.000 ttypes.py:1437(__init__)
   930000    1.769    0.000    1.769    0.000 ttypes.py:1317(__init__)
   314964    1.737    0.000    8.287    0.000 TBinaryProtocol.py:215(readDouble)
    30000    0.500    0.000    4.830    0.000 ttypes.py:1500(read)
       66    0.403    0.006    0.403    0.006 {pandas.lib.maybe_convert_objects}
        4    0.311    0.078  478.833  119.708 ttypes.py:1971(read)
    30138    0.187    0.000    1.717    0.000 TBinaryProtocol.py:173(readListBegin)
    30149    0.183    0.000    1.545    0.000 TBinaryProtocol.py:220(readString)
       33    0.104    0.003    0.104    0.003 {pandas.lib.is_possible_datetimelike_array}
    30000    0.061    0.000    0.061    0.000 ttypes.py:1497(__init__)
    30000    0.060    0.000    0.060    0.000 ttypes.py:1891(__init__)
        1    0.049    0.049    0.049    0.049 {pandas.lib.to_object_array}
    30138    0.047    0.000    0.047    0.000 TBinaryProtocol.py:178(readListEnd)

zklopman-rovi commented 9 years ago

The answer is simple: don't use fetch(). Use one of the standard Python PEP 249 calls instead: fetchall() (not a good idea for a large data set), fetchone() (for going one row at a time), or fetchmany() (to return a block of records at a time). Using fetch() can (and probably does) overwhelm your system, and it should generally not be used.
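
A minimal sketch of the fetchmany() approach; the connection settings, table name, and 10,000-row block size are placeholders, not anything specific to this issue:

```python
import pyhs2

total = 0
# Placeholder connection settings -- adjust for your environment.
with pyhs2.connect(host="hiveserver2.example.com",
                   port=10000,
                   authMechanism="PLAIN",
                   user="hive",
                   password="",
                   database="default") as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT * FROM some_table")
        while True:
            block = cur.fetchmany(10000)   # up to 10k rows per call
            if not block:
                break
            total += len(block)            # stand-in for real per-block work
print("%d rows fetched" % total)
```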

anshanno commented 8 years ago

@zklopman-rovi what option would you recommend works best for a large data set? Sorry, I am a little late to the party.

zklopman commented 8 years ago

That depends on your needs and on what you consider "large". In general, you should not overwhelm memory. Assuming a typical 4GB of available RAM, I would say that around a million normal-sized (not very wide) records can still be handled with fetchall(). If you have more than that, bring them over in blocks (10,000 or 100,000 records) using fetchmany(). If you fetch in blocks, you can also dispatch them to separate threads to be worked on, where feasible. If your number of records is far beyond that (billions or more), you should not be exporting them to a lonely Python program at all; harness the power of Hive (or other Hadoop technologies) to do the work and trim the result down. Exact details depend, of course, on the application and environment.
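
A rough sketch of the fetchmany()-plus-threads idea, using concurrent.futures (standard in Python 3, available for Python 2 via the futures backport); the connection settings, block size, and worker count are all placeholders:

```python
from concurrent.futures import ThreadPoolExecutor

import pyhs2

def handle_block(rows):
    # Stand-in for whatever per-block work you need to do.
    return len(rows)

# Placeholder connection settings -- adjust for your environment.
with pyhs2.connect(host="hiveserver2.example.com", port=10000,
                   authMechanism="PLAIN", user="hive",
                   password="", database="default") as conn, \
     conn.cursor() as cur, \
     ThreadPoolExecutor(max_workers=4) as pool:
    cur.execute("SELECT * FROM some_table")
    futures = []
    while True:
        block = cur.fetchmany(100000)      # fetch on the main thread
        if not block:
            break
        futures.append(pool.submit(handle_block, block))  # process elsewhere
    processed = sum(f.result() for f in futures)
print("%d rows processed" % processed)
```

Note that each pending future keeps its block in memory, so for very long result sets you would want to bound the number of in-flight blocks rather than queueing them all.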

hope this helps,

Zachi Klopman

anshanno commented 8 years ago

@zklopman well, I have about 170 million rows and 200 columns, so by big data standards it's not that big. It does seem to be far too large to pull in a reasonable amount of time with this module, though: it takes around 45 seconds per 10k block. Fetching seems to be the bottleneck of the module.

zklopman commented 8 years ago

Hi,

170,000,000 rows x 200 columns x 5 bytes/column (an assumption; it could be much bigger) ~ 170GB of data, and that does not even account for protocol overhead. That is usually far too much to handle on a little laptop or small server, which typically has no more than 16GB of RAM. At this size, you should do your calculations on the cluster or use other means to transfer your data.

As for the interface speed:

10,000 x 200 x 5 bytes ~ 10MB in 45s seems slow (roughly 0.2 MB/s). I would expect it to be at least 10x faster with normal hardware and network speeds. Of course, if the typical column width is not 5 but 50 bytes, that makes a huge difference. There are also network performance tuning issues, but those vary wildly between locations.
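
One way to measure the actual interface speed on a given setup is to time each fetchmany() block; again, the connection settings and query are placeholders:

```python
import time

import pyhs2

BLOCK = 10000
# Placeholder connection settings and query -- adjust for your environment.
with pyhs2.connect(host="hiveserver2.example.com", port=10000,
                   authMechanism="PLAIN", user="hive",
                   password="", database="default") as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT * FROM some_table LIMIT 100000")
        while True:
            start = time.time()
            block = cur.fetchmany(BLOCK)
            if not block:
                break
            elapsed = max(time.time() - start, 1e-6)
            # Very rough payload estimate: bytes of the string form of each row.
            approx_mb = sum(len(str(row)) for row in block) / 1e6
            print("%d rows in %.1fs (~%.2f MB/s)"
                  % (len(block), elapsed, approx_mb / elapsed))
```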

hope this helps,

Zachi