Datatamer / tamr-client

Programmatically interact with Tamr
https://tamr-client.readthedocs.io
Apache License 2.0

Streaming large datasets into memory #281

Closed nargesmsh92 closed 4 years ago

nargesmsh92 commented 5 years ago

🐛 Connection broken: IncompleteRead while fetching data

Right now, streaming a large dataset into memory can cause either:

  1. Connection broken: IncompleteRead error: this comes from `http.client.IncompleteRead`, raised by the standard library underneath `requests`. It happens when the server wrongly closes the session mid-stream; `requests` then surfaces a `ChunkedEncodingError` and the bytes received so far are lost.
  2. OOM error: this is not really caused by the tamr-client package itself, but by user code that converts the tamr-client generator object into a data frame. Dumping all of the received data into memory as a list of JSON records can exhaust memory. For instance, in the code below, the tamr-client `record_generator` object is converted to a list of dicts, and that list is then used to create a data frame; this resulted in an OOM error (see attached screenshot).
    record_generator = unify.datasets.by_resource_id(datasetid).records()
    print('Start converting generator object to strings')
    records_as_strings = [
        {key: "|".join(val) if isinstance(val, list) else val
         for key, val in record.items()}
        for record in record_generator
    ]
    print('Start converting to DataFrame')
    df = pd.DataFrame(records_as_strings)

    I was able to work around this issue by changing my code as follows:

    df = pd.DataFrame()
    for record in record_generator:
        record_as_strings = {key: "|".join(val) if isinstance(val, list) else val
                             for key, val in record.items()}
        df = df.append([record_as_strings])

But then the first error happened.
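
The workaround above still builds the frame one row at a time, which copies the `DataFrame` on every `append`. A middle ground is to batch records and concatenate once at the end; a minimal sketch, assuming `pandas` is available and the generator yields plain dicts (the helper name is mine, not part of tamr-client):

```python
import pandas as pd

def records_to_dataframe(record_generator, chunk_size=10_000):
    """Build a DataFrame from a record generator in fixed-size batches,
    so neither the full record list nor repeated row appends are needed."""
    chunks = []
    batch = []
    for record in record_generator:
        # Same flattening as in the snippets above: join list values with "|".
        batch.append({
            key: "|".join(val) if isinstance(val, list) else val
            for key, val in record.items()
        })
        if len(batch) >= chunk_size:
            chunks.append(pd.DataFrame(batch))
            batch = []
    if batch:
        chunks.append(pd.DataFrame(batch))
    return pd.concat(chunks, ignore_index=True) if chunks else pd.DataFrame()
```

This bounds peak overhead to roughly one batch of dicts plus the accumulated frames, rather than the whole dataset as a single list of dicts.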

🤔 Expected Behavior/Possible Solution

  1. Maybe use try/except in the records function to keep retrying the connection to Unify.
  2. Fetch the data in batches using iter_content(chunk_size).
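
Option 1 could look roughly like the sketch below. `fetch_records` is a hypothetical zero-argument callable standing in for `dataset.records()`, and the caveat in the comment matters: without a server-side offset, a reconnect re-yields records already seen.

```python
import time

def records_with_retry(fetch_records, max_retries=3, backoff_seconds=2.0):
    """Yield records from fetch_records(), re-opening the stream if the
    connection drops mid-read (a sketch, not part of tamr-client)."""
    attempt = 0
    while True:
        emitted = 0  # records yielded on this attempt
        try:
            for record in fetch_records():
                # CAVEAT: after a reconnect this restarts from the beginning,
                # so records yielded before the failure are yielded again.
                # A real implementation would skip `emitted` records, or use
                # a server-side offset, to avoid duplicates.
                yield record
                emitted += 1
            return
        except ConnectionError:
            attempt += 1
            if attempt > max_retries:
                raise
            time.sleep(backoff_seconds * attempt)
```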

🌍 Your Environment

| Software | Version(s) |
| --- | --- |
| tamr-unify-client | 0.3.0 |
| Tamr Unify server | v2019.015 |
| Python | 3.6.8 |
| Operating System | |
nbateshaus commented 4 years ago

Something like `more_itertools.flatten(more_itertools.chunked(dataset.records(), CHUNK_SIZE))` should pre-fetch batches of `CHUNK_SIZE` records, causing less thrashing. I've found a `CHUNK_SIZE` of 1,000 to 10,000 to be useful.
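
What `chunked` plus `flatten` buys here can be reproduced with the standard library alone; a minimal equivalent, for readers without `more_itertools` installed:

```python
from itertools import islice

def buffered(records, chunk_size):
    """Stdlib equivalent of more_itertools.flatten(more_itertools.chunked(...)):
    materialize chunk_size records at a time, then yield them back out one by
    one, so the HTTP stream is read in quick bursts rather than one record
    per downstream processing step."""
    it = iter(records)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield from chunk
```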

nbateshaus commented 4 years ago

FWIW, I just ran into a connection timeout using the above code snippet, so it's not the be-all and end-all.

pcattori commented 4 years ago

Is this an issue you would also encounter simply loading the dataset from a local CSV file into a pandas.DataFrame? Or does that work, but the client adds unreasonable memory overhead that causes this?

From what I've read, the issue seems to be inherent to storing all that data in a DataFrame in memory. If so, is the DataFrame just a means to an end (e.g. to then save the DF as a CSV)?

@nargesmsh92 @nbateshaus

pcattori commented 4 years ago

@nbateshaus do you know if the timeout you were seeing is client-side or server-side?

nbateshaus commented 4 years ago

It's a server-side timeout. The main one I run into is when streaming (HTTP chunked transfer): there's roughly a 1-minute timeout between chunks. Once the client has read a chunk, if it doesn't read the next chunk from the server within a minute, the server closes the connection. There are also timeouts for the initial read and for reading the entire stream.

pcattori commented 4 years ago

#378 added docs for loading datasets into dataframes that should cover most use-cases. If used correctly, these techniques should avoid client-side failures due to memory usage.

Keeping this issue open, as we don't have any method (client-side, server-side, or hybrid) for avoiding the timeouts described by @nbateshaus above.

pcattori commented 4 years ago

Closing this since there has been inactivity and we now have docs that seem to address this.

Feel free to reopen if there are more improvements you'd like to request.