Closed: nargesmsh92 closed this issue 4 years ago
Something like `more_itertools.flatten(more_itertools.chunked(dataset.records(), CHUNK_SIZE))` should pre-fetch batches of CHUNK_SIZE records, causing less thrashing. I've found a CHUNK_SIZE of 1,000 to 10,000 to be useful.
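For reference, a runnable form of that pattern might look like the sketch below (the CHUNK_SIZE value is an arbitrary pick from the suggested range, and `dataset` is assumed to be an already-connected client dataset whose `records()` method yields dicts):

```python
import more_itertools

CHUNK_SIZE = 5_000  # assumed value, within the 1,000-10,000 range suggested above

# chunked() eagerly pulls CHUNK_SIZE records at a time from the stream,
# and flatten() turns those batches back into a single iterable of records.
records = more_itertools.flatten(
    more_itertools.chunked(dataset.records(), CHUNK_SIZE)
)

for record in records:
    ...  # process one record at a time
```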
FWIW, I just ran into a connection timeout using the above code snippet, so it's not the be-all / end-all.
Is this an issue you would also encounter simply loading the dataset from a local CSV file into a `pandas.DataFrame`? Or does that work, but the client is adding unreasonable memory overhead that causes this?
From what I've read, the issue seems to be inherent to storing all that data in a DataFrame in memory. If so, is the DataFrame just a means to an end (e.g. to then save the DF as a CSV)?
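If the DataFrame really is just an intermediate step before writing a CSV, one way to sidestep holding everything in memory would be to stream records straight to disk. A minimal sketch, assuming `dataset.records()` yields flat dicts with a consistent set of keys (the output file name is arbitrary):

```python
import csv

with open("records.csv", "w", newline="") as f:
    writer = None
    for record in dataset.records():  # assumed to yield flat dicts
        if writer is None:
            # Take the column names from the first record.
            writer = csv.DictWriter(f, fieldnames=list(record))
            writer.writeheader()
        writer.writerow(record)
```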
@nargesmsh92 @nbateshaus
@nbateshaus do you know if the timeout you were seeing is client-side or server-side?
It's a server timeout. The main one that I run into is when streaming (HTTP chunked transfer): there's something like a 1-minute timeout between chunks. Once the client has read a chunk, if it doesn't read the next chunk from the server within a minute, the server closes the connection. There is also a timeout for the initial read and a timeout for reading the entire stream.
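One pattern for staying under the inter-chunk timeout is to drain the streamed response to disk as fast as possible and do the per-record work afterwards, so no slow processing sits between consecutive chunk reads. A rough sketch using plain `requests` (the URL, auth, and timeout values are placeholders, not the client's actual endpoint):

```python
import requests

url = "https://example.com/api/dataset/records"  # hypothetical endpoint

with requests.get(url, stream=True, timeout=(10, 60)) as resp:
    resp.raise_for_status()
    with open("records.ndjson", "wb") as out:
        # Write each chunk to disk immediately so the gap between
        # consecutive reads stays well under the server's ~1-minute limit.
        for chunk in resp.iter_content(chunk_size=1 << 20):
            out.write(chunk)

# Parse records from the local file afterwards, at whatever pace
# the downstream code needs.
```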
Keeping this issue open as we don't have any methods (client-side, server-side, or hybrid) for avoiding the timeouts described by @nbateshaus above.
Closing this since there has been inactivity and we now have docs that seem to address this.
Feel free to reopen if there are more improvements you'd like to request.
🐛 Connection broken: IncompleteRead while fetching data
Right now, streaming a large dataset into memory can fail in one of two ways.

The first is a `Connection broken: IncompleteRead` error. It is caused by `httplib.IncompleteRead`; `httplib` underlies the `requests` package. The error happens when the server wrongly closes the session: the `httplib.IncompleteRead` surfaces as a `ChunkedEncodingError` and the bytes received so far are buried.

The second is an out-of-memory failure. The `tamr-client` record generator object is converted to a list of JSON, and the list of JSON is then used to create a data frame, but this resulted in an OOM issue (picture below). I was able to work around this issue by changing my code as follows:

But then the first error happened.
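The workaround snippet itself isn't reproduced above. A hypothetical sketch of the general idea it points at (building the DataFrame from fixed-size chunks instead of one giant list of JSON records, so only a chunk's worth of raw dicts is held at a time) could look like this:

```python
import more_itertools
import pandas as pd

CHUNK_SIZE = 5_000  # assumed value; tune for available memory

# Convert each chunk to a (more compact) DataFrame as it arrives,
# instead of first materializing every record as a Python dict.
frames = [
    pd.DataFrame(chunk)
    for chunk in more_itertools.chunked(dataset.records(), CHUNK_SIZE)
]
df = pd.concat(frames, ignore_index=True)
```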
🤔 Expected Behavior/Possible Solution
🌍 Your Environment