Datatamer / tamr-client

Programmatically interact with Tamr
https://tamr-client.readthedocs.io
Apache License 2.0

Use Python Client to stream large data set e.g. data set with 10 M records #296

Closed yizhiyintamr closed 4 years ago

yizhiyintamr commented 5 years ago

🙋 feature request

Use the Python Client to stream large data sets, e.g. a data set with 10 M records.

🤔 Expected Behavior

The Python Client does not raise a Connection broken error when streaming a large data set.

😯 Current Behavior

It raises a Connection broken error when streaming a large data set.

💁 Possible Solution

Increase the timeout? I used curl to stream the large data set as a workaround.

🔦 Context

Transamerica sent us two data sets with a total of 10 M records and wanted us to demonstrate how their data science team (working in Domino Lab's Jupyter notebooks) can interact with Tamr to build a pipeline to ingest, export, and analyze data at that scale. I was able to stream data sets of up to 5 M records; however, I needed to use curl to stream the 10 M record data set.

💻 Examples

I was able to stream 5M records, but I got a ('Connection broken: IncompleteRead(138 bytes read, 374 more expected)', IncompleteRead(138 bytes read, 374 more expected)) error when I used dt.records() to stream a 10M-record data set. The connection broke after 8 minutes. I was able to use curl to get the same data set. It would be nice to be able to stream large data sets in Python.
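
For illustration, the curl workaround could be approximated in plain Python roughly as follows. This is a minimal sketch using only the standard library, not how tamr-client is implemented. It assumes the records endpoint returns newline-delimited JSON, and the URL, headers, and one-hour timeout are hypothetical placeholders. The timeout passed to urlopen applies to each blocking socket operation rather than to the whole download, and nothing in this thread confirms that a timeout caused the break.

```python
import json
import urllib.request
from http.client import IncompleteRead
from typing import Dict, Iterable, Iterator


def parse_records(lines: Iterable[bytes]) -> Iterator[dict]:
    """Yield one dict per non-empty line of newline-delimited JSON."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def stream_records(
    url: str, headers: Dict[str, str], timeout: float = 3600.0
) -> Iterator[dict]:
    """Stream records from `url` line by line without buffering the whole body.

    `url`, `headers`, and the default `timeout` are placeholders (assumptions),
    as is the newline-delimited JSON response format.
    """
    request = urllib.request.Request(url, headers=headers)
    count = 0
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            # The response object is iterable and yields one bytes line at a time.
            for record in parse_records(response):
                count += 1
                yield record
    except IncompleteRead as err:
        # Report how far the stream got before the connection broke.
        raise RuntimeError(f"stream broke after {count} records") from err


# Hypothetical usage; the URL and credentials below are placeholders:
# for record in stream_records(
#     "http://localhost:9100/api/versioned/v1/datasets/1/records",
#     {"Authorization": "BasicCreds <token>"},
# ):
#     print(record)
```

Because both functions are generators, records are processed as they arrive instead of being held in memory, and a broken connection reports how many records were read before it failed.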

pcattori commented 5 years ago

Closing as a duplicate of #281