Closed djouallah closed 5 hours ago
Hi @djouallah!
Were there any changes here that could have triggered this regression in behavior of your data pipeline? I'm curious specifically about:
- Version of `getdaft`
- Version of `pyarrow`
- Any environmental changes in your storage, credentials etc
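For anyone hitting something similar, one quick way to collect the versions asked about above is a small standard-library script. This is only a sketch: the helper name `installed_versions` is made up for illustration, and the package list is just the ones mentioned in this thread.

```python
from importlib.metadata import PackageNotFoundError, version
import sys


def installed_versions(packages):
    """Return {package name: installed version, or None if not installed}.

    Hypothetical helper for illustration; not part of Daft.
    """
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    # The Python runtime version matters too, since runtimes get upgraded silently.
    print("python", sys.version.split()[0])
    for name, ver in installed_versions(["getdaft", "pyarrow", "deltalake"]).items():
        print(name, ver or "not installed")
```

Running it before and after a runtime upgrade makes it easier to see which dependency actually moved.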
I tried multiple versions of getdaft and hit the same issue. It seems the runtime was upgraded to Python 3.11 and pyarrow to version 14. I use deltalake 0.17.4 because the Rust writer doesn't support writing in batches; it loads everything into memory.
Gotcha. I'm guessing that this isn't actually a regression with Daft given that switching the version of Daft doesn't solve it.
2 suggestions:
Actually, you are right: it seems the issue was pyarrow 14. Updating pyarrow fixed the issue.
Thanks for confirming @djouallah !
After further testing, I think I found the issue: the new parallel CSV reader seems to be a bit unstable when reading from and writing to different partitions. The workaround for me is to pin the daft version to 0.3.9.
Hi @djouallah, sorry to hear about the issues with the new CSV reader. It does max out concurrency, which may be hammering IO too hard and causing the issues that you see.
We're going to add some rate limiting like we're doing with the parallel parquet reader here which might alleviate this.
In the meantime I would like to make a repro to see if we can help you sooner. Could you describe the workload and data sizes that are leading to the errors?
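To illustrate the general idea of limiting concurrency (this is not Daft's implementation, just a minimal sketch), a bounded thread pool caps how many reads are in flight at once. The names `read_all`, `read_one`, and `max_in_flight` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def read_all(paths, read_one, max_in_flight=8):
    """Apply read_one to every path, with at most max_in_flight reads running at once.

    Hypothetical helper for illustration; results are returned in input order.
    """
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        return list(pool.map(read_one, paths))
```

With a few thousand files, a small `max_in_flight` trades some throughput for less pressure on the storage layer.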
Same as this, but my data is around 2300 files, not 60:
https://colab.research.google.com/drive/1HRbkztwjAhHR6bAQQIsAlLPaZNqs9eVG#scrollTo=PccFouvE6N9w
Maybe it is not a daft issue after all. I am now writing directly to abfss and it works great, so maybe the issue only appears when using mounted storage.
Describe the bug
A pipeline using daft has stopped working; no idea what changed.
To Reproduce
I don't know if this is useful.
Expected behavior
No response
Component(s)
Python Runner
Additional context
No response