Bears-R-Us / arkouda

Arkouda (αρκούδα): Interactive Data Analytics at Supercomputing Scale :bear:

Inconsistent parquet read/write times #2736

Open stress-tess opened 1 year ago

stress-tess commented 1 year ago

Users have noticed that Parquet write times can be wildly inconsistent: the same type of write task may take a few seconds on one run and a few minutes on another.

User-provided summary of the problem:

I've noticed that when writing large datasets to Parquet, the per-locale files don't all get written in parallel like I'd expect. Instead, a few of the files are created and grow to their full size, then a few more are created and grow, then a few more, and so on. The overall write operation takes much longer than if all the files were written in parallel.

In Arkouda's Chapel code, the write1DDistStringsAggregators function (used for writing string columns) has a "gather" step to copy string data from other nodes into local memory before calling the Apache Arrow library function to actually write the Parquet file. I don't know Chapel, but I'm guessing that when a process is running non-Chapel code like the Arrow library calls, it might not respond to requests from sibling processes until it returns to running Chapel code. If so, that creates a race condition: when a node finishes gathering its string data and calls Arrow to write the file, it stops responding to other nodes that are still gathering strings and need data from this one; those processes end up having to wait until this node finishes writing its files (and returns to Chapel) before they can start writing theirs. I could be completely wrong about this, but it's plausible and would explain the behavior I've observed.

This could be resolved by putting a barrier step between gathering and writing, to ensure all nodes have finished gathering remote string data before any Arrow calls occur. (I assume Chapel supports barriers, since they're a standard concurrency primitive.)

I've also noticed a large variation in time when reading large datasets from Parquet: reading the same data is sometimes much slower than other times, like 7 minutes instead of 1. Maybe there's something similar happening in the Arrow calls that load from Parquet?
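For illustration only, here is a minimal Chapel sketch of the barrier idea described above, not Arkouda's actual write1DDistStringsAggregators code. It assumes the standard Collectives module (older Chapel versions provide the same functionality via the Barriers module) and uses placeholder work in place of the real gather and Arrow write steps:

```chapel
use Collectives;

// One synchronization point shared by all locales: no locale may start its
// write phase until every locale has finished its gather phase.
var b = new barrier(numLocales);

coforall loc in Locales do on loc {
  // Phase 1: gather remote string data into local memory
  // (placeholder for the real gather step).
  writeln("locale ", here.id, ": gathering");

  // Wait until all locales have completed the gather phase, so no locale
  // is stuck in a non-yielding external call while siblings still need
  // data from it.
  b.barrier();

  // Phase 2: only now call into the external library to write this
  // locale's file (placeholder for the real Arrow write step).
  writeln("locale ", here.id, ": writing");
}
```

In Arkouda itself, the two phases would correspond to the existing gather and Arrow write steps; the change the summary proposes is essentially the single barrier call between them.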

bmcdonald3 commented 11 months ago

Do we know if these were dataframes or individual string columns being written?