biocypher / open-targets


Explore the possibility of removing the Spark and Hadoop dependencies #17

Open kpto opened 4 weeks ago

kpto commented 4 weeks ago

Currently the project uses pyspark, which depends on Spark and Hadoop. These dependencies do not run natively on Windows and make the script not Windows friendly; see #13 for details. It seems that pyspark is used only for reading Parquet files rather than for actual distributed computing. To make the script more OS-agnostic, I suggest finding an alternative cross-platform solution for reading Parquet files.

To my knowledge, pyspark uses pyarrow under the hood to read Parquet, so pandas could perhaps replace pyspark, as it can optionally use pyarrow to read Parquet. However, the code uses pyspark SQL-style queries to pre-process the data, which makes the substitution less straightforward. Further investigation is needed.
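
For illustration, a minimal sketch of what the pandas side could look like, assuming a local Parquet file; the path and column name here are hypothetical, and any pyspark SQL-style filters would need to be re-expressed as pandas expressions:

```python
import pandas as pd

# Hypothetical path; pandas delegates the actual Parquet reading
# to pyarrow when the pyarrow engine is selected.
df = pd.read_parquet("targets/part-00000.parquet", engine="pyarrow")

# A pyspark SQL-style filter such as df.filter("biotype = 'protein_coding'")
# would become a pandas boolean mask (column name is illustrative):
filtered = df[df["biotype"] == "protein_coding"]
```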

laulopezreal commented 2 weeks ago

Hi @kpto,

This is a very valid point. Have you come across Polars? https://docs.pola.rs/ It is a pretty good alternative to pandas, developed in Rust, and it is gaining traction within the Python data science community. It is open source (though the founder has started a company to commercialise products built on top of the library).

Not sure how it compares to pyspark, but it claims up to 10x speedups over pandas for big-data processing in Python.

This is a very cool podcast in which they interview the co-founder. Worth watching/listening to (it is also available for free on Spotify): https://www.youtube.com/watch?v=TZeTO-CJPns

kpto commented 2 weeks ago

Hi @laulopezreal, thank you for your input, and apologies for not responding promptly; I missed your comment here.

Our need is actually simple. Since BioCypher processes data row by row, we just need a package that can stream rows into the Python runtime without loading the whole dataset into memory. It turns out to be surprisingly difficult to find one. I have tested Polars, but its streaming feature is still experimental and it is unclear to me how to actually iterate rows one by one. Please see #25, which also references an upstream issue showing that Polars is not ready for this. Their streaming engine is still improving and I look forward to its development. Roughly what I tried is sketched below.

In the end, duckdb is the one that does it best. Its memory control is excellent and the streaming overhead is reasonable, so I plan to use duckdb. I have also tried pyarrow, but it lacks query operators and its memory behaviour is not comprehensible to me.
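
A minimal sketch of the duckdb approach, assuming an illustrative file path and a hypothetical process_row() standing in for BioCypher's per-row handling:

```python
import duckdb

con = duckdb.connect()

# read_parquet scans the files lazily; rows are fetched on demand,
# so the dataset is never fully loaded into memory.
result = con.execute("SELECT * FROM read_parquet('targets/*.parquet')")

# Stream rows one at a time into the Python runtime.
while (row := result.fetchone()) is not None:
    process_row(row)  # hypothetical per-row handler for BioCypher
```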