aphp / eds-scikit

eds-scikit is a Python library providing tools to process and analyse OMOP data
https://aphp.github.io/eds-scikit
BSD 3-Clause "New" or "Revised" License
35 stars 5 forks source link

New arrow connector to save locally #37

Closed Thomzoy closed 1 year ago

Thomzoy commented 1 year ago

Description

The current method to save a subcohort locally relies on the toPandas method of PySpark, which isn't efficient and prone to errors.
This PR implements a pyarrow connector which reads the parquet file directly

Checklist

codecov-commenter commented 1 year ago

Codecov Report

Patch coverage: 100.00% and project coverage change: +0.01 :tada:

Comparison is base (6ec366e) 91.88% compared to head (666b6ad) 91.90%.

:mega: This organization is not using Codecov’s GitHub App Integration. We recommend you install it so Codecov can continue to function properly for your repositories. Learn more

Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #37 +/- ## ========================================== + Coverage 91.88% 91.90% +0.01% ========================================== Files 67 68 +1 Lines 2145 2149 +4 ========================================== + Hits 1971 1975 +4 Misses 174 174 ``` | [Impacted Files](https://codecov.io/gh/aphp/eds-scikit/pull/37?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=aphp) | Coverage Δ | | |---|---|---| | [eds\_scikit/io/arrow.py](https://codecov.io/gh/aphp/eds-scikit/pull/37?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=aphp#diff-ZWRzX3NjaWtpdC9pby9hcnJvdy5weQ==) | `100.00% <100.00%> (ø)` | | | [eds\_scikit/io/hive.py](https://codecov.io/gh/aphp/eds-scikit/pull/37?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=aphp#diff-ZWRzX3NjaWtpdC9pby9oaXZlLnB5) | `100.00% <100.00%> (ø)` | | Help us with your feedback. Take ten seconds to tell us [how you rate us](https://about.codecov.io/nps?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=aphp). Have a feature suggestion? [Share it here.](https://app.codecov.io/gh/feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=aphp)

:umbrella: View full report at Codecov.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.

Vincent-Maladiere commented 1 year ago

This PR will probably fix https://github.com/aphp/eds-scikit/issues/36

Thomzoy commented 1 year ago

Hi, thanks for the PR!

The main point I'd like to address is that I don't quite see the benefit of using a class since the same result could be achieved with a write_batches_to_hdfs() utils function.

To do so, __init__, get_pd_fragment, and get_pd_table could be merged into one single function.

What do you think?

Yep good call !