Consider Switching Pandas to Polars

avalentino / s1isp

Sentinel-1 Instrument Source Packets decoder

Apache License 2.0


Consider Switching Pandas to Polars #3

Closed sirbastiano closed 4 months ago

sirbastiano commented 4 months ago

Dear Antonio, as efficiency is our top priority, I suggest switching to polars: https://pola.rs/

Let me know your thoughts on this.

Article: https://www.datacamp.com/tutorial/high-performance-data-manipulation-in-python-pandas2-vs-polars?dc_referrer=https%3A%2F%2Fduckduckgo.com%2F

avalentino commented 4 months ago

Dear @sirbastiano, thanks a lot for the suggestion. I didn't know polars, but it looks very interesting. I need to carefully read the documentation and the link you shared.

The use of pandas in this project is very limited: only the dump_records function uses it, and it could actually be a totally optional dependency. I also use it in the example Jupyter notebook, if I remember correctly, but mostly to get a nice representation of tabular data.

Considering that pandas is more widely used than polars, I would tend to have an optional dependency on both of them, so that users can choose whichever they prefer.

Would you consider submitting a small PR going in this direction? The main change would be to add a third output_format option in dump_records.
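To make the optional-dependency idea concrete, here is a minimal sketch of how an output format switch could import pandas or polars lazily. It is not the s1isp implementation: the helper name `records_to_table`, its `output_format` values and the plain-dict fallback are all assumptions made for illustration, and the real dump_records signature may differ.

```python
import importlib


def records_to_table(records, output_format="dict"):
    """Convert a list of header records (dicts) into a table.

    Hypothetical helper for illustration only, not part of s1isp.
    Assumes every record has the same keys.
    """
    if output_format == "dict":
        # Column-oriented dict: needs no third-party package.
        columns = {}
        for record in records:
            for key, value in record.items():
                columns.setdefault(key, []).append(value)
        return columns

    if output_format in ("pandas", "polars"):
        # Import the backend only when it is requested, so that neither
        # package has to be a hard dependency.
        try:
            backend = importlib.import_module(output_format)
        except ImportError as exc:
            raise ImportError(
                f"output_format={output_format!r} requires the optional "
                f"{output_format!r} package"
            ) from exc
        # Both pandas.DataFrame and polars.DataFrame accept a list of dicts.
        return backend.DataFrame(records)

    raise ValueError(f"unknown output_format: {output_format!r}")


records = [{"counter": 1, "swath": 10}, {"counter": 2, "swath": 11}]
table = records_to_table(records)  # plain dict, always available
```

With this layout, a user who has only one of the two libraries installed would pass `output_format="pandas"` or `output_format="polars"`, and would get an explicit ImportError for the one that is missing.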

sirbastiano commented 4 months ago

Well, it is a matter of choice. Potentially, you could store the echo data row by row in the dataframe and run computations on that (imagine range compression row by row).

Polars is much better at that: it uses parallelization and makes efficient use of the hardware.

We could potentially keep both, as optional dependencies.
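For readers unfamiliar with the term, the "range compression row by row" mentioned above can be sketched as a matched filter applied independently to each echo line. The example below is a rough, standard-library-only illustration under stated assumptions: the function names, the synthetic chirp and the time-domain correlation are chosen for clarity and are not taken from s1isp or polars. A real processor would typically work in the frequency domain on arrays.

```python
import cmath
import math


def make_chirp(n_samples, rate):
    """Unit-amplitude linear FM chirp, exp(j*pi*rate*t^2), with t in samples
    and centred on the middle of the pulse (illustrative replica only)."""
    half = (n_samples - 1) / 2
    return [cmath.exp(1j * math.pi * rate * (k - half) ** 2)
            for k in range(n_samples)]


def range_compress_row(row, replica):
    """Correlate one echo row with the complex conjugate of the replica
    (time-domain matched filter, 'valid' output length)."""
    conj_replica = [c.conjugate() for c in replica]
    n_out = len(row) - len(replica) + 1
    return [
        sum(row[n + k] * conj_replica[k] for k in range(len(conj_replica)))
        for n in range(n_out)
    ]


def range_compress(rows, replica):
    """Apply the matched filter to each row independently."""
    return [range_compress_row(row, replica) for row in rows]


# Synthetic echo line: a single point target whose echo starts at sample 10.
replica = make_chirp(16, 0.02)
row = [0j] * 64
row[10:26] = replica
compressed = range_compress([row], replica)[0]
peak_index = max(range(len(compressed)), key=lambda n: abs(compressed[n]))
```

Because each row is processed independently of the others, this is the kind of workload that can be spread over threads or a GPU, which is the point being made in the comment.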

avalentino commented 4 months ago

Ah sorry, I thought that you were talking about header data. For echo data, the long-term idea is to use a format that goes in the direction of the one being developed in the CGS re-engineering, based on zarr. But I need to think a little bit more about it. I'm open to discussion, of course.

sirbastiano commented 4 months ago

My idea is to exploit GPU and thread parallelism.

By the way, are you at ESRIN?

Let's have a coffee together!

On Tue, 2 Jul 2024 at 14:30, Antonio Valentino < @.***> wrote:

Ah sorry, I thought that you were talking about header data. For echo data, the long-term idea is to use a format that goes in the direction of the one being developed in the CGS re-engineering, based on zarr. But I need to think a little bit more about it. I'm open to discussion, of course.


avalentino commented 4 months ago

Can this be closed now?

sirbastiano commented 4 months ago

Yes

On Fri, 5 Jul 2024 at 00:49, Antonio Valentino < @.***> wrote:

Can this be closed now?
