alchemistry / alchemlyb

the simple alchemistry library
https://alchemlyb.readthedocs.io
BSD 3-Clause "New" or "Revised" License
189 stars 49 forks source link

parquet parser for dataframe serialisation #317

Closed xiki-tempula closed 1 year ago

xiki-tempula commented 1 year ago

This PR add a parquet parser, which will load the serialised dataframe and permits the usage of

    from alchemlyb.parsing.parquet import extract_dHdl, extract_u_nk
    import pandas as pd

    u_nk.to_parquet(path='u_nk.parquet', index=True)
    dHdl.to_parquet(path='dHdl.parquet', index=True)

    new_u_nk = extract_u_nk('u_nk.parquet', T=300)
    new_dHdl = extract_dHdl('dHdl.parquet', T=300) 

Fix https://github.com/alchemistry/alchemlyb/issues/316

codecov[bot] commented 1 year ago

Codecov Report

Merging #317 (7347a9c) into master (4e590cc) will increase coverage by 0.01%. The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #317      +/-   ##
==========================================
+ Coverage   98.74%   98.75%   +0.01%     
==========================================
  Files          26       27       +1     
  Lines        1754     1772      +18     
  Branches      382      387       +5     
==========================================
+ Hits         1732     1750      +18     
  Misses          2        2              
  Partials       20       20              
Impacted Files Coverage Δ
src/alchemlyb/parsing/parquet.py 100.00% <100.00%> (ø)
src/alchemlyb/workflows/abfe.py 99.67% <100.00%> (+<0.01%) :arrow_up:
xiki-tempula commented 1 year ago

@orbeckst Do you mind have a review of this PR, please? Thank you.

xiki-tempula commented 1 year ago

@orbeckst Thanks for the review. I have change the environment.yml. I will raise an issue with meta.yml when we are doing a release.

So the T problem is a bit tricky in that parquet serialisation doesn't preserve the df.attrs. So the dataframe loaded here doesn't contain temperature and the temperature is assigned via extract function. I will add a not to state that.

orbeckst commented 1 year ago

So the T problem is a bit tricky in that parquet serialisation doesn't preserve the df.attrs. So the dataframe loaded here doesn't contain temperature and the temperature is assigned via extract function. I will add a not to state that.

Ok, I blithely assumed that parquet would save everything — then T is not optional. Definitely document this requirement.

xiki-tempula commented 1 year ago

@orbeckst

Ok, I blithely assumed that parquet would save everything

I'm annoyed by this as well but it seems that parquet is still the best serialiser. This is the only format that preserves index besides to_pickle. Also it faithfully preserve all the data in its original datatype. The to_pickle, though it preserves everything, is slow to read and write and also would not be safe between different versions of pandas.

Definitely document this requirement.

I have added this as a note to both alchemlyb.parsing.parquet.extract_u_nk and alchemlyb.parsing.parquet.extract_dHdl.

orbeckst commented 1 year ago

Let me know when I need to review again.

xiki-tempula commented 1 year ago

@orbeckst I addressed the comments. Do you mind having another review? Thank you.

orbeckst commented 1 year ago

Please squash-merge when ready.