LinkedEarth / pylipd

Development repository for Python LiPD utilities
https://pylipd.readthedocs.io/en/latest/
Apache License 2.0

Easily filter out time-related paleoData_variable instances #49

Closed. CommonClimate closed this issue 1 year ago

CommonClimate commented 1 year ago

When loading LiPD files, PyLiPD (like the LiPD utilities before it) currently saddles the lipd object with a bunch of utterly useless entries. For instance, run this snippet:

from pylipd.utils.dataset import load_dir
lipd = load_dir(name='Pages2k')
df = lipd.get_timeseries_essentials()
df['paleoData_variableName']

We see that approximately half of the entries are "year". The corresponding series is then basically "x = year, y = year", which only serves to confuse users and waste RAM. There needs to be a way to banish these series. Either:

  1. implement a rule in get_timeseries_essentials() that drops a series when its lowercased paleoData_variableName is "year" or "age". If you are worried about discarding potentially valuable series, this could at the very least be controlled by a boolean flag set to False by default.
  2. implement a function called paleoData_cleanup() that filters either the lipd object or the dataframe obtained from get_timeseries_essentials()
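
To make option 2 concrete, here is a minimal sketch of what such a filter could look like when applied to the dataframe. It is not the PyLiPD implementation: the function paleoData_cleanup(), its parameters, and the toy dataframe are all hypothetical, and the only thing taken from the snippet above is the paleoData_variableName column. It uses an exact, case-insensitive match rather than a substring test, on the assumption that a substring test on "age" would also discard names such as "percentage".

```python
import pandas as pd


def paleoData_cleanup(df, column="paleoData_variableName",
                      time_names=("year", "age")):
    """Return a copy of df without rows whose variable name is a time axis.

    Hypothetical helper: the name, the parameters, and the default list of
    time-like names are assumptions for illustration, not part of PyLiPD.
    """
    # Normalize the names so that "Year", " year " and "YEAR" all match.
    names = df[column].astype(str).str.strip().str.lower()
    # Exact match only, so that e.g. "percentage" is not caught by "age".
    is_time = names.isin([n.lower() for n in time_names])
    return df.loc[~is_time].reset_index(drop=True)


# Toy dataframe standing in for the output of get_timeseries_essentials().
demo = pd.DataFrame({
    "dataSetName": ["A", "A", "B", "B", "C"],
    "paleoData_variableName": ["year", "d18O", "Age", "temperature", "percentage"],
})
cleaned = paleoData_cleanup(demo)
print(cleaned["paleoData_variableName"].tolist())
```

With the real data, the call would presumably be paleoData_cleanup(lipd.get_timeseries_essentials()). The time_names argument is there so that other time-like names can be added without touching the function.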

I would view it as one of the main improvements of PyLiPD over its predecessor if it eliminated the need to manually remove these garbage series.

khider commented 1 year ago

Done