CADWRDeltaModeling / dms_datastore

Data download and management tools for continuous data for Pandas. See documentation https://cadwrdeltamodeling.github.io/dms_datastore/
MIT License

read_ts should respect flags when reading dms formatted data #20

Closed dwr-psandhu closed 11 months ago

dwr-psandhu commented 1 year ago

The read_ts method is used to read the formatted and screened formats. The screened format allows for a user flag column that marks "bad" data (missing or bad are the only values). read_ts should use this flag to screen out the flagged values. In addition, read_ts should have an option to show the screened values when necessary.

water-e commented 11 months ago

There are really two things here. The part where we apply user_flags and report the flagged steps as NaN, like the other readers, is now done, so please test. However, the existing behavior never had an option to return flags or to leave data flagged as bad unmasked.

Not doing the masking would be a simple addition, though not one I would want a lot of people to use for most other formats. You would add a conditional switch around the code near line 1040:

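            # Optionally treat a blank (NaN) flag as acceptable
            if blank_qaqc_good: qaqc_accept += [np.nan]
            try:
                # Mask every value whose flag is not in the accepted list
                dset.loc[~dset[:,qaqc_selector].isin(qaqc_accept), selector] = np.nan
            except Exception:
                # Fall back to masking one value/flag column pair at a time
                for v,f in zip(selector,qaqc_selector):
                    dset.loc[~dset[f].isin(qaqc_accept), v] = np.nan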
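
As a rough sketch of what such a switch could look like, here is a standalone version of the per-column masking loop. The function name `apply_user_flags` and the `mask_flagged` parameter are hypothetical and not part of dms_datastore; the flag values "0" (accepted) and "1" (bad) in the usage lines are also assumptions made only for illustration.

```python
import numpy as np
import pandas as pd


def apply_user_flags(dset, selector, qaqc_selector, qaqc_accept,
                     blank_qaqc_good=True, mask_flagged=True):
    """Return a copy of dset with flagged values replaced by NaN.

    selector      : list of value column names
    qaqc_selector : list of flag column names, paired with selector
    qaqc_accept   : list of flag values that count as good
    mask_flagged  : hypothetical switch; False returns the data unmasked
    """
    out = dset.copy()
    if not mask_flagged:
        # Caller asked to see the screened values, so skip masking entirely
        return out
    for v, f in zip(selector, qaqc_selector):
        ok = out[f].isin(list(qaqc_accept))
        if blank_qaqc_good:
            # A blank (NaN) flag is treated as acceptable
            ok = ok | out[f].isna()
        out.loc[~ok, v] = np.nan
    return out


# Usage with assumed flag values: "0" is good, "1" is bad, blank is unflagged
df = pd.DataFrame({"value": [1.0, 2.0, 3.0],
                   "user_flag": ["0", "1", np.nan]})
masked = apply_user_flags(df, ["value"], ["user_flag"], ["0"])
unmasked = apply_user_flags(df, ["value"], ["user_flag"], ["0"],
                            mask_flagged=False)
```

The sketch works on a copy rather than in place so the unmasked frame stays available to the caller; the real reader may well prefer to modify `dset` directly.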

The tedious step is adding it as an option in csv_retrieve_ts and then working through all the readers. That is a pain for something that has a default, so I'm working on a class-based implementation of read_ts. That will make it easier to add new functionality without sending the plumbing all the way up the call chain.

water-e commented 11 months ago

Adding an option to retrieve user_flags is more fraught. I'm not sure I want this to be an option for the provider flags, and a design that supports it, perhaps with a unified flag, sounds like the convoluted logic I'd like to avoid.

I wonder if it would be more pragmatic to write a different reader. I have a writer. Reading our format is incredibly easy: practically nothing except maybe dtype={"user_flag": str}.
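
For illustration, a separate reader along those lines might be little more than a pandas.read_csv call. This is only a sketch under assumptions: the function name `read_with_flags`, the `#`-prefixed header lines, and the `datetime` and `value` column names are hypothetical, and only the `user_flag` dtype comes from the discussion above.

```python
import io

import pandas as pd


def read_with_flags(source, comment="#"):
    """Read a flagged CSV and keep the user_flag column as-is (no masking).

    Assumes the first column is the timestamp and that header metadata
    lines start with the comment character.
    """
    return pd.read_csv(
        source,
        comment=comment,            # skip assumed '#' header lines
        index_col=0,                # first column becomes the index
        parse_dates=[0],            # parse it as datetimes
        dtype={"user_flag": str},   # keep flags as strings, blanks as NaN
    )


# Made-up sample in the assumed layout
sample = """# hypothetical header line
datetime,value,user_flag
2020-01-01 00:00,1.5,
2020-01-01 00:15,9999.0,1
2020-01-01 00:30,1.7,
"""
flagged = read_with_flags(io.StringIO(sample))
```

Reading the flag as a string keeps a blank distinguishable from a flag value, so the caller can decide what to mask.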