USGS-CMG / stglib

Routines used by the USGS Coastal/Marine Hazards & Resources Program to process oceanographic time-series data

Capability to do qaqc without recreating -raw.cdf file #87

Open ssuttles-usgs opened 2 years ago

ssuttles-usgs commented 2 years ago

Presently in the stglib workflow, any qaqc actions on data variables are specified in the config.yaml file, which is ingested as an argument at the first processing step, where the raw instrument data are read and written to a raw.cdf file. It would be desirable to be able to specify qaqc actions at later steps in the process, so that the raw.cdf file would not need to be recreated each time. One idea that has been discussed would be to allow a new qaqc.yaml file, containing qaqc actions, as an optional argument at the step(s) where the .nc files for data release are generated (e.g., runexocdf2nc.py). This could be implemented in a similar way to the optional atmospheric pressure correction argument (--atmpres) that is used to correct submerged pressure data for changes in local atmospheric pressure.
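To make the proposal concrete, here is a minimal sketch of how an optional argument could sit alongside `--atmpres` in a cdf2nc-style script. This is not the actual stglib parser: the `--qaqc` flag, the `build_parser` helper, and the file names are assumptions for illustration only.

```python
import argparse


def build_parser():
    """Build a parser for a hypothetical cdf2nc-style script (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Convert a raw .cdf file to a clean .nc file"
    )
    parser.add_argument("cdfname", help="raw .cdf file name")
    # Optional argument mentioned in the issue text.
    parser.add_argument("--atmpres", help="atmospheric pressure correction file")
    # Hypothetical new optional argument proposed in this issue.
    parser.add_argument("--qaqc", help="YAML file with QA/QC actions for this step")
    return parser


# Example invocation with placeholder file names.
args = build_parser().parse_args(["exo-raw.cdf", "--qaqc", "qaqc.yaml"])
print(args.cdfname, args.atmpres, args.qaqc)
```

Because both options default to `None`, the script could skip the extra QA/QC pass entirely when `--qaqc` is not given, leaving the existing workflow unchanged.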

dnowacki-usgs commented 11 months ago

I think we could keep everything in the same yaml file and read it in again, as suggested, using an additional command-line argument. We would want to check when re-reading the file to make sure the new values are not different from what already exists and if so issue a warning (but probably not fail, since there may be some testing of the ideal qaqc cutoff values).
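A rough sketch of the check described above, assuming the existing global attributes and the re-read YAML contents are both available as plain dictionaries of scalar or string values. The function name `update_attrs` and the attribute names are placeholders, not stglib code.

```python
import warnings


def update_attrs(existing, new_config):
    """Return a copy of `existing` updated with `new_config`.

    Warns (but does not fail) when a key that already exists gets a new value.
    """
    merged = dict(existing)
    for key, new_value in new_config.items():
        if key in existing and existing[key] != new_value:
            warnings.warn(
                f"{key!r} changed from {existing[key]!r} to {new_value!r}",
                UserWarning,
            )
        merged[key] = new_value
    return merged


# Placeholder attribute names, not actual stglib keys.
raw_attrs = {"instrument_depth": 1.5, "var_min": 0.0}
config = {"var_min": 0.5, "var_max": 30.0}
merged = update_attrs(raw_attrs, config)  # warns about var_min, adds var_max
```

Issuing a warning rather than raising keeps the door open for iterating on cutoff values while still flagging every change to the user.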

ssuttles-usgs commented 11 months ago

That does seem like a potentially good solution within the existing workflow. Do you know if there is a way to work with just the global attributes of an existing netCDF file (read & write)? For very large files, re-writing the entire raw.cdf file can take a very long time.

dnowacki-usgs commented 11 months ago

An xr.open_dataset() should just open the dataset but not load any of the data values. I think this means that accessing the attrs is quick. For writing, I think(?) it has to rewrite the whole file, but I'm not certain about that.

Edit: that seems to be the case for xarray. https://stackoverflow.com/questions/66231575/xarray-appending-or-rewriting-a-existing-nc-file

We'd be reading in the whole CDF in this scenario anyway though, right?

ssuttles-usgs commented 11 months ago

Writing the whole file is the part that can be slow for large files. Since we only need to append some attributes to the global attributes that will invoke the QA/QC we want to perform in the cdf2nc step, I am hoping there is a way to do that without re-writing the whole raw CDF file.

ncatted or other tools in NCO might work better than xarray, but I honestly won't know until trying.

https://nco.sourceforge.net/nco.html#ncatted

https://stackoverflow.com/questions/69043727/how-can-i-add-or-edit-lot-of-global-attributes-with-ncatted

I am happy to save this issue until we are in the QA/QC phase for a very large dataset. I am not there yet!

dnowacki-usgs commented 11 months ago

Probably not necessary to alter the raw CDF if we are loading in the raw and then re-reading the config yaml during cdf2nc. We can then write out just the cleaned nc with the updated attrs, right? They won’t match between raw CDF and clean nc, but I think that’s okay… I hope?

ssuttles-usgs commented 11 months ago

Ok, now I get it. Yes, re-reading the yaml file with a command-line option in the cdf2nc step, with the added qaqc calls, seems like a great solution! Sorry I did not see that was what you meant before. Adding a user warning if any pre-existing keyword values have changed seems like a good idea. This would allow the option of making a change to instrument metadata in the final step if needed. Alternatively, we could allow only keys/values that do not already exist to be added, but I prefer the approach with the flexibility and user warning as you suggest. As for timing, if you would like me to implement this capability, I will probably wait until I start processing datasets that contain very large data files, which I expect will be in about a month. If you want to go ahead and make the change sooner, that would be great too!
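For comparison, the stricter alternative mentioned above (only allowing keys that do not already exist) might look something like the sketch below. The function name `add_new_keys_only` and the attribute names are hypothetical, and attributes are assumed to be plain dictionaries.

```python
def add_new_keys_only(existing, new_config):
    """Add only keys absent from `existing`; report the keys that were skipped."""
    merged = dict(existing)
    skipped = []
    for key, value in new_config.items():
        if key in existing:
            # Pre-existing values are left untouched under this policy.
            skipped.append(key)
        else:
            merged[key] = value
    return merged, skipped


# Placeholder attribute names, not actual stglib keys.
merged, skipped = add_new_keys_only(
    {"instrument_depth": 1.5, "var_min": 0.0},
    {"var_min": 0.5, "var_max": 30.0},
)
```

Under this policy a correction to instrument metadata at the final step would be ignored silently unless the skipped keys are reported, which is one reason the warn-and-overwrite approach may be the more flexible choice.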