GEUS-Glaciology-and-Climate / pypromice

Process AWS data from L0 (raw logger) through Lx (end user)
https://pypromice.readthedocs.io
GNU General Public License v2.0

Write L2 to file and leave all additional variable derivation for the L2toL3 step #255

Closed by BaptisteVandecrux 4 months ago

BaptisteVandecrux commented 5 months ago

The idea of this new version is that:

1) L2 data files are written into level_2/raw and level_2/tx folders by get_l2 (just as was previously done for the L3 data). One consequence is that this low-latency Level 2 tx data can be posted very quickly on THREDDS for showcasing and fieldwork, and processed into BUFR files.

2) L2 tx and raw files are merged using join_l2 (just as was previously done for the L3 data). Resampling to hourly, daily and monthly values is done here, but could be left for a later stage.

3) get_l3 is now a script that loads the merged L2 file and runs process.L2toL3.toL3 (a minimal sketch follows below). This will allow more variables to be derived in toL3 and historical data to be appended once the L3 data is processed.
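
For illustration, a minimal sketch of that step, assuming the merged L2 file is a netCDF file and that toL3 takes an xarray dataset and returns the derived L3 dataset (the file paths and the exact call signature are assumptions, not pypromice's confirmed interface):

import xarray as xr
from pypromice.process.L2toL3 import toL3

l2 = xr.open_dataset("aws-l2/NUK_U/NUK_U_hour.nc")  # hypothetical path to the merged L2 file
l3 = toL3(l2)                                       # derive the additional Level 3 variables
l3.to_netcdf("aws-l3/NUK_U/NUK_U_hour.nc")          # write the Level 3 dataset to file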

PennyHow commented 5 months ago

I just pushed changes to the processing performed in the pypromice.process.aws object, mainly for better compatibility (and also in line with Mads' comments).

Also, I updated:

PennyHow commented 5 months ago

New commit that tackles get_l2. Instead of having separate scripts for get_l2 and get_l3, I think it is better if we just add functionality to get_l3 that allows you to process only to Level 2 if specified.

I've made changes now that mean you can specify the level of processing (and the output file) with the command-line option --level (or -l); for example:

$ get_l3 -c aws-l0/tx/config/NUK_U.toml -i aws-l0/tx -o aws-l2 -l 2

Where -l 2 triggers aws.getL1() and aws.getL2(), and then writes the Level 2 dataset to file.

$ get_l3 -c aws-l0/tx/config/NUK_U.toml -i aws-l0/tx -o aws-l3 -l 3

Where -l 3 triggers aws.getL1(), aws.getL2() and aws.getL3(), and then writes the Level 3 dataset to file.

I have also renamed Baptiste's get_l3 to l2_to_l3 to better represent its functionality. I want to work more on this: pypromice.process.aws should hold all processing and associated handling (e.g. rounding, reformatting, resampling), so the CLI scripts should call on the functions in the pypromice package rather than writing them out separately.

PennyHow commented 5 months ago

Now l2_to_l3 uses the pypromice.process.aws object functionality instead of re-writing the workflow in the CLI script.

https://github.com/GEUS-Glaciology-and-Climate/pypromice/blob/cc6bfb29e6a7f11ade62bad117ddccf176445044/src/pypromice/process/l2_to_l3.py#L55-L69

Effectively, the Level 2 .nc file is loaded directly as the Level 2 property of the aws object (i.e. aws.L2 is set to the loaded netCDF dataset), so we bypass the Level 0 to Level 2 processing steps in the aws object.
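
A rough sketch of that bypass, assuming the class, attribute and method names used in this thread (the constructor arguments and the real interface in l2_to_l3.py may differ):

import xarray as xr
from pypromice.process.aws import AWS  # class name assumed from the module path used in this thread

aws = AWS("aws-l0/tx/config/NUK_U.toml", "aws-l0/tx")      # hypothetical constructor call
aws.L2 = xr.open_dataset("aws-l2/tx/NUK_U/NUK_U_hour.nc")  # load the L2 .nc file directly, skipping L0-to-L2
aws.getL3()                                                # derive Level 3 variables from the loaded L2 dataset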

The only thing I am not sure about is the efficiency of re-loading the netcdf file. I'm curious to see what @ladsmund thinks about this. Please feel free to revert the change if you have something else in mind.

BaptisteVandecrux commented 4 months ago

Hi @PennyHow ,

Thanks for the suggestions.

Although your get_l3 is more elegant on the code side, I'm afraid we need to split things into different scripts because of operational constraints:

I have tried to summarize these constraints and drafted an l3_processor.sh that would use these functionalities (see the attached diagram).

I hope that also clarifies the level definitions for @ladsmund.

As a side note, since we'll need to update l3_processor.sh, I find the latest version (where functionalities are split into separate shell files) harder to read and debug (more chances for I/O errors when calling another file).

ladsmund commented 4 months ago

I also wondered about the purpose of the get_l3 script. As I see it, it is a script that processes all data from an AWS through the pipeline. So far, it has only handled a single data source, such as tx or raw.

It is not clear to me whether we are on the same page with respect to the processing pipeline and how to invoke which steps. As I see it, the current pipeline is something like:

Version 1

  1. getL0tx
  2. getL3 tx_files
  3. generate and publish BUFR files
  4. getL3 raw_files
  5. joinL3 (this includes resampling to hourly, daily and monthly values)
  6. Publish to fileshare

After this pull request, it could be something like:

Version 2

  1. getL0tx
  2. getL2 tx_files
  3. generate and publish BUFR files
  4. getL2 raw_files
  5. joinL2 l2_tx_files l2_raw_files
  6. getL3 l2_joined_files
  7. Publish to fileshare

Or, if we still need the L3 tx files for backwards compatibility:

Version 3

  1. getL0tx
  2. getL2 tx_files
  3. generate and publish BUFR files
  4. getL3 l2_tx_files
  5. getL2 raw_files
  6. joinL2 l2_tx_files l2_raw_files
  7. getL3 l2_joined_files
  8. Publish to fileshare

And maybe also with the historical data
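
To make this concrete, here is a hypothetical driver for the Version 2 flow, calling the CLI entry points named above via subprocess. Apart from the -c/-i/-o options shown for get_l3 earlier in this thread, the arguments below (especially for join_l2 and the final L3 step) are illustrative placeholders, not the scripts' real interfaces.

import subprocess

def step(cmd):
    """Echo and run one pipeline step, aborting on the first failure."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

step(["get_l2", "-c", "aws-l0/tx/config/NUK_U.toml", "-i", "aws-l0/tx", "-o", "aws-l2/tx"])
# ... generate and publish BUFR files from the tx Level 2 output ...
step(["get_l2", "-c", "aws-l0/raw/config/NUK_U.toml", "-i", "aws-l0/raw", "-o", "aws-l2/raw"])
step(["join_l2", "aws-l2/tx/NUK_U_hour.nc", "aws-l2/raw/NUK_U_hour.nc", "-o", "aws-l2/merged"])  # arguments assumed
step(["get_l2tol3", "-i", "aws-l2/merged/NUK_U_hour.nc", "-o", "aws-l3"])                        # arguments assumed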

ladsmund commented 4 months ago

I am not sure how we should interpret L3 vs Level 3, both now and in the future.

I think Baptiste has a good point about using site vs station names as references for datasets.

I could imagine an approach where we have multiple types of l3 products:

Data source level

Station level

Site level

PennyHow commented 4 months ago

> we need a get_l2 that processes transmissions from L0 to an L2 file, which can then be used by the BUFR processing

I added flexibility into the get_l3 function to also allow for L0 to L2 processing. This can be defined with the -l/--level option set to 2, like so:

$ get_l3 -c aws-l0/tx/config/NUK_U.toml -i aws-l0/tx -o aws-l2 -l 2

So yes, it is still called "get_l3" currently, but can also be used for the functionality you described in get_l2. I'm happy to rename it. I just want to keep the option of L0-to-L3 processing for development or other uses that we may need in the future.

The reason I am hesitant about your configuration is that a lot of the post-processing functionality (i.e. rounding, re-formatting, resampling) is re-written from the pypromice.process.aws module into the CLI script. Having two instances of this functionality means we have to update and maintain it in both places, which is a lot more work for us.

> Version 3
>
>   1. getL0tx
>   2. getL2 tx_files
>   3. generate and publish BUFR files
>   4. getL3 l2_tx_files
>   5. getL2 raw_files
>   6. joinL2 l2_tx_files l2_raw_files
>   7. getL3 l2_joined_files
>   8. Publish to fileshare

So currently this is what I think we are aiming for. With the current pypromice modules/CLI scripts, it should look like this:

I think we are close!

PennyHow commented 4 months ago

There are a lot of new commits here now, but most of them are associated with debugging a new Action for testing the get_l2 and get_l2tol3 CLI scripts.

Main changes

  1. pypromice.process.write contains all file writing functions (writeAll(), writeCSV(), writeNC(), getColNames())
  2. pypromice.process.resample contains all resampling functions (resample_dataset(), calculateSaturationVaporPressure())
  3. pypromice.process.test contains all unit tests, i.e. the TestProcess class
  4. pypromice.process.utilities contains all formatting, dataset populating, and metadata handling functions (roundValues(), reformat_time(), reformat_lon(), popCols(), addBasicMeta(), populateMeta(), addVars(), addMeta())
  5. pypromice.process.load contains all configuration, data and metadata loading functions (getConfig(), getL0(), getVars(), getMeta()); see the import sketch below
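
As a quick orientation, the imports below show where things now live, using only the module and function names listed above (signatures not shown):

# Import map of the reorganised pypromice.process subpackage, based on the list above.
from pypromice.process.write import writeAll, writeCSV, writeNC, getColNames
from pypromice.process.resample import resample_dataset, calculateSaturationVaporPressure
from pypromice.process.utilities import roundValues, reformat_time, reformat_lon, popCols, addBasicMeta, populateMeta, addVars, addMeta
from pypromice.process.load import getConfig, getL0, getVars, getMeta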

To-do

I also want to see what you guys think about these changes. Please feel free to modify.

BaptisteVandecrux commented 4 months ago

So the updated structure is shown in the attached diagram.

I have now made an update of aws-operational-processing that uses the functions from this PR.

The code has been running on glacio01 and posting the level_2 and level_3 on GitHub (if cloning level_3, make sure to use --depth 1).

The comparison between aws-l3-dev (new) and aws-l3 (old) is available as time series plots or as scatter plots. All variables are identical except q_h_u, q_h_l, dshf_h_u, dshf_h_l, dlhf_h_u and dlhf_h_l, because in the previous version they were calculated from 10-minute data and then averaged, while now they are calculated from hourly averages directly.
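
A toy illustration (not pypromice code) of why those variables differ: for a nonlinear derived quantity, such as the saturation vapour pressure used in the humidity and flux calculations, averaging 10-minute derived values over the hour is not the same as deriving the quantity from hourly-averaged inputs.

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
t = pd.date_range("2024-06-01", periods=6 * 24, freq="10min")
ta = pd.Series(5 + 3 * rng.standard_normal(len(t)), index=t)  # synthetic 10-minute air temperature (degC)

def es_hpa(T):
    """Saturation vapour pressure (hPa) from temperature (degC), Buck-type formula."""
    return 6.1121 * np.exp(17.502 * T / (T + 240.97))

old = es_hpa(ta).resample("h").mean()  # previous behaviour: derive at 10 min, then average to hourly
new = es_hpa(ta.resample("h").mean())  # current behaviour: average to hourly first, then derive
print((old - new).abs().max())         # small but non-zero difference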

I'll be rebasing the downstream PRs onto this one, and we can take the discussion to the next one, which is #252.

PennyHow commented 4 months ago

Looks good @BaptisteVandecrux. Do you want us to begin looking at #252 and adding comments, or do you have some things to do on it beforehand?

BaptisteVandecrux commented 4 months ago

> Looks good @BaptisteVandecrux. Do you want us to begin looking at #252 and adding comments, or do you have some things to do on it beforehand?

I need to adapt the scripts first now that I have a clearer idea about the structure. I'll let you know when it's ready!