I just pushed changes to the processing performed in the `pypromice.process.aws` object, mainly for better compatibility (and also in line with Mads' comments):

- Post-processing steps (e.g. rounding, reformatting, resampling) are no longer performed at the end of Level 2 processing, as this runs the risk of affecting Level 3 processing. Instead, they are now triggered when a user wants to write a L2/L3 dataset to file.
- `aws.writeArr` has been altered to accommodate both Level 2 and Level 3 dataset exporting, and is used in the level-specific writing functions `writeL2` and `writeL3`, just to keep it simple (see the sketch below).

Also, I updated:

- pypromice version increased to v1.4.0. I think we will send out a new release once all the PRs are merged. It is definitely worth a minor version release rather than just a patch release!
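A minimal sketch of the level-aware export idea (the method bodies here are hypothetical; the actual pypromice implementation may be structured differently):

```python
# Hypothetical sketch of level-aware writing in the AWS object; not the
# actual pypromice code
class AWS:
    def writeArr(self, outpath, level=3):
        """Apply post-processing and write the requested level to file."""
        if level == 2:
            self.writeL2(outpath)
        else:
            self.writeL3(outpath)

    def writeL2(self, outpath):
        # round, reformat and resample self.L2, then write to file
        ...

    def writeL3(self, outpath):
        # round, reformat and resample self.L3, then write to file
        ...
```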
I will tackle the CLI scripts now, specifically looking at `get_l2`, `get_l3` and `join_l2`.
New commit that tackles `get_l2`. Instead of having separate scripts for `get_l2` and `get_l3`, I think it is better if we just add functionality to `get_l3` that allows you to only process to Level 2 if specified.
I've now made changes that mean you can specify the level of processing (and the outputted file) with the input variable `--level` (or `-l`); for example:

```
$ get_l3 -c aws-l0/tx/config/NUK_U.toml -i aws-l0/tx -o aws-l2 -l 2
```

Where `-l 2` triggers `aws.getL1()` and `aws.getL2()`, and then writes the Level 2 dataset to file.
```
$ get_l3 -c aws-l0/tx/config/NUK_U.toml -i aws-l0/tx -o aws-l3 -l 3
```

Where `-l 3` triggers `aws.getL1()`, `aws.getL2()` and `aws.getL3()`, and then writes the Level 3 dataset to file.
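For illustration, the level switch could be wired up roughly like this in the CLI entry point (a sketch only; the `AWS` constructor arguments and the `writeArr` call are assumptions, not the actual get_l3 code):

```python
import argparse
from pypromice.process.aws import AWS

parser = argparse.ArgumentParser()
parser.add_argument('-c', '--config', required=True)
parser.add_argument('-i', '--inpath', required=True)
parser.add_argument('-o', '--outpath', required=True)
parser.add_argument('-l', '--level', type=int, default=3, choices=[2, 3])
args = parser.parse_args()

aws = AWS(args.config, args.inpath)  # constructor arguments assumed
aws.getL1()
aws.getL2()
if args.level == 3:                  # stop at Level 2 unless L3 is requested
    aws.getL3()
aws.writeArr(args.outpath)           # write the highest level produced
```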
I have also renamed Baptiste's `get_l3` to `l2_to_l3` to better represent its functionality. I want to work more on this - `pypromice.process.aws` should hold all processing and associated handling (e.g. rounding, reformatting, resampling), and so should call on the functions in the pypromice package rather than having them written out separately.
Now `l2_to_l3` uses the `pypromice.process.aws` object functionality instead of re-writing the workflow in the CLI script. Effectively, the Level 2 .nc file is loaded directly as the Level 2 property of the aws object (i.e. `aws.L2 = <.nc file>`), so we bypass the Level 0 to Level 2 processing steps in the `aws` object.
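In outline, the idea looks something like this (the file path and the `AWS` constructor arguments are assumptions for illustration):

```python
import xarray as xr
from pypromice.process.aws import AWS

# Construct the AWS object as usual (arguments assumed for illustration)
aws = AWS("aws-l0/tx/config/NUK_U.toml", "aws-l0/tx")

# Attach a previously written Level 2 file directly as the L2 property,
# bypassing the L0 -> L2 processing steps
aws.L2 = xr.open_dataset("aws-l2/NUK_U/NUK_U_hour.nc")

# Run only the Level 2 -> Level 3 step
aws.getL3()
```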
The only thing I am not sure about is the efficiency of re-loading the netcdf file. I'm curious to see what @ladsmund thinks about this. Please feel free to revert the change if you have something else in mind.
Hi @PennyHow,

Thanks for the suggestions. Although your `get_l3` is more elegant on the code side, I'm afraid we need to slice things into different scripts because of operational constraints:

1) we need a `get_l2` that processes transmissions from L0 to a L2 file, which can then be used by the BUFR processing. We run `get_l2` in parallel for different stations (for tx first and, after BUFR processing, for raw files)
2) the L2toL3 step can be run only after the L2_tx and L2_raw files have been merged, so we don't do the processing twice for their overlapping periods (note that removing those overlapping periods between tx and raw is not the topic of this PR and could be addressed later)
3) the join_l3 can be done in parallel for different sites (with each site having a list of stations defined in a config file)

I have tried to summarize these constraints and make a draft of a `l3_processor.sh` that would use these functionalities:
I hope that also clarifies the level definitions for @ladsmund.

As a side note, since we'll need to update `l3_processor.sh`, I find the latest version (where functionalities are in separate shell files) harder to read and to debug (more chances for I/O errors when calling another file).
I also wondered about the purpose of the get_l3 script. As I see it, it is a script that processes all data from an AWS through the pipeline. So far, it has only been from a single data source such as tx or raw.

It is not clear to me if we are on the same page with respect to the processing pipeline and how to invoke which steps. As I see it, the current pipeline is something like

Version 1

```
getL0tx
getL3 tx_files
getL3 raw_files
joinL3
```

This includes a resampling into monthly, daily and hourly. After this pull request, it could be something like
Version 2

```
getL0tx
getL2 tx_files
getL2 raw_files
joinL2 l2_tx_files l2_raw_files
getL3 l2_joined_files
```
Or, if we still need the L3 tx files for backwards compatibility:

Version 3

```
getL0tx
getL2 tx_files
getL3 l2_tx_files
getL2 raw_files
joinL2 l2_tx_files l2_raw_files
getL3 l2_joined_files
```
And maybe also with the historical data:

```
getL2 historical_files
getL3 l2_historical_files
joinL3 l3_files l3_historical_files
```
I am not sure how we should interpret L3 vs Level 3, both now and in the future.
I think Baptiste has a good point about using site vs station names as references for datasets. I could imagine an approach where we have multiple types of l3 products:

- Data source level
  - QAS_L_V3_tx
  - QAS_L_V3_raw
  - QAS_L_V2_tx
  - QAS_L_V2_raw
  - QAS_L_Historic_raw
- Station level
  - QAS_L_V3
  - QAS_L_V2
- Site level
  - QAS_L
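A purely illustrative way to picture that grouping (the names are copied from the list above; the structure itself is not part of the PR):

```python
# Illustrative grouping only; not an actual pypromice data structure
l3_products = {
    "data_source": ["QAS_L_V3_tx", "QAS_L_V3_raw", "QAS_L_V2_tx",
                    "QAS_L_V2_raw", "QAS_L_Historic_raw"],
    "station": ["QAS_L_V3", "QAS_L_V2"],
    "site": ["QAS_L"],
}
```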
> we need a get_l2 that processes transmission from L0 to a L2 file which then can be used by the BUFR processing
I added flexibility into the get_l3 function to also allow for L0 to L2 processing. This can be defined with the `-l`/`--level` option set to `2`, like so:

```
$ get_l3 -c aws-l0/tx/config/NUK_U.toml -i aws-l0/tx -o aws-l2 -l 2
```
So yes, it is still called "get_l3" currently, but can also be used for the functionality you described in get_l2. I'm happy to rename it. I just want to keep the option of L0-to-L3 processing for development or other uses that we may need in the future.
The reason I am hesitant about your configuration is that a lot of the post-processing functionality (i.e. rounding, reformatting, resampling) is re-written from the pypromice.process.aws module into the CLI script. By having two instances of this functionality, we have to update and maintain it in both places, which is a lot more work for us.
Version 3

```
getL0tx
getL2 tx_files
generate and publish BUFR files
getL3 l2_tx_files
getL2 raw_files
joinL2 l2_tx_files l2_raw_files
getL3 l2_joined_files
Publish to fileshare
```
So currently this is what I think we are aiming for. With the current pypromice modules/CLI scripts, it should look like this:
I think we are close!
There are a lot of new commits here now, but most of them are associated with debugging a new Action for testing the `get_l2` and `get_l2tol3` CLI scripts.
Main changes
- Functions in `pypromice.process.aws` have been moved out to separate `pypromice.process` submodules:
  - `pypromice.process.write` contains all file writing functions (`writeAll()`, `writeCSV()`, `writeNC()`, `getColNames()`)
  - `pypromice.process.resample` contains all resampling functions (`resample_dataset()`, `calculateSaturationVaporPressure()`)
  - `pypromice.process.test` contains all unit tests, i.e. the `TestProcess` class
  - `pypromice.process.utilities` contains all formatting, dataset populating, and metadata handling (`roundValues()`, `reformat_time()`, `reformat_lon()`, `popCols()`, `addBasicMeta()`, `populateMeta()`, `addVars()`, `addMeta()`)
  - `pypromice.process.load` contains all loading functions (`getConfig()`, `getL0()`, `getVars()`, `getMeta()`)
- All key functionality in the `pypromice.process.aws.AWS` class has been moved out to the respective submodules. The main one is `pypromice.process.write.prepare_and_write()`, which prepares a L2/L3 dataset with resampling, rounding and reformatting, and then writes it out to file (see the sketch after this list). This is now adopted in the CLI scripts, either via the function in the `pypromice.process.write` module (`pypromice.process.write.prepare_and_write()`) or via the `pypromice.process.aws.AWS` class function which calls upon it (`aws.writeArr()`)
- `get_l2` and `get_l3` CLI scripts now exist for separate L0-to-L2 and L0-to-L3 processing. The `get_l2tol3` CLI script performs L2-to-L3 processing
- When updating the join CLI scripts (`join_l2` and `join_l3`), I found that they were exactly the same. I don't know if I am missing something, but with the new changes it seems that most of the functionality for differentiating L2 and L3 datasets is in pypromice itself. I have therefore renamed the join CLI script to `join_levels`, which should be usable both for joining L2 datasets and for joining L3 datasets
- I added a new Action (`.github/workflows/process_l2_test.yml`) to test the `get_l2` CLI script. I wanted to add the `get_l2tol3` script to this also, but had problems with directory structuring. I can try again another time.
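As a rough usage sketch of `prepare_and_write()` (the argument list here is assumed for illustration; check `pypromice.process.write` for the actual signature):

```python
import xarray as xr
from pypromice.process.write import prepare_and_write

# Load a Level 2 dataset and let prepare_and_write() handle resampling,
# rounding, reformatting and writing to file in one call
l2 = xr.open_dataset("aws-l2/NUK_U/NUK_U_hour.nc")
prepare_and_write(l2, "aws-l2")  # arguments assumed; signature may differ
```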
To-do

- Further work on the `pypromice.process.aws.AWS` class
- Check that the `pypromice.process` submodules make sense in terms of naming conventions and functionality
- Add `get_l2tol3` testing to the Action `.github/workflows/process_l2_test.yml`
And also to see what you guys think about these changes. Please feel free to modify.
So the updated structure is:
I have now made an update of aws-operational-processing that uses the functions from this PR. The code has been running on glacio01 and posting the level_2 and level_3 data on GitHub (if cloning `level_3`, make sure to use `--depth 1`).

The comparison between `aws-l3-dev` (new) and `aws-l3` (old) is available as timeseries plots or as scatter plots. All variables are identical except `q_h_u`, `q_h_l`, `dshf_h_u`, `dshf_h_l`, `dlhf_h_u` and `dlhf_h_l`, because in the previous version they were calculated from 10 minute data and then averaged, while now they are calculated from hourly averages directly.
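That discrepancy is expected for non-linear calculations: applying a non-linear function to 10-minute samples and then averaging is not the same as applying it to the hourly average. A toy illustration (not pypromice code):

```python
import numpy as np

# Six 10-minute samples within one hour
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
f = lambda v: v**2  # stand-in for a non-linear flux calculation

print(np.mean(f(x)))  # calculate per sample, then average -> 15.17
print(f(np.mean(x)))  # average first, then calculate      -> 12.25
```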
I'll be rebasing the downstream PRs onto this one, and we can take the discussion to the next one, which is #252.
Looks good @BaptisteVandecrux. Do you want us to begin looking at #252 and adding comments, or do you have some things to do on it beforehand?
I need to adapt the scripts first now that I have a clearer idea about the structure. I'll let you know when it's ready!
The idea of this new version is that:

1) L2 data files are written into `level_2/raw` and `level_2/tx` folders by get_l2 (just like it was done for the L3 data previously). One consequence is that this low-latency Level 2 `tx` data can be posted very quickly on THREDDS for showcase and fieldwork, and processed into BUFR files.
2) L2 `tx` and `raw` files are merged using join_l2 (just like it was done for the L3 data previously). Resampling to hourly, daily and monthly values is done here, but could be left for a later stage.
3) get_l3 is now a script that loads the merged L2 file and runs `process.L2toL3.toL3` (see the sketch below). This will allow more variables to be derived at L3, and historical data to be appended once the L3 data is processed.
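In outline, the core of that get_l3 step could look like this (the file path and call signature are assumptions for illustration; only the `process.L2toL3.toL3` name is taken from the comment above):

```python
import xarray as xr
from pypromice.process.L2toL3 import toL3

# Load the merged Level 2 file and derive the additional Level 3 variables
l2 = xr.open_dataset("level_2/NUK_U/NUK_U_hour.nc")
l3 = toL3(l2)  # call signature assumed; check pypromice.process.L2toL3
```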