I just pushed changes to the processing performed in the `pypromice.process.aws` object, mainly for better compatibility (and also in line with Mads' comments):

- Post-processing steps (e.g. rounding, reformatting, resampling) are no longer performed at the end of Level 2 processing, as this runs the risk of affecting Level 3 processing. Instead, they are now triggered when a user wants to write a L2/L3 dataset to file.
- `aws.writeArr` has been altered to accommodate both Level 2 and Level 3 dataset exporting, and is used in the level-specific writing functions `writeL2` and `writeL3`, just to keep it simple (see the sketch below).

Also, I updated:

- pypromice version increased to v1.4.0. I think we will send out a new release once all the PRs are merged. It is definitely worth a minor version release rather than just a patch release!
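A minimal sketch of the level-aware export idea (the method bodies here are hypothetical; the actual pypromice implementation may be structured differently):

```python
# Hypothetical sketch of level-aware writing in the AWS object; not the
# actual pypromice code
class AWS:
    def writeArr(self, outpath, level=3):
        """Apply post-processing and write the requested level to file."""
        if level == 2:
            self.writeL2(outpath)
        else:
            self.writeL3(outpath)

    def writeL2(self, outpath):
        # round, reformat and resample self.L2, then write to file
        ...

    def writeL3(self, outpath):
        # round, reformat and resample self.L3, then write to file
        ...
```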
I will tackle the CLI scripts now, specifically looking at `get_l2`, `get_l3` and `join_l2`.
New commit that tackles `get_l2`. Instead of having separate scripts for `get_l2` and `get_l3`, I think it is better if we just add functionality to `get_l3` that allows you to only process to Level 2 if specified.
I've now made changes that mean you can specify the level of processing (and the outputted file) with the input variable `--level` (or `-l`); for example:

```
$ get_l3 -c aws-l0/tx/config/NUK_U.toml -i aws-l0/tx -o aws-l2 -l 2
```

Where `-l 2` triggers `aws.getL1()` and `aws.getL2()`, and then writes the Level 2 dataset to file.
```
$ get_l3 -c aws-l0/tx/config/NUK_U.toml -i aws-l0/tx -o aws-l3 -l 3
```

Where `-l 3` triggers `aws.getL1()`, `aws.getL2()` and `aws.getL3()`, and then writes the Level 3 dataset to file.
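For illustration, the level switch could be wired up roughly like this in the CLI entry point (a sketch only; the `AWS` constructor arguments and the `writeArr` call are assumptions, not the actual get_l3 code):

```python
import argparse
from pypromice.process.aws import AWS

parser = argparse.ArgumentParser()
parser.add_argument('-c', '--config', required=True)
parser.add_argument('-i', '--inpath', required=True)
parser.add_argument('-o', '--outpath', required=True)
parser.add_argument('-l', '--level', type=int, default=3, choices=[2, 3])
args = parser.parse_args()

aws = AWS(args.config, args.inpath)  # constructor arguments assumed
aws.getL1()
aws.getL2()
if args.level == 3:                  # stop at Level 2 unless L3 is requested
    aws.getL3()
aws.writeArr(args.outpath)           # write the highest level produced
```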
I have also renamed Baptiste's `get_l3` to `l2_to_l3` to better represent its functionality. I want to work more on this - `pypromice.process.aws` should hold all processing and associated handling (e.g. rounding, reformatting, resampling), and so should call on the functions in the pypromice package rather than having them written out separately.
Now `l2_to_l3` uses the `pypromice.process.aws` object functionality instead of re-writing the workflow in the CLI script. Effectively, the Level 2 .nc file is loaded directly as the Level 2 property of the aws object (i.e. `aws.L2 = <.nc file>`), so we bypass the Level 0 to Level 2 processing steps in the `aws` object.
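In outline, the idea looks something like this (the file path and the `AWS` constructor arguments are assumptions for illustration):

```python
import xarray as xr
from pypromice.process.aws import AWS

# Construct the AWS object as usual (arguments assumed for illustration)
aws = AWS("aws-l0/tx/config/NUK_U.toml", "aws-l0/tx")

# Attach a previously written Level 2 file directly as the L2 property,
# bypassing the L0 -> L2 processing steps
aws.L2 = xr.open_dataset("aws-l2/NUK_U/NUK_U_hour.nc")

# Run only the Level 2 -> Level 3 step
aws.getL3()
```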
The only thing I am not sure about is the efficiency of re-loading the netcdf file. I'm curious to see what @ladsmund thinks about this. Please feel free to revert the change if you have something else in mind.
Hi @PennyHow,

Thanks for the suggestions. Although your `get_l3` is more elegant on the code side, I'm afraid we need to slice things into different scripts because of operational constraints:

1) we need a `get_l2` that processes transmissions from L0 to a L2 file, which can then be used by the BUFR processing. We run `get_l2` in parallel for different stations (for tx first and, after BUFR processing, for raw files)
2) the L2toL3 step can be run only after the L2_tx and L2_raw files have been merged, so we don't do the processing twice for their overlapping periods (note that removing those overlapping periods between tx and raw is not the topic of this PR and could be addressed later)
3) the join_l3 can be done in parallel for different sites (with each site having a list of stations defined in a config file)

I have tried to summarize these constraints and make a draft of a `l3_processor.sh` that would use these functionalities:
I hope that also clarifies the level definitions for @ladsmund.

As a side note, since we'll need to update `l3_processor.sh`, I find the latest version (where functionalities are in separate shell files) harder to read and to debug (more chances for I/O errors when calling another file).
I also wondered about the purpose of the get_l3 script. As I see it, it is a script that processes all data from an AWS through the pipeline. So far, it has only been from a single data source such as tx or raw.

It is not clear to me if we are on the same page with respect to the processing pipeline and how to invoke which steps. As I see it, the current pipeline is something like

Version 1

```
getL0tx
getL3 tx_files
getL3 raw_files
joinL3
```

This includes a resampling into monthly, daily and hourly. After this pull request, it could be something like
Version 2

```
getL0tx
getL2 tx_files
getL2 raw_files
joinL2 l2_tx_files l2_raw_files
getL3 l2_joined_files
```
Or, if we still need the L3 tx files for backwards compatibility:

Version 3

```
getL0tx
getL2 tx_files
getL3 l2_tx_files
getL2 raw_files
joinL2 l2_tx_files l2_raw_files
getL3 l2_joined_files
```
And maybe also with the historical data:

```
getL2 historical_files
getL3 l2_historical_files
joinL3 l3_files l3_historical_files
```
I am not sure how we should interpret L3 vs Level 3, both now and in the future.
I think Baptiste has a good point about using site vs station names as references for datasets. I could imagine an approach where we have multiple types of l3 products:

- Data source level
  - QAS_L_V3_tx
  - QAS_L_V3_raw
  - QAS_L_V2_tx
  - QAS_L_V2_raw
  - QAS_L_Historic_raw
- Station level
  - QAS_L_V3
  - QAS_L_V2
- Site level
  - QAS_L
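A purely illustrative way to picture that grouping (the names are copied from the list above; the structure itself is not part of the PR):

```python
# Illustrative grouping only; not an actual pypromice data structure
l3_products = {
    "data_source": ["QAS_L_V3_tx", "QAS_L_V3_raw", "QAS_L_V2_tx",
                    "QAS_L_V2_raw", "QAS_L_Historic_raw"],
    "station": ["QAS_L_V3", "QAS_L_V2"],
    "site": ["QAS_L"],
}
```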
> we need a get_l2 that processes transmission from L0 to a L2 file which then can be used by the BUFR processing
I added flexibility into the get_l3 function to also allow for L0 to L2 processing. This can be defined with the `-l`/`--level` option set to `2`, like so:

```
$ get_l3 -c aws-l0/tx/config/NUK_U.toml -i aws-l0/tx -o aws-l2 -l 2
```
So yes, it is still called "get_l3" currently, but can also be used for the functionality you described in get_l2. I'm happy to rename it. I just want to keep the option of L0-to-L3 processing for development or other uses that we may need in the future.
The reason I am hesitant about your configuration is that a lot of the post-processing functionality (i.e. rounding, reformatting, resampling) is re-written from the pypromice.process.aws module into the CLI script. By having two instances of this functionality, we have to update and maintain it in both places, which is a lot more work for us.
Version 3

```
getL0tx
getL2 tx_files
generate and publish BUFR files
getL3 l2_tx_files
getL2 raw_files
joinL2 l2_tx_files l2_raw_files
getL3 l2_joined_files
Publish to fileshare
```
So currently this is what I think we are aiming for. With the current pypromice modules/CLI scripts, it should look like this:
I think we are close!
There are a lot of new commits here now, but most of them are associated with debugging a new Action for testing the `get_l2` and `get_l2tol3` CLI scripts.
Main changes
- Functions in `pypromice.process.aws` have been moved out to separate `pypromice.process` submodules:
  - `pypromice.process.write` contains all file writing functions (`writeAll()`, `writeCSV()`, `writeNC()`, `getColNames()`)
  - `pypromice.process.resample` contains all resampling functions (`resample_dataset()`, `calculateSaturationVaporPressure()`)
  - `pypromice.process.test` contains all unit tests, i.e. the `TestProcess` class
  - `pypromice.process.utilities` contains all formatting, dataset populating, and metadata handling (`roundValues()`, `reformat_time()`, `reformat_lon()`, `popCols()`, `addBasicMeta()`, `populateMeta()`, `addVars()`, `addMeta()`)
  - `pypromice.process.load` contains all loading functions (`getConfig()`, `getL0()`, `getVars()`, `getMeta()`)
- All key functionality in the `pypromice.process.aws.AWS` class has been moved out to the respective submodules. The main one is `pypromice.process.write.prepare_and_write()`, which prepares a L2/L3 dataset with resampling, rounding and reformatting, and then writes it out to file (see the sketch after this list). This is now adopted in the CLI scripts, either via the function in the `pypromice.process.write` module (`pypromice.process.write.prepare_and_write()`) or via the `pypromice.process.aws.AWS` class function which calls upon it (`aws.writeArr()`)
- `get_l2` and `get_l3` CLI scripts now exist for separate L0-to-L2 and L0-to-L3 processing. The `get_l2tol3` CLI script performs L2-to-L3 processing
- When updating the join CLI scripts (`join_l2` and `join_l3`), I found that they were exactly the same. I don't know if I am missing something, but with the new changes it seems that most of the functionality for differentiating L2 and L3 datasets is in pypromice itself. I have therefore renamed the join CLI script to `join_levels`, which should be usable both for joining L2 datasets and for joining L3 datasets
- I added a new Action (`.github/workflows/process_l2_test.yml`) to test the `get_l2` CLI script. I wanted to add the `get_l2tol3` script to this also, but had problems with directory structuring. I can try again another time.
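As a rough usage sketch of `prepare_and_write()` (the argument list here is assumed for illustration; check `pypromice.process.write` for the actual signature):

```python
import xarray as xr
from pypromice.process.write import prepare_and_write

# Load a Level 2 dataset and let prepare_and_write() handle resampling,
# rounding, reformatting and writing to file in one call
l2 = xr.open_dataset("aws-l2/NUK_U/NUK_U_hour.nc")
prepare_and_write(l2, "aws-l2")  # arguments assumed; signature may differ
```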
To-do

- Further work on the `pypromice.process.aws.AWS` class
- Check that the `pypromice.process` submodules make sense in terms of naming conventions and functionality
- Add `get_l2tol3` testing to the Action `.github/workflows/process_l2_test.yml`
And also to see what you guys think about these changes. Please feel free to modify.
So the updated structure is:
I have now made an update of aws-operational-processing that uses the functions from this PR. The code has been running on glacio01 and posting the level_2 and level_3 data on GitHub (if cloning `level_3`, make sure to use `--depth 1`).

The comparison between `aws-l3-dev` (new) and `aws-l3` (old) is available as timeseries plots or as scatter plots. All variables are identical except `q_h_u`, `q_h_l`, `dshf_h_u`, `dshf_h_l`, `dlhf_h_u` and `dlhf_h_l`, because in the previous version they were calculated from 10 minute data and then averaged, while now they are calculated from hourly averages directly.
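That discrepancy is expected for non-linear calculations: applying a non-linear function to 10-minute samples and then averaging is not the same as applying it to the hourly average. A toy illustration (not pypromice code):

```python
import numpy as np

# Six 10-minute samples within one hour
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
f = lambda v: v**2  # stand-in for a non-linear flux calculation

print(np.mean(f(x)))  # calculate per sample, then average -> 15.17
print(f(np.mean(x)))  # average first, then calculate      -> 12.25
```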
I'll be rebasing the downstream PRs onto this one, and we can take the discussion to the next one, which is #252.
Looks good @BaptisteVandecrux. Do you want us to begin looking at #252 and adding comments, or do you have some things to do on it beforehand?
I need to adapt the scripts first now that I have a clearer idea about the structure. I'll let you know when it's ready!
The idea of this new version is that:

1) L2 data files are written into `level_2/raw` and `level_2/tx` folders by get_l2 (just like it was done for the L3 data previously). One consequence is that this low-latency Level 2 `tx` data can be posted very quickly on THREDDS for showcase and fieldwork, and processed into BUFR files.
2) L2 `tx` and `raw` files are merged using join_l2 (just like it was done for the L3 data previously). Resampling to hourly, daily and monthly values is done here, but could be left for a later stage.
3) get_l3 is now a script that loads the merged L2 file and runs `process.L2toL3.toL3` (see the sketch below). This will allow more variables to be derived at L3, and historical data to be appended once the L3 data is processed.
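In outline, the core of that get_l3 step could look like this (the file path and call signature are assumptions for illustration; only the `process.L2toL3.toL3` name is taken from the comment above):

```python
import xarray as xr
from pypromice.process.L2toL3 import toL3

# Load the merged Level 2 file and derive the additional Level 3 variables
l2 = xr.open_dataset("level_2/NUK_U/NUK_U_hour.nc")
l3 = toL3(l2)  # call signature assumed; check pypromice.process.L2toL3
```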