Creation of functions to read non-WOD files

CARSv2 / cars-v2

CARSv2 project repository - public

MIT License


Creation of functions to read non-WOD files #9

Open BecCowley opened 1 year ago

BecCowley commented 1 year ago

I need some clarity on the output from reading non-WOD files.

My current thinking is to write a function for each file type, with every function outputting the same structure (perhaps a dataframe). Then any notebook could call the function to access the data ready for input into the mapping.

Question: are we creating our tools as notebooks as a standard, or ultimately producing *.py functions/definitions/tools?

Some notes:

CTD files available on THREDDS server at AODN could be aggregated using similar code to Michael Hemming's aggregation code: https://github.com/mphemming/QAQC-Summit-2023
ANMN moorings files: we could start with the aggregated product that is already available on AODN.
Chris has started some code to read the MNF and AIMS CTD datasets. I can edit this to create a function with a dataframe output.
Need to define what the dataframe or output items should look like. Bare minimum would be latitude, longitude, date/time, temperature, depth, quality flags (or apply these and remove bad data before output).
Other variables that might be useful: datasource (filename), instrument type, .......
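As a rough sketch of the common output discussed above, one option is a reader per file type that returns records with the same fixed set of fields. Everything named here is an assumption for illustration: the `read_ctd_csv` function, the `COMMON_FIELDS` names and the input column names (`LAT`, `LON`, `DATE`, `DEPTH`, `TEMP`, `TEMP_QC`) are hypothetical and not taken from any real AIMS or MNF file.

```python
import csv
import io

# Hypothetical common output fields, following the "bare minimum" list above.
COMMON_FIELDS = ["latitude", "longitude", "time", "depth",
                 "temperature", "temperature_qc", "datasource"]


def read_ctd_csv(handle, datasource):
    """Read one (hypothetical) CTD CSV layout into the common record format."""
    records = []
    for row in csv.DictReader(handle):
        records.append({
            "latitude": float(row["LAT"]),
            "longitude": float(row["LON"]),
            "time": row["DATE"],  # kept as text in this sketch
            "depth": float(row["DEPTH"]),
            "temperature": float(row["TEMP"]),
            # Rows without a QC flag get None rather than an invented flag.
            "temperature_qc": int(row["TEMP_QC"]) if row.get("TEMP_QC") else None,
            "datasource": datasource,
        })
    return records


# Made-up example data, only to show the shape of the output.
sample = io.StringIO(
    "LAT,LON,DATE,DEPTH,TEMP,TEMP_QC\n"
    "-19.5,147.2,2010-03-01,5.0,28.1,1\n"
    "-19.5,147.2,2010-03-01,10.0,27.9,\n"
)
records = read_ctd_csv(sample, datasource="example.csv")
```

If a dataframe is preferred, `pandas.DataFrame(records)` builds one from this list of dictionaries, so each reader can stay small and only the final step depends on pandas.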

Thomas-Moore-Creative commented 1 year ago

> Question: are we creating our tools as notebooks as a standard, or ultimately producing *.py functions/definitions/tools?

@BecCowley - I'm not clear on what @ChrisC28 thinks here but in general I use notebooks for all my code development. As a final product notebooks are useful for code workflow documentation, examples, and sharing results.

For any often repeated tasks I usually take a simple approach of defining python functions - first right in the notebooks themselves and then collecting them in a local tools.py file (or files) that I can import as needed.

I'm not an expert on making proper python packages but I'd suggest that as we all build discrete functions the easiest way is to place mature functions into a shared tools.py file - or whatever is equivalent for julia?

BecCowley commented 1 year ago

@Thomas-Moore-Creative, thanks for this information. I'm happy to make notebooks, I haven't done it much before and hence my question about how they function together. I will follow your advice! I don't know how tools.py files work, I'll see what I can find out, but would be happy to hear you explain it to me.

Here are some notes from @ChrisC28 via email:

Here's my current (basic) workflow with the WOD data in ragged array format:

For each platform type (CTD, PFL, XBT, ...) and for a given variable (say Temperature), use a bit of magic to tag every obs value with a profile index.
Filter out the bad profiles (note, I haven't done this yet... it's on the to-do list).
For each profile, use the fancy TEOS-10 vertical interpolation to put the profile on a set of standard levels (at the moment, just every 10 m or something).
Store the interpolated profile in a netCDF file with the dimensions (cast, depth).

The above is a sketch, but you get the idea.
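The tagging and gridding steps of that workflow could look roughly like the sketch below. It is not the project's code: plain linear interpolation stands in for the TEOS-10 vertical interpolation, the function names are made up, and the data are invented. With numpy available, `np.repeat` and `np.interp` would cover the same two jobs.

```python
from bisect import bisect_right


def tag_profile_index(row_sizes):
    """Give every observation the index of the profile it belongs to."""
    return [i for i, n in enumerate(row_sizes) for _ in range(n)]


def interp_to_levels(depths, values, levels):
    """Linearly interpolate one profile (depths increasing) onto levels.

    Levels outside the sampled depth range are returned as None.
    """
    out = []
    for z in levels:
        if z < depths[0] or z > depths[-1]:
            out.append(None)
            continue
        j = bisect_right(depths, z)
        if j == len(depths):  # z equals the deepest sample
            out.append(values[-1])
            continue
        z0, z1 = depths[j - 1], depths[j]
        w = (z - z0) / (z1 - z0)
        out.append(values[j - 1] + w * (values[j] - values[j - 1]))
    return out


# Made-up ragged data: two profiles with 3 and 2 observations.
row_sizes = [3, 2]
depth = [0.0, 10.0, 30.0, 5.0, 25.0]
temp = [20.0, 19.0, 15.0, 18.0, 16.0]
levels = [0.0, 10.0, 20.0]

index = tag_profile_index(row_sizes)

# Build the (cast, depth) grid, one row per profile.
gridded = []
for cast in range(len(row_sizes)):
    z = [d for d, i in zip(depth, index) if i == cast]
    t = [v for v, i in zip(temp, index) if i == cast]
    gridded.append(interp_to_levels(z, t, levels))
```

Leaving out-of-range levels empty rather than extrapolating is a choice made for this sketch only; the real pipeline may handle profile ends differently.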

So, the dilemma I have is whether we put the non-WOD data in WOD ragged array format, or save the profiles directly (i.e. with dimensions (cast, depth)). I'm leaning towards the WOD format: it's what the data assimilation people tend to use, and it allows me to run the data through the exact same processing routines that we use for the WOD data.

As such, here's what I propose: a reader for each file type that takes (for example) the AIMS data, processes it following your magic (adds QC flags where required, etc.) and spits it out in WOD ragged array format.
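To make the ragged array idea concrete, here is a minimal sketch of packing per-profile data into that kind of layout: per-cast metadata, a count of observations per cast, and flat observation lists holding all casts back to back. The `to_ragged` function and the dictionary keys are placeholders chosen for this example, not the variable names used in actual WOD files.

```python
def to_ragged(profiles):
    """Flatten per-profile lists into a ragged-array layout."""
    ragged = {"lat": [], "lon": [], "row_size": [], "z": [], "temperature": []}
    for p in profiles:
        if len(p["z"]) != len(p["temperature"]):
            raise ValueError("depth and temperature lengths differ")
        # One entry per cast.
        ragged["lat"].append(p["lat"])
        ragged["lon"].append(p["lon"])
        ragged["row_size"].append(len(p["z"]))
        # Observations from all casts are appended end to end.
        ragged["z"].extend(p["z"])
        ragged["temperature"].extend(p["temperature"])
    return ragged


# Made-up profiles of different lengths.
profiles = [
    {"lat": -19.5, "lon": 147.2, "z": [0.0, 10.0, 30.0],
     "temperature": [20.0, 19.0, 15.0]},
    {"lat": -20.1, "lon": 148.0, "z": [5.0, 25.0],
     "temperature": [18.0, 16.0]},
]
ragged = to_ragged(profiles)
```

The row sizes are what allow a later step to split the flat lists back into individual casts, so they always sum to the length of the observation lists.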

Thomas-Moore-Creative commented 1 year ago

> @Thomas-Moore-Creative, thanks for this information. I'm happy to make notebooks, I haven't done it much before and hence my question about how they function together. I will follow your advice! I don't know how tools.py files work, I'll see what I can find out, but would be happy to hear you explain it to me.

Again, I'll note that my approaches might not be best practice, but I'd suggest you start by testing functions in notebooks; once you are confident about a function, you can put it into a my_functions.py file for general import into any notebook or Python code. The notes below might help more.

You can define a local function in your Python Jupyter notebook by simply defining the function in a code cell. The function will then be available for use in subsequent cells. To call the function, simply include its name followed by parentheses and any required arguments. For example:

def my_function(arg1, arg2):
    # do something with the arguments, e.g. add them together
    result = arg1 + arg2
    return result

my_function(1, 2)  # call the function with concrete values

Note that local functions are only available within the same notebook where they are defined.

To import functions from a local file, you can use a from ... import statement with the name of the file (without the .py extension) followed by the name of the function. The file needs to be in the same folder as the notebook, or in another folder on Python's import path. For example, if you have a file my_functions.py that contains a function my_function, you can import it using:

from my_functions import my_function

Then you can call the function using my_function() in your code.
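If the file lives in a different folder from the notebook, that folder has to be added to the import path first. The sketch below demonstrates this with a temporary folder standing in for a shared tools folder; the folder and the contents of `my_functions.py` are created here only so that the example is self-contained.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Stand-in for a shared tools folder (an assumption for this example).
tools_dir = Path(tempfile.mkdtemp())
(tools_dir / "my_functions.py").write_text(
    "def my_function(arg1, arg2):\n"
    "    return arg1 + arg2\n"
)

# Make the folder importable, then import the function as usual.
sys.path.insert(0, str(tools_dir))
importlib.invalidate_caches()

from my_functions import my_function

total = my_function(1, 2)
```

In a notebook, the same `sys.path.insert` line pointing at the real folder would go in one of the first cells.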

BecCowley commented 1 year ago

@Thomas-Moore-Creative, thanks. Fairly straightforward, then!

BecCowley commented 1 year ago

List of issues that @BecCowley ran into while converting the AIMS CTD files from CSV to netCDF format:

The csv files provided by AIMS (extracted from their database) do not have a TIME variable, therefore only a DATE is associated with the data. Eight files didn't even have a date in the header information (eg, 'AimsWqCtdDataStation-KIM163.csv')
There are no QC flags, so I don't know if the data has been QC'd or not
There is very little metadata associated with each file - no ship name, PI name, instrument type etc. For some of the later files, this information is there, but in various text formats.
Reading the units for each parameter is difficult as the text output in the csv file is variable. Parsing will be tricky. For this reason, I didn't include any parameters beyond DEPTH, PRESSURE, TEMPERATURE and SALINITY.
An example of unit variability is for OXYGEN which has units of umol/kg and % saturation, but the values for some of these data are clearly incorrect for the units.
There are other parameters that would take some careful work to extract in a uniform manner.
My own learning curve using python tools for netcdf. I spent a lot of time using xarray only to find that I couldn't manipulate the output to match WOD format (ragged array) or IMOS format (single NC files). The issue was mostly to do with setting dimensions for different variables, setting TIME as a dimensionless variable or TIME with 'casts' as the dimension.
Ultimately, I used the IMOS tools for converting the files to netcdf. They have a python toolbox which uses text files for global attributes and variable attribute setting, which I understood well as I've been using it for XBT data conversions.
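For the date-only problem in the list above, one defensive approach is to parse the header date explicitly and return nothing when it is missing, so that files without a date can be reported instead of silently given a wrong time. This is a sketch only: `parse_header_date` is a hypothetical helper, and the two date layouts are assumptions that would need checking against the real AIMS headers.

```python
from datetime import datetime

# Assumed date layouts; the real AIMS header formats would need checking.
CANDIDATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def parse_header_date(text):
    """Return a datetime at midnight for a date-only string, or None."""
    if not text or not text.strip():
        return None
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue
    return None


with_date = parse_header_date("2010-03-01")
missing = parse_header_date("")
```

Because only a date is available, the parsed value falls at midnight; it would be worth recording in the output metadata that the time of day is unknown rather than observed.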

BecCowley commented 1 year ago

@ChrisC28 I have converted the MNF CTD data from the CARS region. Location in /oa-decadal-climate/work/observations/CARSv2_ancillary/MNF/NC

Notes:

Data was extracted from the MNF-managed Oracle database by hand (special request), as the Data Trawler has not been designed for such big extractions.
Data was supplied in two .tsv files. One has metadata, the other has actual data by parameter name.
Python code has been built to do the transformation and output is 3862 netcdf files ('MNFtxt2netcdf.py').
QC flags are in place, please use only flags 1, 2 and 5.
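Applying the flag recommendation above when reading the files could be sketched as follows. The `keep_good` helper and the example values are made up for illustration; only the set of accepted flags (1, 2 and 5) comes from the notes.

```python
GOOD_FLAGS = {1, 2, 5}  # flags recommended in the notes above


def keep_good(values, flags, good_flags=GOOD_FLAGS):
    """Replace values whose QC flag is not in good_flags with None."""
    if len(values) != len(flags):
        raise ValueError("values and flags must have the same length")
    return [v if f in good_flags else None for v, f in zip(values, flags)]


# Made-up temperatures and flags.
temperature = [20.1, 19.8, 35.0, 18.9]
temperature_qc = [1, 2, 4, 5]
cleaned = keep_good(temperature, temperature_qc)
```

Masking rather than dropping the flagged values keeps every variable in a profile the same length, which makes later interpolation onto standard levels simpler.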

Will push my code to the repository with the other converters. They are now in the src/features folder, not in the notebooks.

ChrisC28 commented 1 year ago

@BecCowley Awesome! Thanks for that. I'll try to get them into "the system" this week or next.

Paul Sandry is interested in having those data available for the ROAM data assimilation system.