Following discussions with Rachel, we've confirmed that the input netCDF files which start the current pipeline and feed into preprocessing/xbt_extract_year.py are based on data downloaded from WOD. The EN4 dataset uses other data sources which are formatted differently. The current definition of the input interface to the ML pipeline, a set of per-year CSV files, is a useful one, so rather than change it we should update the preprocessing functionality to take in different input formats and output yearly CSV files. The tasks will include:
- read in XBT profiles from non-WOD sources
- extract the important variables
- split into years
- add a mode to append data to existing files where required. This would let us run the WOD preprocessing first to produce the yearly files, then run the preprocessing for each other source in turn and append its records to the relevant per-year CSV file.
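The split-by-year and append steps above could be sketched as below. This is a minimal illustration, not the pipeline's actual implementation: the column names and the `xbt_{year}.csv` filename pattern are assumptions, and the real schema lives in preprocessing/xbt_extract_year.py.

```python
import csv
from collections import defaultdict
from pathlib import Path

# Hypothetical column set; the real extracted-variable schema is
# defined by the existing WOD preprocessing and may differ.
FIELDS = ["year", "lat", "lon", "instrument"]

def split_profiles_by_year(profiles):
    """Group extracted profile records (dicts) by their year."""
    by_year = defaultdict(list)
    for prof in profiles:
        by_year[prof["year"]].append(prof)
    return by_year

def write_year_csvs(profiles, out_dir, append=False):
    """Write one CSV per year. With append=True, records are added to
    any existing per-year file, so non-WOD sources can be merged into
    the files produced by the WOD preprocessing run."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for year, rows in split_profiles_by_year(profiles).items():
        path = out_dir / f"xbt_{year}.csv"  # assumed naming convention
        appending = append and path.exists()
        with open(path, "a" if appending else "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=FIELDS)
            if not appending:
                writer.writeheader()  # header only on first write
            writer.writerows(rows)
```

Running the WOD preprocessing with `append=False` and each subsequent source with `append=True` would then accumulate all sources into one CSV per year.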