IQuOD / Formats

Data format development

How to structure aggregated files? #4

Open mhidas opened 6 years ago

mhidas commented 6 years ago

The "natural" way to store measured values from multiple profiles is in a two-dimensional array, with a profile index and a pressure/level index, as is done in the core-Argo format. In the general case where profiles have different numbers of levels, the CF conventions call this the incomplete multidimensional array representation. It is simple to read data from such files, but they can be large and contain lots of fill values (because their size is set by the profile with the maximum number of levels). The extra space taken up by the fill values is largely eliminated when the file is compressed, though this comes at the cost of slightly increased access time.
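To illustrate the point about fill values and compression, here is a minimal sketch using synthetic data. The fill value, the profile lengths and the use of `zlib` on packed 32-bit floats are all assumptions made for illustration; `zlib` only stands in for the deflate compression available in netCDF-4, and the numbers it prints say nothing about real IQuOD files.

```python
import random
import struct
import zlib

FILL = -9999.0  # hypothetical fill value, chosen for illustration

random.seed(0)
# Synthetic profile lengths (levels per profile); not taken from any real file.
lengths = [5, 50, 200, 1000]
profiles = [[random.uniform(0.0, 30.0) for _ in range(n)] for n in lengths]

# Incomplete multidimensional layout: pad every profile to the longest one.
max_levels = max(lengths)
padded = [v for p in profiles for v in p + [FILL] * (max_levels - len(p))]


def to_bytes(values):
    """Pack values as little-endian 32-bit floats."""
    return struct.pack(f"<{len(values)}f", *values)


raw = to_bytes(padded)
compressed = zlib.compress(raw, 6)
fill_fraction = 1 - sum(lengths) / len(padded)

print(f"fill fraction: {fill_fraction:.2f}")
print(f"raw bytes: {len(raw)}, compressed bytes: {len(compressed)}")
```

The long runs of identical fill values compress to almost nothing, so the compressed size is driven mainly by the real measurements.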

The alternative is the contiguous ragged array representation, which packs all the profiles "end-to-end" into a single dimension, removing the need for fill values. Reading values from this structure is slightly more complex because of the need to keep track of the length of each profile.
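As a minimal sketch of what "keeping track of the length of each profile" involves, the function below splits a packed array back into profiles using a per-profile count. The names `split_ragged`, `temperature` and `row_size` are hypothetical and the data are made up; in a CF-style file the count variable is the one carrying the `sample_dimension` attribute.

```python
from itertools import accumulate


def split_ragged(values, row_size):
    """Split a contiguous ragged array into one list per profile.

    values   -- all observations packed end-to-end
    row_size -- number of observations in each profile, in order
    """
    if sum(row_size) != len(values):
        raise ValueError("row sizes do not add up to the number of values")
    ends = list(accumulate(row_size))  # index one past the end of each profile
    starts = [0] + ends[:-1]           # index of the first value of each profile
    return [values[s:e] for s, e in zip(starts, ends)]


# Hypothetical data: three profiles with 3, 1 and 2 levels.
temperature = [20.1, 18.4, 15.0, 21.3, 19.9, 17.2]
row_size = [3, 1, 2]
profiles = split_ragged(temperature, row_size)
print(profiles)
```

Once the start offsets are computed, any single profile can be sliced out directly without reading the ones before it.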

mhidas commented 6 years ago

One point in favour of the ragged-array option is that WODselect can already output this format, so it would be the easier one to implement. Tim Boyer suggests that we should focus on making sure the software likely to be used to read IQuOD data is able to handle this format. In particular, ERDDAP already does.

mhidas commented 6 years ago

On the other hand, the ragged array structure is a step away from the Argo format, which is essentially a multidimensional array structure. (Even though core-Argo is generally thought of as a single-profile format, it is actually able to hold more than one profile, as it has an N_PROF dimension.)

So the question is, what software is most commonly used to read Argo profile data, and could it handle the ragged array format? (assuming variable names, metadata and other details are the same as Argo)

mhidas commented 6 years ago

I did a simple test to look at the impact of file structure and compression on file size, using the file iquod_xbt_2016.nc** (originally 525 MB). I first extracted just the essential variables (lat, lon, time, z, Temperature + their uncertainties & flags), leaving 487 MB in the current format. Then I tried the following, with the resulting file sizes:

So with compression, both versions (ragged array and incomplete-multidimensional) would be about 1/3 the size of the current ragged-array file.

** Note: the file can be obtained from ftp://ftp.nodc.noaa.gov/pub/WOD/SELECT/iquod/netcdf/2016/iquod_xbt_2016.nc . I actually used an earlier version of this file, but I don't think the data in the file would have changed much so the result should be the same.

Thomas-Moore-Creative commented 6 years ago

Aloha @mhidas & @BoyerWOD

I think this is a key discussion - at least from my selfish needs point of view. Why not have the IQuOD multi-cast files in an "Argo-like", compressed, multidimensional format?

In the meantime, do we have "approved & tested" Python code to read in the ragged array format?

My selfish needs are to suck in yearly NC files and build dataframes based on a set of filters to then feed into a QC pipeline or reformat into daily files for a DA system.
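One possible shape for that workflow, sketched with plain Python and made-up data: walk the ragged array once, repeat each profile's metadata on every observation, and keep only the rows that pass a filter. The function `iter_rows`, its arguments and the sample values are hypothetical, not part of any IQuOD tool.

```python
def iter_rows(lat, lon, values, row_size, keep=lambda row: True):
    """Yield one dict per observation, repeating profile metadata on each row."""
    start = 0
    for i, n in enumerate(row_size):
        for level, v in enumerate(values[start:start + n]):
            row = {"profile": i, "level": level,
                   "lat": lat[i], "lon": lon[i], "temperature": v}
            if keep(row):
                yield row
        start += n  # move to the first value of the next profile


# Hypothetical data: three profiles with 3, 1 and 2 levels.
lat = [-42.0, 10.5, -33.1]
lon = [147.0, -150.2, 151.3]
temperature = [20.1, 18.4, 15.0, 21.3, 19.9, 17.2]
row_size = [3, 1, 2]

# Keep only observations from southern-hemisphere profiles.
rows = list(iter_rows(lat, lon, temperature, row_size,
                      keep=lambda r: r["lat"] < 0))
print(len(rows))
```

A list of dicts like `rows` can be passed straight to `pandas.DataFrame` if a dataframe is the desired end product.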

BoyerWOD commented 6 years ago

Thomas,

The Argo files I have worked with, from the GODAE server, are not compressed. I deal with the single-cycle Argo files, which have ocean profile variable arrays with dimensions N_PROF (number of profiles), N_PARAM (number of parameters) and N_LEVELS (number of depth/pressure levels). This works fine for the single-cycle Argo files, and it also works OK in the single-float (all cycles) files, because a single cycle or single float usually has data for N_PARAM at each of N_LEVELS. Even when this is not the case, for instance on Argo floats which have a primary temperature/salinity sensor measuring from 2000 db to 5 db and a secondary sensor which only measures in the top 20 db all the way to the surface, the 'empty' space (that which is filled with missing values for one parameter or another) is not large, at least for a single cycle.

But when you factor in BioArgo files, the empty space in these files starts to become significant. When you start to aggregate files over many floats with different instrument sets, you really start to run into big empty-space problems, by which I mean large file sizes with a predominance of missing values. In IQuOD, with different instruments measuring multiple variables together in a file, the N_LEVELS array dimension becomes more empty space than useful values, and compression becomes the only viable way to keep the files manageable. (Note that Marty's example in this thread is for XBT data, which have only one measured variable, temperature. I will post the latest PFL files, which now contain BioArgo, this week, hopefully. These are a better test regarding size since they have many profile variables.)
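The empty-space argument can be made concrete with a small calculation. This is only a sketch: the function `fill_fraction` is hypothetical, and the three casts and their level counts are invented to show the arithmetic, not drawn from IQuOD or Argo data.

```python
def fill_fraction(casts):
    """Fraction of slots holding fill values in an N_PROF x N_PARAM x N_LEVELS array.

    casts -- list of dicts mapping parameter name -> number of measured levels
    """
    params = {name for cast in casts for name in cast}          # defines N_PARAM
    n_levels = max(n for cast in casts for n in cast.values())  # defines N_LEVELS
    total = len(casts) * len(params) * n_levels
    used = sum(sum(cast.values()) for cast in casts)
    return 1 - used / total


# Hypothetical casts: temperature only, temperature + salinity, and a bio float.
casts = [
    {"temperature": 800},
    {"temperature": 2000, "salinity": 2000},
    {"temperature": 1000, "salinity": 1000, "oxygen": 500, "chlorophyll": 200},
]
print(f"aggregated: {fill_fraction(casts):.2f}")
print(f"single cast: {fill_fraction(casts[1:2]):.2f}")
```

A single cast with the same levels for every parameter wastes nothing, while aggregating casts from different instrument sets forces every cast to reserve space for the longest profile and for every parameter seen anywhere in the file.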

Compression comes with its own problems. I am not sure what happens if we put a compressed file on a THREDDS server to be accessed by multiple machine-to-machine users at the same time, each attempting to access a subset of multiple files. I am not sure this is a viable option. Also, I would prefer not to rely on a format which is only manageable with compression.

Bill Mills noted on another GitHub thread that support for IQuOD netCDF was in his mid-range plans, after getting out the first AutoQC data set. I can ask him for more specifics. Bob Simons, creator of ERDDAP, has written software for ERDDAP to read the IQuOD format, which will be available in his next release. He did not give me a specific release date. I myself do not currently program in Python. I have FORTRAN routines for reading the IQuOD ragged array form if that will help. Another option is to use the wodpy module to read the IQuOD data in ASCII format. Of course, you would then have to download the IQuOD dataset in that form through WODselect, which will take about one day to generate these files for you. Yet another option would be to use WODselect to download the single-cast netCDF files, which have a similar multidimensional structure to the Argo single-cycle files. (The reason we did not go with the single-cast files as the official format is simply that our NCEI archive cannot handle the large number of files, more than 15 million, and the THREDDS server also has trouble with large file sets.)

Regards, Tim

BoyerWOD commented 6 years ago

Thomas,

Sorry, I mistakenly noted that you could get IQuOD in single cast netCDF file form through WODselect. We have World Ocean Database (WOD) single cast netCDF files, but not IQuOD - there are no uncertainties or intelligent metadata in these WOD files. Thinking about it, I could make these IQuOD files by adding the uncertainties and intelligent metadata, but that is not how these files are configured right now.

Tim

Thomas-Moore-Creative commented 6 years ago

Tim, first thanks for the detailed reply.

I get that this is not a trivial matter with many considerations. I'm very motivated to work with multicast netcdf datasets - less so with the many millions of files required to grab all ocean in-situ observations.

Working with ragged arrays is new to me, which is likely why I have yet to work out a confident and computationally efficient method to unpack these multicast netCDF files. Obviously future code from @BillMills will be welcome.

For now, following your Fortran code might be helpful. Is it the code available here? https://www.nodc.noaa.gov/OC5/WOD13/wod_programs.html

BoyerWOD commented 6 years ago

The FORTRAN code should be on the programs page, but it is not; I just have it internally for checks. It will take me about one week to get the code usable publicly.

Tim

Thomas-Moore-Creative commented 6 years ago

That's great, Tim. While we wait for possible Python code from @BillMills to deal with the aggregated NC files, looking at this Fortran code might help us work on a Python tool in the shorter term. All your assistance is appreciated.

BoyerWOD commented 5 years ago

Thomas,

Just to make sure this is posted in all our communication systems: Glad you can read the netCDF ragged array files. It may not be useful anymore, but please find a FORTRAN program for reading (and displaying) WOD and IQuOD netCDF ragged array files:

ftp://ftp.nodc.noaa.gov/pub/WOD/SELECT/wod_ncragged.tar.gz

Inside you will find the following:

gfortran_nc: run this to compile the program with the GNU gfortran compiler. You may need to change the netCDF include pathname (NCINC).

Run it as "./gfortran_nc wod_nc" in the same directory as the C subroutines and the .x subdirectory.

isitreturn.c, isitnan.c: two C functions for character handling within the netCDF file. I simply don't know how to do it in FORTRAN.

wod_nc.x: directory containing all the FORTRAN routines.

gfortran_nc catenates these files together into file wod_nc.f, then compiles with the C subroutines, leaving wod_nc.exe.

You can then run wod_nc.exe. This is all for Linux; it compiles on both Red Hat and CentOS. That's all I can try here.

But the main purpose of the FORTRAN program is not to run, but to help you understand how the netCDF files are structured and how they can be read. The main program is wod_nc.x/wod_nc.f. I am not a programmer by training, but I tried to comment the program and subroutines sufficiently. Also note that I have not yet added the subroutine for reading plankton data. If you have any questions, please let me know.