NCAS-CMS / cf-python

A CF-compliant Earth Science data analysis library
http://ncas-cms.github.io/cf-python
MIT License

Means to determine a read-in field's source file format #109

Open sadielbartholomew opened 4 years ago

sadielbartholomew commented 4 years ago

We support specifying the netCDF file format to write out to, but as far as I can see (I may be missing something obvious) there is no way in cf-python to determine, for read-in fields, the data model underlying the source file: for example, the type of netCDF (classic, 64-bit offset, netCDF-4, CFA variants, etc.), or the proprietary .pp and .ff formats with any variants. That information does not appear to be encoded in the metadata.

I think users may be interested in this information, for example to know immediately from the format whether there may be groups, without having to inspect the group structure. So, similar to the source method on a field, which gives detail on how the original data was produced, I propose a source_fmt or source_storage (or similar) method to return that information, assuming it is not overly difficult to determine when the file is read in.

A utility to report the file format of fields read in from a given file could be useful, similar to what shell-command inspection of the first four bytes of the file and/or the ncdump -k option provide (based on the netCDF docs FAQ section 'How can I tell which format a netCDF file uses?'):

$ ncdump -k classic_file.nc
classic
$ ncdump -k netCDF4_file.nc
netCDF-4

For example, f.source_storage could provide the format as a named string, such as those listed as fmt for cf.write in the case of netCDF.
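
To make the "first four bytes" idea concrete, here is a minimal Python sketch, not an existing cf-python feature. The function name sniff_netcdf_format and the NETCDF_MAGIC table are hypothetical. The returned strings are chosen to look like the fmt names used for netCDF output, which is an assumption about what such a method should return. Note that the magic number alone cannot separate netCDF-4 from netCDF-4 classic model, which ncdump -k can.

```python
import os
import tempfile

# Leading bytes of netCDF files, mapped to illustrative format names.
# The names are an assumption made for this sketch.
NETCDF_MAGIC = {
    b"CDF\x01": "NETCDF3_CLASSIC",
    b"CDF\x02": "NETCDF3_64BIT_OFFSET",
    b"CDF\x05": "NETCDF3_64BIT_DATA",
    # HDF5 signature: any HDF5 file matches, and the bytes cannot
    # distinguish NETCDF4 from NETCDF4_CLASSIC.
    b"\x89HDF": "NETCDF4",
}


def sniff_netcdf_format(path):
    """Guess a format name from the first four bytes, or return None."""
    with open(path, "rb") as f:
        magic = f.read(4)
    return NETCDF_MAGIC.get(magic)


# Demo on a fake file that holds only a classic-format magic number.
with tempfile.TemporaryDirectory() as tmpdir:
    demo_path = os.path.join(tmpdir, "classic_file.nc")
    with open(demo_path, "wb") as f:
        f.write(b"CDF\x01" + b"\x00" * 28)
    demo_format = sniff_netcdf_format(demo_path)

print(demo_format)
```

A lookup like this is cheap because it never parses the file, but it only covers netCDF; PP and fields files would need a separate check.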

davidhassell commented 4 years ago

Good idea. Getting the netCDF information is straightforward, as it's reported by the netCDF4 library. I'm not sure if this information is so readily available for PP/UM files, but a code tweak would make it so if not.

However, we have to be careful, as aggregation can combine fields from files with different data models. This is why the get_filenames methods return a set of file names rather than a single string. Perhaps this method could be modified to return a dictionary whose keys are the file names with corresponding values of the file data model?
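
A minimal sketch of that dictionary shape, under stated assumptions: file_data_models is a hypothetical helper, not the current get_filenames behaviour, and the file names and the "PP" label are made up for illustration. The detect argument stands in for whatever records the data model at read time.

```python
def file_data_models(filenames, detect):
    """Map each file name to the data model reported by ``detect``."""
    return {name: detect(name) for name in sorted(filenames)}


# Stand-in for information recorded when each file was read (hypothetical).
recorded = {"ocean_1990.nc": "NETCDF4", "ocean_1991.pp": "PP"}

# An aggregated field may draw on files with different data models.
models = file_data_models({"ocean_1990.nc", "ocean_1991.pp"}, recorded.get)
is_mixed = len(set(models.values())) > 1

print(models)
print(is_mixed)
```

Keeping the file name as the key means a field aggregated from mixed sources still reports one data model per file, rather than a single value that could be wrong for some of them.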

sadielbartholomew commented 4 years ago

Thanks for the insight. It sounds fairly straightforward, in that case.

> However, we have to be careful, as aggregation can combine fields from files with different data models. This is why the get_filenames methods return a set of file names rather than a single string. Perhaps this method could be modified to return a dictionary whose keys are the file names with corresponding values of the file data model?

Yes, good thinking, that sounds like the most Pythonic way to manage the aggregation context.