ERDDAP / erddap

ERDDAP is a scientific data server that gives users a simple, consistent way to download subsets of gridded and tabular scientific datasets in common file formats and make graphs and maps. ERDDAP is a Free and Open Source (Apache and Apache-like) Java Servlet from NOAA NMFS SWFSC Environmental Research Division (ERD).

Datatypes changed when using .nc download #89

Closed callumrollo closed 3 months ago

callumrollo commented 1 year ago

It appears that variables stored in ERDDAP as integers of various sizes are converted to float32 when exported as netCDF.

Take this example dataset

https://erddap.observations.voiceoftheocean.org/erddap/tabledap/nrt_SEA068_M27.html

The variable conductivity_qc is a QC flag whose value can only be an integer between 1 and 9. We have therefore specified it within ERDDAP as a byte, an 8-bit integer:

  conductivity_qc {
    Byte _FillValue 127;
    String _Unsigned "false";
    Byte actual_range 1, 4;
    Float64 colorBarMaximum 10.0;
    Float64 colorBarMinimum 0.0;
    String comment "Quality control flags from IOOS QC QARTOD https://github.com/ioos/ioos_qc Version: 2.1.0. Using config: [<Call stream_id=conductivity function=qartod.gross_range_test(suspect_span=[6, 42], fail_span=[3, 45])>, <Call stream_id=conductivity function=qartod.location_test(bbox=[10, 50, 25, 60])>].  Threshold values from EuroGOOS DATA-MEQ Working Group (2010) Recommendations for in-situ data Near Real Time Quality Control [Version 1.2]. EuroGOOS, 23pp. DOI https://dx.doi.org/10.25607/OBP-214.";
    String flag_meanings "GOOD, UNKNOWN, SUSPECT, FAIL, MISSING";
    Float64 flag_values 1, 2, 3, 4, 9;
    String ioos_category "Quality";
    String ioos_qc_module "qartod";
    String long_name "quality control flags for water conductivity";
    String quality_control_conventions "IOOS QARTOD standard flags";
    Float64 quality_control_set 1;
    String standard_name "sea_water_electrical_conductivity_flag";
    Byte valid_max 9;
    Byte valid_min 1;
  }

However, when this dataset is downloaded as a netCDF file, this variable and all others appear to have been converted to float32. I believe this produces a substantial and avoidable increase in download size.

This issue does not appear to occur with export as .csv, where the integers are exported as integers, not as floating-point values.

BobSimons commented 1 year ago

That shouldn't happen. But as you'll see below, I can't reproduce the problem.

I think the first step to solve this problem is to see what data type is used for that variable in that dataset. If I go to https://erddap.observations.voiceoftheocean.org/erddap/tabledap/nrt_SEA068_M27.html and hover over the (?) icon by conductivity_qc, I see that the data type is indeed "Byte". Good.

So then I made a request for a .nc file with this URL:

https://erddap.observations.voiceoftheocean.org/erddap/tabledap/nrt_SEA068_M27.nc?latitude%2Clongitude%2Ctime%2Cconductivity%2Cconductivity_qc&time%3E=2022-07-31T00%3A00%3A00Z&time%3C=2022-07-31T03%3A51%3A42Z

I downloaded that file and renamed it voto.nc.
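For readers who want to build a similar request themselves, here is a minimal sketch of how the percent-encoded URL in this comment can be assembled with the Python standard library. The helper name `tabledap_url` and its parameters are hypothetical, chosen only for illustration; the server, dataset ID, variables, and time constraints are the ones from the request in this thread.

```python
from urllib.parse import quote


def tabledap_url(server, dataset_id, variables, constraints, file_type=".nc"):
    """Assemble a percent-encoded tabledap request URL (illustrative helper)."""
    # Variable names are comma-separated; the comma itself is encoded as %2C.
    query = quote(",".join(variables), safe="")
    # Each constraint is appended with '&'. The '=' is kept literal, while
    # '>', '<' and ':' are percent-encoded.
    for constraint in constraints:
        query += "&" + quote(constraint, safe="=")
    return f"{server}/tabledap/{dataset_id}{file_type}?{query}"


url = tabledap_url(
    "https://erddap.observations.voiceoftheocean.org/erddap",
    "nrt_SEA068_M27",
    ["latitude", "longitude", "time", "conductivity", "conductivity_qc"],
    ["time>=2022-07-31T00:00:00Z", "time<=2022-07-31T03:51:42Z"],
)
print(url)
```

Changing `file_type` to `.csv` would request the same subset in CSV form, which is the other export format mentioned in this thread.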

I then used ncdump -h to see what is in the file. It showed (just the part for conductivity_qc):

  byte conductivity_qc(row=243);
    :_FillValue = 127B; // byte
    :actual_range = 1B, 1B; // byte
    :colorBarMaximum = 10.0; // double
    :colorBarMinimum = 0.0; // double
    :comment = "Quality control flags from IOOS QC QARTOD https://github.com/ioos/ioos_qc Version: 2.1.0. Using config: [<Call stream_id=conductivity function=qartod.gross_range_test(suspect_span=[6, 42], fail_span=[3, 45])>, <Call stream_id=conductivity function=qartod.location_test(bbox=[10, 50, 25, 60])>]. Threshold values from EuroGOOS DATA-MEQ Working Group (2010) Recommendations for in-situ data Near Real Time Quality Control [Version 1.2]. EuroGOOS, 23pp. DOI https://dx.doi.org/10.25607/OBP-214.";
    :flag_meanings = "GOOD, UNKNOWN, SUSPECT, FAIL, MISSING";
    :flag_values = 1.0, 2.0, 3.0, 4.0, 9.0; // double
    :ioos_category = "Quality";
    :ioos_qc_module = "qartod";
    :long_name = "quality control flags for water conductivity";
    :quality_control_conventions = "IOOS QARTOD standard flags";
    :quality_control_set = 1.0; // double
    :standard_name = "sea_water_electrical_conductivity_flag";
    :valid_max = 9B; // byte
    :valid_min = 1B; // byte

So the variable is still stored as a byte in the .nc file that I downloaded.

So I have no explanation for why you see float32. Are you perhaps looking at the conductivity (not qc) variable? Did you use ncdump to determine the data type for the variable in the .nc file? (If not, why do you think it is a float32?) Could you please tell me the exact URL you used to download the data (so I can reproduce the problem)?

Best wishes.

callumrollo commented 1 year ago

Hi Bob,

Thanks for checking this. You're right, the issue lies in Python xarray's treatment of integers and fill values, not in ERDDAP. I'll use ncdump to check downloaded files in future. I will migrate this issue to erddapy, as the default behavior of xarray is converting integer arrays to float32 at read.
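For anyone landing here with the same symptom, the following is a minimal sketch of the xarray behavior described above, assuming xarray and NumPy are installed. It does not read the downloaded file. Instead it builds a small in-memory stand-in (the values are made up for illustration) with the same int8 type and `_FillValue` of 127 as conductivity_qc, then compares default CF decoding against decoding with `mask_and_scale=False`.

```python
import numpy as np
import xarray as xr

# Hypothetical stand-in for the downloaded variable: int8 flags carrying the
# same _FillValue (127) that the dataset metadata declares.
raw = xr.Dataset(
    {
        "conductivity_qc": (
            "row",
            np.array([1, 1, 4, 127], dtype="int8"),
            {"_FillValue": np.int8(127)},
        )
    }
)

# Default CF decoding replaces _FillValue with NaN, which requires a float dtype.
decoded = xr.decode_cf(raw)
# With masking disabled, the stored integer type is left untouched.
unmasked = xr.decode_cf(raw, mask_and_scale=False)

print(raw["conductivity_qc"].dtype)       # dtype as stored
print(decoded["conductivity_qc"].dtype)   # dtype after default decoding
print(unmasked["conductivity_qc"].dtype)  # dtype with mask_and_scale=False
```

`xr.open_dataset` accepts the same `mask_and_scale=False` argument, so a file downloaded from ERDDAP can be opened without the integer-to-float promotion. Fill values then appear as the raw 127 rather than NaN and need to be handled explicitly.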

callumrollo commented 1 year ago

@BobSimons I'm getting a security policy rejection when emailing your NOAA address: "550 5.7.1 unrecognized address". It looks like I may need to be allowlisted. You can contact me at c.rollo@outlook.com

ocefpaf commented 1 year ago

> I will migrate this issue to erddapy, as the default behavior of xarray is converting integer arrays to float32 at read.

I saw that problem a few years back but it was gone when the data provider updated their ERDDAP server. I guess that this is a new problem. However, the issue is with xarray and/or maybe the libnetcdf version. So there isn't much we can do in erddapy b/c we just read the downloaded file directly with NetCDF4DataStore.

BobSimons commented 1 year ago

@ocefpaf, can you report the bug to the maintainers of NetCDF4DataStore?

ChrisJohnNOAA commented 3 months ago

Since this is not an ERDDAP issue, I'm going to close the issue here.

I think the bug has been tracked elsewhere, but if it's still an issue please raise with either erddapy or NetCDF4DataStore.