IAMconsortium / concordia

Wrong variable type for sector? #57

Closed TimotheeBrgs closed 2 months ago

TimotheeBrgs commented 3 months ago

In the previous version (2023-12-08), the variable sector had the type int64, and that worked fine. Now, in version 2024-04-25, sector has the type string, although the variable contains only integer values from 0 to 13, if I'm not mistaken.

Below is the CDO error message I get. I guess that sector has the variable ID 3.

cdf_get_var_double: ncid=65536 varid=3 val[0]=0.000000
Error (cdf_get_var_double): NetCDF: Not a valid data type or _FillValue type mismatch

CDO is fine if I pre-process files with sector dimensions using the command below:

ncap2 -O -s 'sector=int64(sector)' your_file.nc output.nc

coroa commented 3 months ago

Hmm, hi @TimotheeBrgs ,

After a discussion with @etiennesky in #32 (refer especially to my summary in https://github.com/IAMconsortium/concordia/issues/32#issuecomment-1952600975), I decided to ensure the consistent sector order specified there, but to switch the sector variable to contain the sector labels instead of the integers.

If I had not done so, the sector variable would by definition always just contain the integer values [0, 1, 2, 3, ..., 13] (for CO2) in this exact order, which is nonsensical (i.e. we would not need the variable then).

Can you explain why you read in the sector variable at all?

Or give me a cdo invocation that you use, which shows this problem?

etiennesky commented 3 months ago

Hi, for compatibility with CMIP6 (and maybe CMIP7) scenarios, I think we should keep an index-based approach for the sectors. Most ESMs are written in Fortran, and it's easier for them to use the integer index approach. For some emissions the sectors ARE important, as they indicate the injection height of the species.

coroa commented 3 months ago

@etiennesky How is it not index based? That is what I don't understand.

coroa commented 3 months ago

Ok, let me explain in more words. If you want to access the emissions in the industry sector, you know that it is index 2, so you access the equivalent of:

#include <netcdf.h>
size_t startp[] = { 0, 2, 0, 0 };        /* second dimension is sector; index 2 = Industrial */
size_t countp[] = { 10, 1, 180, 180 };
int varid;
nc_inq_varid(ncid, "CO2_em_anthro", &varid);  /* look up the variable ID by name */
float data[10][1][180][180];
/* read all the float32 values of the industry sector into data */
nc_get_vara_float(ncid, varid, startp, countp, &data[0][0][0][0]);

I.e. you refer to the data by index anyway; the sector variable does not matter for this, so just ignore it.
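For readers who work in Python rather than C, the same start/count request can be written as a tuple of slices. This is only a minimal sketch: `hyperslab_to_slices` is a hypothetical helper, not part of any netCDF library, and it assumes the array-like object you index accepts ordinary Python slices.

```python
def hyperslab_to_slices(start, count):
    """Translate netCDF-style start/count vectors into a tuple of slices."""
    if len(start) != len(count):
        raise ValueError("start and count must have the same length")
    return tuple(slice(s, s + c) for s, c in zip(start, count))


# Same request as in the C snippet above: one entry (index 2) along the
# second dimension, and the full extent given there for the other dimensions.
slices = hyperslab_to_slices([0, 2, 0, 0], [10, 1, 180, 180])
print(slices[1])  # slice(2, 3, None)
```

The point is the same as in the C version: the selection is purely positional, so the contents of the sector variable never enter into it.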

TimotheeBrgs commented 3 months ago

Hi @coroa

> Can you explain why you read in the sector variable at all?

It seems that CDO checks the overall consistency of the file before processing it.

> Or give me a cdo invocation that you use, which shows this problem?

cdo -O -remapcon,$target_grid a_surface_emission_file.nc output.nc

Returns:

cdf_get_var_double: ncid=65536 varid=3 val[0]=0.000000
Error (cdf_get_var_double): NetCDF: Not a valid data type or _FillValue type mismatch

My interpretation is that the relevant part of the message here is "Not a valid data type" and not "_FillValue type mismatch".

coroa commented 3 months ago

Thanks, @TimotheeBrgs. Note that string-valued coordinates are valid under the CF conventions (https://cfconventions.org/Data/cf-conventions/cf-conventions-1.11/cf-conventions.html#_labels_and_alternative_coordinates). But I'll investigate what cdo expects and why it fails.

Can you give me an allowed value (your value ideally) for the target grid variable? Thanks

coroa commented 3 months ago

Ok, @TimotheeBrgs, @etiennesky, it turns out that CDO is not yet able to understand variable-length strings (which were only added to the CF conventions in CF 1.9, September 2021), but it works fine for me with char-encoded strings, which I can also generate quite easily.

I made a new dummy test file to illustrate and uploaded it to the FTP server: CO2-em-anthro_input4MIPs_emissions_RESCUE_IIASA-PIK-REMIND-MAgPIE-3.2.0-4.7.1-RESCUE-Tier1-Direct-2024-04-25-PkBudg500-OAE-on-2024-06-04_gn_201501-210012.nc (note today's date) in /forcings/emissions/2024-04-25 (the previous directory).

Can you test whether this progresses fine for you, too, @TimotheeBrgs ?

Thanks

TimotheeBrgs commented 3 months ago

Hi @coroa,

> Can you give me an allowed value (your value ideally) for the target grid variable?

Not sure if you still need it now that you have investigated this case further, but here is the target grid: Download link (60 MB, link expires on Friday, netCDF input file of emissions from another SSP).

Regarding the dummy test file:

> Can you test whether this progresses fine for you, too, @TimotheeBrgs ?

The situation improves dramatically, thank you very much for this. CDO now agrees to remap the file, but it displays a warning message:

Warning (cdf_set_dimtype): Could not assign all character coordinates to data variable

However, I am still a bit puzzled by the workaround that you suggest.

Emissions sectors from CMIP6 input4MIPs are defined as follows:

dimensions:
    sector = 9 ;
variables:
    double sector(sector) ;
        sector:long_name = "sector" ;
        sector:bounds = "sector_bnds" ;
        sector:ids = "0: Agriculture; 1: Energy; 2: Industrial; 3: Transportation; 4: Residential, Commercial, Other; 5: Solvents production and application; 6: Waste; 7: International Shipping; 8: Negative CO2 Emissions" ;

Whereas the dummy test file reads:

dimensions:
    sector = 14 ;
    string35 = 35 ;
variables:
    char sector(sector, string35) ;
        sector:long_name = "sector" ;
        sector:ids = "0: Agriculture; 1: Energy; 2: Industrial; 3: Transportation; 4: Residential, Commercial, Other; 5: Solvents Production and Application; 6: Waste; 7: International Shipping; 8: CDR Afforestation; 9: CDR BECCS; 10: CDR DACCS; 11: CDR EW; 12: CDR Industry; 13: CDR OAE" ;
        sector:_Encoding = "utf-8" ;

The CMIP6 input4MIPs structure seems cleaner with a single dimension and double for sector type, don't you think?

coroa commented 3 months ago

Thanks for testing. Yes, I am also seeing the same warning:

Warning (cdf_set_dimtype): Could not assign all character coordinates to data variable

but so far I have not understood what it means. I confirmed that the transformed file contains all dimensions, coordinates and variables, including the character coordinate sector.

The second dimension string35 is a by-product of how fixed-length string variables were stored before netCDF-4 came along and also allowed variable-length strings (which were then introduced into the CF conventions in CF 1.9).
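To make the role of string35 concrete, here is a small pure-Python sketch of the fixed-width layout, using the 14 sector labels from the test file. The helper names are hypothetical, and padding with NUL bytes is an assumption (some writers pad with spaces instead); it is not a description of how the file was actually written.

```python
SECTOR_LABELS = [
    "Agriculture", "Energy", "Industrial", "Transportation",
    "Residential, Commercial, Other", "Solvents Production and Application",
    "Waste", "International Shipping", "CDR Afforestation", "CDR BECCS",
    "CDR DACCS", "CDR EW", "CDR Industry", "CDR OAE",
]


def to_char_rows(labels, encoding="utf-8"):
    """Encode labels as equal-width byte rows, one row per label."""
    encoded = [label.encode(encoding) for label in labels]
    width = max(len(raw) for raw in encoded)
    # Right-pad with NUL bytes (an assumption, see the note above).
    return [raw.ljust(width, b"\x00") for raw in encoded], width


def from_char_rows(rows, encoding="utf-8"):
    """Strip the padding again and decode each row back to a string."""
    return [row.rstrip(b"\x00").decode(encoding) for row in rows]


rows, width = to_char_rows(SECTOR_LABELS)
print(len(rows), width)  # 14 35
```

The width comes out as 35 because the longest label, "Solvents Production and Application", is 35 bytes long, which matches the `char sector(sector, string35)` declaration.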

For libraries that understand character-based coordinates, the new structure lets you just use the name instead, i.e.:

import xarray as xr

ds = xr.open_dataset("....nc")
# select by sector label instead of by integer index
da = ds["CO2_em_anthro"].sel(sector="Agriculture")

For the others, you can still use the index-based selection:

cdo sellevidx,1 ....nc output.nc

In principle, cdo also recognizes the string-based levels:

❯ cdo zaxisdes CO2-em-anthro_input4MIPs_emissions_RESCUE_IIASA-PIK-REMIND-MAgPIE-3.2.0-4.7.1-RESCUE-Tier1-Direct-2024-04-25-PkBudg500-OAE-on-2024-04-25_gn_201501-210012.nc
cdi  warning (cdf_set_dimtype): Could not assign all character coordinates to data variable!
#
# zaxisID 1
#
zaxistype = area_type
size      = 14
name      = sector
longname  = "sector"
levels    =
     [ 0] = Agriculture
     [ 1] = Energy
     [ 2] = Industrial
     [ 3] = Transportation
     [ 4] = Residential, Commercial, Other
     [ 5] = Solvents Production and Application
     [ 6] = Waste
     [ 7] = International Shipping
     [ 8] = CDR Afforestation
     [ 9] = CDR BECCS
     [10] = CDR DACCS
     [11] = CDR EW
     [12] = CDR Industry
     [13] = CDR OAE
ids = "0: Agriculture; 1: Energy; 2: Industrial; 3: Transportation; 4: Residential, Commercial, Other; 5: Solvents Production and Application; 6: Waste; 7: International Shipping; 8: CDR Afforestation; 9: CDR BECCS; 10: CDR DACCS; 11: CDR EW; 12: CDR Industry; 13: CDR OAE"
_Encoding = "utf-8"
cdo    zaxisdes: Processed 1 variable [0.02s 28MB]

but I don't see an operator for selecting based on a string.
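Where a tool only offers index-based selection, the label can be turned into an index first. The following is a minimal sketch that parses the `ids` attribute shown above; `parse_sector_ids` is a hypothetical helper, and it assumes entries are separated by ";" and that labels contain neither ";" nor ":".

```python
IDS = (
    "0: Agriculture; 1: Energy; 2: Industrial; 3: Transportation; "
    "4: Residential, Commercial, Other; "
    "5: Solvents Production and Application; 6: Waste; "
    "7: International Shipping; 8: CDR Afforestation; 9: CDR BECCS; "
    "10: CDR DACCS; 11: CDR EW; 12: CDR Industry; 13: CDR OAE"
)


def parse_sector_ids(ids):
    """Map each sector label to the index given in the ids attribute."""
    mapping = {}
    for entry in ids.split(";"):
        # Split only on the first colon; labels may contain commas.
        index, label = entry.split(":", 1)
        mapping[label.strip()] = int(index)
    return mapping


sector_index = parse_sector_ids(IDS)
print(sector_index["CDR BECCS"])  # 9
```

The indices here are the zero-based ones from the attribute, as used for the hyperslab start in the C snippet earlier in the thread. Whether a given command-line operator counts levels from 0 or from 1 should be checked in that tool's documentation before passing the number on.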

Finally, regarding the old CMIP6 structure, note that it was not even consistent. CO2_em_anthro had:

dimensions:
    sector = 9 ;

variables:
    double sector(sector) ;
        sector:long_name = "sector" ;
        sector:bounds = "sector_bnds" ;
        sector:ids = "0: Agriculture; 1: Energy; 2: Industrial; 3: Transportation; 4: Residential, Commercial, Other; 5: Solvents production and application; 6: Waste; 7: International Shipping; 8: Negative CO2 Emissions" ;

data:
 sector = 0, 1, 2, 3, 4, 5, 6, 7, 8 ;

while all the others were:

dimensions:
    sector = 8 ;

variables:
    int sector(sector) ;
        sector:long_name = "sector" ;
        sector:bounds = "sector_bnds" ;
        sector:ids = "0: Agriculture; 1: Energy; 2: Industrial; 3: Transportation; 4: Residential, Commercial, Other; 5: Solvents production and application; 6: Waste; 7: International Shipping" ;

data:
 sector = 0, 1, 2, 3, 4, 5, 6, 7 ;