IAMconsortium / concordia

Apache License 2.0

Issues with emissions files - round 2 #32

Closed dariak-bsc closed 5 months ago

dariak-bsc commented 7 months ago

Hi all,
Errors & warnings found with the file checker - preliminary results:

  1. (Warning) The attribute missing_value is missing. In the reference files we have both _FillValue and missing_value, e.g.
    float BC_em_AIR_anthro(time, level, lat, lon) ;
        BC_em_AIR_anthro:_FillValue = 1.e+20f ;
        BC_em_AIR_anthro:missing_value = 1.e+20f ;
                ...

    but only _FillValue is present in the checked files.

  2. (Error) There are inconsistencies with the dimension sector: in the reference files the length of this dimension is either 9 (for CO2) or 8 (for other gas species)
    BC-em-anthro_input4MIPs_emissions_ScenarioMIP_IAMC-AIM-ssp370-1-1_gn_201501-210012.nc:
    dimensions: sector = 8 ;
    sector:ids = "0: Agriculture; 1: Energy; 2: Industrial; 3: Transportation; 4: Residential, Commercial, Other; 5: Solvents production and application; 6: Waste; 7: International Shipping" ;

    and

    CO2-em-anthro_input4MIPs_emissions_ScenarioMIP_IAMC-IMAGE-ssp126-1-1_gn_201501-210012.nc
    dimensions: sector = 9 ; 
    sector:ids = "0: Agriculture; 1: Energy; 2: Industrial; 3: Transportation; 4: Residential, Commercial, Other; 5: Solvents production and application; 6: Waste; 7: International Shipping; 8: Negative CO2 Emissions" ;

    but in the checked files we have a different length of the dimension sector: 6 for CO2 and 5 for other species, and the order of ids is not consistent:

    
    CO2-em-anthro_input4MIPs_emissions_RESCUE_IIASA-PIK-REMIND-MAgPIE-3.2.0-4.7.0-RESCUE-Tier1-Direct-2023-12-13-EocBudg1150-OAE-off-2023-12-08_gn_202001-210012.nc:
    dimensions: sector = 6 ;
    sector:id = "2: Industrial Sector; 4: Residential Commercial Other; 5: Solvents Production and Application; 3: Transportation Sector; 6: Waste; 7: International Shipping" ;

    BC-em-anthro_input4MIPs_emissions_RESCUE_IIASA-PIK-REMIND-MAgPIE-3.2.0-4.7.0-RESCUE-Tier1-Direct-2023-12-13-EocBudg1150-OAE-off-2023-12-08_gn_202001-210012.nc:
    dimensions: sector = 5 ;
    sector:id = "2: Industrial Sector; 4: Residential Commercial Other; 3: Transportation Sector; 6: Waste; 7: International Shipping" ;


Additionally, in the reference files it is `sector:ids` (in plural) and in the checked files it is `sector:id`.
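
A minimal sketch of how this comparison could be automated, assuming the attribute strings have already been read from the file headers (for example from `ncdump -h` output). The helper names `parse_sector_ids` and `compare_sectors` are hypothetical and not part of the file checker:

```python
from __future__ import annotations


def parse_sector_ids(attr: str) -> list[tuple[int, str]]:
    """Parse a 'sector:ids'-style string into (id, name) pairs, in file order."""
    pairs = []
    for entry in attr.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        # Split on the first colon only; sector names may contain commas.
        sector_id, name = entry.split(":", 1)
        pairs.append((int(sector_id), name.strip()))
    return pairs


def compare_sectors(reference: str, checked: str) -> list[str]:
    """Return human-readable problems found in the checked attribute."""
    ref = parse_sector_ids(reference)
    chk = parse_sector_ids(checked)
    problems = []
    if len(chk) != len(ref):
        problems.append(f"length {len(chk)} != reference length {len(ref)}")
    ids = [i for i, _ in chk]
    if ids != sorted(ids):
        problems.append(f"ids not in ascending order: {ids}")
    missing = sorted({i for i, _ in ref} - set(ids))
    if missing:
        problems.append(f"missing ids: {missing}")
    return problems


# Attribute strings copied from the BC files quoted above.
REFERENCE = (
    "0: Agriculture; 1: Energy; 2: Industrial; 3: Transportation; "
    "4: Residential, Commercial, Other; "
    "5: Solvents production and application; 6: Waste; "
    "7: International Shipping"
)
CHECKED = (
    "2: Industrial Sector; 4: Residential Commercial Other; "
    "3: Transportation Sector; 6: Waste; 7: International Shipping"
)

problems = compare_sectors(REFERENCE, CHECKED)
for problem in problems:
    print(problem)
```

For the BC file this reports the length mismatch (5 instead of 8), the non-ascending order, and the missing ids 0, 1 and 5.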

As discussed with @etiennesky, it would probably be better to keep the same length of the `sector` dimension as in the reference files, and keep the `sector:id` (or `sector:ids`?) in ascending order from 1 to 8 (or 9).

coroa commented 7 months ago

Hi @dariak-bsc , hi @etiennesky,

thanks for submitting these results of your checker. Looks like we fixed most of the problems of round 1 then 🥳 🥳 (that is good news).

The remaining ones should be straightforward to address as well. There is a bit that needs discussion.

Re 1.

According to the CF standard, missing_value was already deprecated before the first version of the CF Conventions (1.0), i.e. before 2003. More than 20 years after its deprecation, I think it is fine not to have that attribute.

Re 2.:

Oh, sure,

But thanks again for these submissions and the wonderful news!

coroa commented 7 months ago

I am also now watching the full repository, so you can expect that i'll respond quicker to your follow-up comments! (except that i am on vacation the coming two weeks 🤦 )

coroa commented 7 months ago

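Some more feedback for the sector coordinate of the CO2_em_anthro variable:

  • The Negative emissions sector for CO2 is superseded by the new CO2_em_removal variable, where those are split into the sectors CDR DACCS, CDR OAE and CDR Industry

This means that all *_em_anthro files will consistently have length 8 and the sectors: "Agriculture", "Energy", "Industrial", "Transportation", "Residential, Commercial, Other", "Solvents production and application", "Waste", "International Shipping".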

coroa commented 7 months ago

sector:ids turns out to be consistently named in the input4mips files, but the CF Conventions do not contain any such coordinate. Instead, section 6.1 (Labels) proposes the use of string-valued coordinates (which Matt and I would also strongly prefer).

This would mean that the sector coordinate which has now the integer values 0 to 7 would be replaced everywhere by a coordinate with the labels sector = "Agriculture", "Energy", "Industrial", "Transportation", "Residential, Commercial, Other", "Solvents production and application", "Waste", "International Shipping".

The effect from within xarray, for example, would be that you can plot transportation emissions in "July 2050" by: `ds.sel(sector="Transportation", time="2050-07").plot()`

@etiennesky @dariak-bsc What do you think?

etiennesky commented 7 months ago

Re 1.

According to the CF standard, missing_value was already deprecated before the first version of the CF Conventions (1.0), i.e. before 2003. More than 20 years after its deprecation, I think it is fine not to have that attribute.

Yes, you are right: _FillValue is the only required metadata and missing_value is deprecated. The CEDS data contain both, but it's fine to only have _FillValue.

etiennesky commented 7 months ago

sector:ids turns out to be consistently named in the input4mips files, but the CF Conventions do not contain any such coordinate. Instead, section 6.1 (Labels) proposes the use of string-valued coordinates (which Matt and I would also strongly prefer).

This would mean that the sector coordinate which has now the integer values 0 to 7 would be replaced everywhere by a coordinate with the labels sector = "Agriculture", "Energy", "Industrial", "Transportation", "Residential, Commercial, Other", "Solvents production and application", "Waste", "International Shipping".

The effect from within xarray, for example, would be that you can plot transportation emissions in "July 2050" by: `ds.sel(sector="Transportation", time="2050-07").plot()`

@etiennesky @dariak-bsc What do you think?

IMHO the ESMs are written in Fortran and have been programmed to use the hard-coded ids as integers, it would be best to keep things compatible.

etiennesky commented 7 months ago

Some more feedback for the sector coordinate of the CO2_em_anthro variable:

  • The Negative emissions sector for CO2 is superseded by the new CO2_em_removal variable, where those are split into the sectors CDR DACCS, CDR OAE and CDR Industry

I think it was our intention since the beginning to provide both the total negative emissions as a single sector in the main CO2-em-anthro files. And then this total value would be split up among several new sectors in the new CO2_em_removal.

In summary, I would like to have both so users can choose.

etiennesky commented 7 months ago

IMHO the ESMs are written in Fortran and have been programmed to use the hard-coded ids as integers, it would be best to keep things compatible.

For some context, while it is easy in Python to access a string-indexed dictionary, it is harder to do in Fortran.

coroa commented 6 months ago

IMHO the ESMs are written in Fortran and have been programmed to use the hard-coded ids as integers, it would be best to keep things compatible

Ok, i reviewed the netcdf C library, which i hope is very close in use to the fortran libs. Basically, each variable (including coordinate variables) has an integer ID that one looks up with nc_inq_varid; you then read from it with nc_get_vara_double/nc_get_vara_float by passing start and count integer arrays that say where to start and how many values to read along each dimension.

If that is all they use, this would mean:

  1. We need to make sure the dimension ordering is the same: Input4MIPS had: sector, time, lat, lon for CO2 and time, sector, lat, lon everywhere else. I did go for the latter everywhere (consistency).
  2. We need to make sure to use the same datatype. Input4MIPS had floats for the main variable (we currently have double)
  3. The coordinates we specify might not matter at all (ie. if they never read the sector variable associated with the sector dimension and assume a hard-coded order instead, we can actually use string coordinates but have to make sure that they match the old order exactly). This scenario is quite likely since the datatype of the sector variable for CO2 was double and for the other gases int. BUT, note that i am also fine with sticking to the old solution.
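
A minimal sketch of points 1 and 2 as an automated check, assuming the dimension order and data type of the main variable have already been read from a file header. `VarLayout` and `layout_problems` are hypothetical names, and the type names follow the netCDF CDL spelling (`float`, `double`):

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class VarLayout:
    """Dimension order and data type of a main emissions variable."""

    dims: tuple[str, ...]
    dtype: str


# Target layout from points 1 and 2: time first, float values.
TARGET = VarLayout(dims=("time", "sector", "lat", "lon"), dtype="float")


def layout_problems(actual: VarLayout, target: VarLayout = TARGET) -> list[str]:
    """List the ways in which `actual` differs from `target`."""
    problems = []
    if actual.dims != target.dims:
        problems.append(f"dimension order {actual.dims} != {target.dims}")
    if actual.dtype != target.dtype:
        problems.append(f"dtype {actual.dtype} != {target.dtype}")
    return problems


# Old Input4MIPs CO2 layout: sector came first.
old_co2 = VarLayout(dims=("sector", "time", "lat", "lon"), dtype="float")
# Current output described above: right order, but double values.
current = VarLayout(dims=("time", "sector", "lat", "lon"), dtype="double")

print(layout_problems(old_co2))
print(layout_problems(current))
```

A reader that passes hard-coded start and count arrays depends on exactly these two properties, which is why they are checked independently of whatever the sector coordinate variable contains.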

Sure, i understand that. So as a summary, we need the anthro files to have always the full "sector" dimension with:

0: "Agriculture", 1: "Energy", 2: "Industrial", 3: "Transportation", 4: "Residential, Commercial, Other", 5: "Solvents production and application", 6: "Waste", 7: "International Shipping", 8: "Negative emissions (for CO2 only)"

We need to change the datatype to float.

coroa commented 6 months ago

Some more feedback for the sector coordinate of the CO2_em_anthro variable:

  • The Negative emissions sector for CO2 is superseded by the new CO2_em_removal variable, where those are split into the sectors CDR DACCS, CDR OAE and CDR Industry

I think it was our intention since the beginning to provide both the total negative emissions as a single sector in the main CO2-em-anthro files. And then this total value would be split up among several new sectors in the new CO2_em_removal.

In summary, I would like to have both so users can choose.

Ok, i was not aware that we wanted to have the negative emissions two times, but i'll make sure that this works.

etiennesky commented 6 months ago

Ok, i was not aware that we wanted to have the negative emissions two times, but i'll make sure that this works.

Thanks, this way our scenarios are "compatible" with the CMIP6 ones.

gidden commented 6 months ago

Some more feedback for the sector coordinate of the CO2_em_anthro variable:

  • The Negative emissions sector for CO2 is superseded by the new CO2_em_removal variable, where those are split into the sectors CDR DACCS, CDR OAE and CDR Industry

I think it was our intention since the beginning to provide both the total negative emissions as a single sector in the main CO2-em-anthro files. And then this total value would be split up among several new sectors in the new CO2_em_removal.

In summary, I would like to have both so users can choose.

Hi @etiennesky here I think I disagree and it would be useful to discuss more.

I had understood that we wanted to treat negative emissions explicitly differently this time. I would rather provide a single file (anthro incl. CDR) than two files that both include negative emissions.

My primary concern is the risk that data users could accidentally double count the negative emissions. One of the nice safeguards of the CMIP6 data is that any given emission flux was only provided once. You can stack them together, sum over different dimensions, and all data is consistent. I would strongly advise against breaking that pattern.
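
A small worked example of the double-counting risk. All numbers are made up purely for illustration (hypothetical global totals in Mt CO2/yr) and do not come from any scenario file:

```python
# Hypothetical positive anthropogenic emissions by sector (illustrative only).
anthro_positive = {"Energy": 100.0, "Industrial": 40.0}

# Hypothetical removals, split into the CO2_em_removal sectors named above.
removal_by_sector = {"CDR DACCS": -5.0, "CDR OAE": -2.0, "CDR Industry": -3.0}
total_removal = sum(removal_by_sector.values())

# The physically correct net flux counts every removal exactly once.
true_net = sum(anthro_positive.values()) + total_removal

# Option A: the anthro file ALSO carries the total as a "Negative CO2 Emissions"
# sector. A user who stacks both files and sums counts the removals twice.
anthro_with_negative = {**anthro_positive, "Negative CO2 Emissions": total_removal}
naive_sum_a = sum(anthro_with_negative.values()) + sum(removal_by_sector.values())

# Option B: every flux appears in exactly one file, so the naive sum is correct.
naive_sum_b = sum(anthro_positive.values()) + sum(removal_by_sector.values())

print(true_net, naive_sum_a, naive_sum_b)
```

With these numbers the naive sum under option A is off by exactly the total removal, while option B reproduces the true net flux.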

etiennesky commented 6 months ago

My primary concern is the risk that data users could accidentally double count the negative emissions. One of the nice safeguards of the CMIP6 data is that any given emission flux was only provided once. You can stack them together, sum over different dimensions, and all data is consistent. I would strongly advise against breaking that pattern.

Hi @gidden I think your reasoning is fine, it will be a slight burden for any modelling group to implement this, but it should be ok as long as it adheres to the same structure as the existing CO2-em-anthro files (with a sector dimension properly documented).

ShraddhaGupta28 commented 6 months ago

Hi,
Some feedback on the CO2-em-removal files: the sector ids are NaN for the OAE sector:

    double sector(sector) ;
        sector:_FillValue = NaN ;
        sector:long_name = "sector" ;
        sector:id = "1.0: CDR Industry; nan: CDR OAE; 0.0: CDR DACCS" ;

Also, the sector data type is double in these files, compared to int64 in the corresponding CO2-em-anthro files.
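
A minimal sketch of why a NaN id is a problem in practice, assuming the `sector:id` string above has been read as plain text. The helper name `parse_float_ids` is hypothetical:

```python
from __future__ import annotations

import math


def parse_float_ids(attr: str) -> list[tuple[float, str]]:
    """Parse an 'id: name; id: name' string whose ids are floats (or 'nan')."""
    pairs = []
    for entry in attr.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        raw_id, name = entry.split(":", 1)
        pairs.append((float(raw_id), name.strip()))
    return pairs


ATTR = "1.0: CDR Industry; nan: CDR OAE; 0.0: CDR DACCS"
pairs = parse_float_ids(ATTR)

# Sectors whose id cannot be used for selection or ordering.
bad = [name for value, name in pairs if math.isnan(value)]
print(bad)

# NaN never compares equal to anything, including itself, so a lookup
# such as "sector == id_of_OAE" can never match this entry.
nan_id = pairs[1][0]
print(nan_id == nan_id)
```

The first print lists the CDR OAE sector and the second prints False, so the OAE slice can only be reached by position, not by its id.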

EDIT: Moved into new issue #34 by @coroa .