alan-turing-institute / clim-recal

Open repository of methods for recalibrating & bias correcting UKCP18 climate projections data
https://alan-turing-institute.github.io/clim-recal/
MIT License

HADs 360 day calendar conversion #32

Closed RuthBowyer closed 1 year ago

RuthBowyer commented 1 year ago

Moving to a new issue as I may have noticed an additional problem - this is the original thread and relevant comment:

I wasn't aware of the 356 days on the leap years. This is the line used to create the calendar conversion, and looking at the documentation I'm not sure why the 365 days are still occurring... Documentation: https://docs.xarray.dev/en/stable/generated/xarray.Dataset.convert_calendar.html

I need to look up how 360-day calendars are sampled, but it seems that in the 360-day HADs calendars, Feb is only sampled up to the 27th (i.e. why not use all 28 days in Feb? Presumably there's a reason, but I wanted to note it). It could be something to do with how I've pulled through the date names to the format I'm working with the data in, but I can't work it out!
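
For reference, a minimal sketch of the kind of conversion being discussed (the actual line used in the pipeline isn't shown here; the file path below is a placeholder, not real pipeline code):

```python
import xarray as xr

# Placeholder path: any daily HADs-UK grid file on the standard calendar.
ds = xr.open_dataset("tasmax_hadukgrid_uk_1km_day_19800101-19800131.nc")

# Convert the standard (Gregorian) calendar to a 360-day calendar.
# With align_on="year", surplus days are dropped at regular intervals
# through the year, ignoring their original month/day.
ds_360 = ds.convert_calendar("360_day", align_on="year")
```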

aranas commented 1 year ago

If we change the align_on option to 'date' it should fix it for this dataset, but we might want to
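
For comparison, a sketch of what that option change would look like (same hypothetical dataset as in the sketch above; see the xarray convert_calendar docs for the exact semantics of each align_on value):

```python
# With align_on="date" the month/day information is conserved, and dates
# that are invalid in the 360-day calendar (e.g. the 31sts) are simply
# dropped, rather than days being shifted by their position in the year.
ds_360 = ds.convert_calendar("360_day", align_on="date")
```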

dingaaling commented 1 year ago

Current @RuthBowyer workaround is to remove 5 days (e.g. the 31st of the month); the duplicate days weren't noticed, which would be good to document. @aranas discussed producing a script to reproduce the error to double-check.

aranas commented 1 year ago

@RuthBowyer Could you please specify which file specifically only runs up to February 27th? I am not sure I can reproduce your error. Here is what I see instead: I have added a script (check_calendar.py) that loads both the raw HADs data and the resampled data and compares the dates between the two. I pushed both the script and the log file it produces to the branch bug_calendar. In the log file you can see that for some months a day from a previous month is added and/or the final date (the 31st) is dropped. This is because of the flag "year":

- "year"
            The dates are translated according to their relative position in the year,
            ignoring their original month and day information, meaning that the
            missing/surplus days are added/removed at regular intervals.

So essentially the dates have been realigned in terms of absolute position, which means that a March file might only run until the 29th and March 30th would appear in the April file. To me it seems this is not what we want, because it leads to duplicate dates, e.g. 1980-03-30 appears both in tasmax_hadukgrid_uk_1km_day_19800301-19800330.nc and in tasmax_hadukgrid_uk_1km_day_19800401-19800430.nc
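
A rough sketch of the kind of date comparison described above (the real check lives in check_calendar.py on the bug_calendar branch; the file names below are illustrative only):

```python
import xarray as xr

# Illustrative file names, not the real pipeline paths.
raw = xr.open_dataset("raw/tasmax_hadukgrid_uk_1km_day_19800301-19800331.nc")
res = xr.open_dataset("resampled/tasmax_hadukgrid_uk_1km_day_19800301-19800330.nc")

# Compare the calendar dates present in each file (works for both
# datetime64 and cftime time coordinates).
raw_dates = {str(t)[:10] for t in raw.time.values}
res_dates = {str(t)[:10] for t in res.time.values}

print("dropped by conversion:", sorted(raw_dates - res_dates))
print("added by conversion:  ", sorted(res_dates - raw_dates))
```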

So my questions are:

- am I working with the correct files? (since I don't reproduce your observation)
- is what I am observing an issue for the data you have already bias corrected?
- let's double check our desired outcome: by resampling to a 360 calendar, we expect that every month has 30 days (including February), how do you expect should the data for the added days be created (nan, interpolation, duplicate?)

aranas commented 1 year ago

Commit (sorry, I forgot to link the issue): 62b86f564adcc02d32d53a2e2cad4bfafd96b799

RuthBowyer commented 1 year ago

Thanks so much Sophie!

Just for my own clarity, where you say:

tasmax_hadukgrid_uk_1km_day_19800301-19800330.nc but also in tasmax_hadukgrid_uk_1km_day_19800401-19800430.nc

Does this mean the 1km hadukgrid is being resampled to 30 days before it's being resampled to the 2.2km grid? My understanding was the other way round, but it's not a problem either way!

I was using a file a few steps down the pipeline, but because I added a little bit of code to rename the layers (the input files I was using are in this series: tasmax_hadukgrid_uk_1km_day_2.2km_resampled_19800201-19800229.nc in /mnt/vmfileshare/ClimateData/Processed/HadsUKgrid/resampled_2.2km/tasmax/, with the cropped dfs as the output files - however, because this pulls the raster layers through as 'tasmax_1' etc. rather than a date, I renamed them for sanity checking) I must have introduced this:

Could you please specify which file specifically only runs up to february 27th? I am not sure I can reproduce your error.

~there (although I can't immediately spot how; I have checked the input file and indeed it has 29 days in that resampled month, Feb 1980) - apologies for adding further confusion! I'll go back and alter this when we've sorted the next bit.~ -- see comment below

In answer therefore to your Qs:

so essentially the dates have been realigned in terms of absolute position, which means that a march file might only run til 29th and march 30th would appear in the april file. To me it seems this is not what we want because it leads to duplicate dates, eg 1980-03-30 appears both in tasmax_hadukgrid_uk_1km_day_19800301-19800330.nc but also in tasmax_hadukgrid_uk_1km_day_19800401-19800430.nc

Yes, this is confusing, and I think it's what confused me initially! But we could work with it if we just have 360 days in every year?

So my questions are:

am I working with the correct files? (since I don't reproduce your observation)

See answer above

is what I am observing an issue for the data you have already bias corrected?

Yep, in that I am using this data as the observation data for the bias correction, and have already done so. However, I don't think we need to worry very much about this - the dates will align enough for it to accurately reflect the seasonality of observations (i.e. the weather on 30th March is quite similar to 1st April) - but I think aligning as well as possible is a good aim going forwards. I'll proceed with bias correction for the three cities and the UK-wide data for now, but for comparing the methods across the 3 cities, it would be fairly straightforward to just rerun the assessments etc. when we've decided on a final dataset for this issue :-)

let's double check our desired outcome: by resampling to a 360 calendar, we expect that every month has 30 days (including February), how do you expect should the data for the added days be created (nan, interpolation, duplicate?)

It's a great Q - to me, whatever the default in the resampler is would be fine, as whoever made it probably knows more about it than me (!). I guess because a year has more than 360 days, my assumption would be that days would just be shifted to accommodate this (e.g. 1st-2nd March becomes 29th-30th Feb) and the surplus '31st's just get dropped? We could ask the MO folks if they have a preference?

RuthBowyer commented 1 year ago

Sorry, I just realised from talking to @gmingas where the 27-day thing I had mentioned was coming from, @aranas:

/mnt/vmfileshare/ClimateData/Processed/HadsUKgrid/resampled_2.2km/tasmax/day/tasmax_hadukgrid_uk_1km_day_2.2km_resampled_19810201-19810228.nc

Has only 27 vals (i.e. only 27 days in Feb rather than 30). This doesn't affect my answers above, however, and I guess as long as we're aligning across 360 days it's OK?

RuthBowyer commented 1 year ago

Also, relatedly, I believe the underlying raw HADs data contains NA cells, and therefore there are grid cells in the CPM crops which don't have a value in the corresponding HADs crops. This is also relevant for #34 and for applying the subsequent bias correction @gmingas

I've just been subsetting them to match - although, thinking about it, this could actually be what's causing the issue observed in #33, which would be great!! ~I will double check this - but just again to flag that I believe the underlying raw HADs data contains NA cells where the CPM does not, probably mostly over water - if someone had time to confirm this whilst I was away that would be great!~ I have double-checked this and indeed the raw HADs data contains NA cells where water bodies are.

RuthBowyer commented 1 year ago

Also, if we're re-running this anyway, maybe we could rename the HADs output from 'rainfall' to 'pr' to match the naming convention of the CPM (it would make processing easier, for me at least!)

aranas commented 1 year ago

Hi @RuthBowyer ,

Does this mean the 1km hadsukgrid is being resampled to 30 days before it's being resampled to the 2.2km grid? My understanding was the other way round but not a problem either way!

Temporal resampling does indeed happen after spatial resampling, you're right! I meant the resampled versions of those files.

aranas commented 1 year ago

I think I finally understand why we get the duplicates, and I think this might be a bit of a bug in the convert_calendar xarray function when applying it to partial files. The 5 surplus days we get after conversion are the following:

- date '2020-03-30' appears 2 times
- date '2020-05-30' appears 2 times
- date '2020-07-30' appears 2 times
- date '2020-09-30' appears 2 times
- date '2020-11-30' appears 2 times

and this is consistent with the documentation, because when converting

From a standard calendar to a “360_day”, the following dates in the source array will be dropped:

From a leap year:

    January 31st (31), April 1st (92), June 1st (153), August 1st (214), September 31st (275), December 1st (336)

So what happens is that April 1st, instead of being "dropped" (as correctly happens for non-leap years), is simply renamed to the previous day (e.g. March 30th), and all other dates are shifted (e.g. April 2nd becomes April 1st). However, because we do the conversion one month at a time, the algorithm does not take into account that a previous month already contains March 30th, and we end up with the duplicate date.

So if we simply delete the second instance of the duplicate dates, we achieve the output as specified by the documentation, and the calendars should align nicely.
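
As a sketch of what that could look like once the monthly files are concatenated (illustrative paths and names only, not the agreed implementation):

```python
import glob
import xarray as xr

# Illustrative path: the monthly, already-converted files for one variable.
files = sorted(glob.glob("converted/tasmax/*.nc"))
ds = xr.concat([xr.open_dataset(f) for f in files], dim="time")

# Keep only the first occurrence of each date; this drops the duplicated
# '...-30' entries produced by the month-by-month conversion.
keep = ~ds.get_index("time").duplicated(keep="first")
ds_clean = ds.isel(time=keep)
```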

However, this can only be done once the files are recombined, which happens at two distinct points due to the preprocessing happening in both R and Python, so maybe @gmingas we can find the right spot for correcting this together during coworking on Monday.

In the meantime I have also filed a bug with a minimal complete verifiable example over on the xarray repo, to confirm that this is truly a bug: https://github.com/pydata/xarray/issues/8086

aranas commented 1 year ago

and just to confirm @RuthBowyer

Sorry, I just realised from talking to @gmingas where the 27 day thing I had mentioned was coming from @aranas

/mnt/vmfileshare/ClimateData/Processed/HadsUKgrid/resampled_2.2km/tasmax/day/tasmax_hadukgrid_uk_1km_day_2.2km_resampled_19810201-19810228.nc

Has only 27 vals (i.e. only 27 days in Feb rather than 30). This doesn't affect my answers above, however, and I guess as long as we're aligning across 360 days it's OK?

Yes, this is actually correct behavior of the convert_calendar call: because of the realignment and the dropping of February 6th in non-leap years, the file ends up running from February 2nd to February 28th, and the next file (March) will contain the dates February 29th and February 30th, so after recombining all dates we indeed end up with 360.
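
As a quick sanity check (sketch only, reusing the concatenated, de-duplicated dataset from the earlier sketch), every complete year should come out at exactly 360 time steps after recombining:

```python
# Count the number of dates per year in the recombined dataset; complete
# years on a correctly converted 360-day calendar should all show 360.
print(ds_clean.time.groupby("time.year").count())
```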

The question is whether it is an issue that some dates are temporarily (before combining across months) associated with the "wrong" files (e.g. February dates occurring in the March file).

Also, I understand now that we don't need to interpolate data, because in this case we only drop dates and then realign, as you suggested. The only problem to be solved then is to delete the duplicate dates, which will be very similar to what you have done in your hack of removing random dates, just that we end up with a clean calendar rather than keeping the duplicates. (Or they fix the behavior on the xarray side... but that could probably take a while.)

RuthBowyer commented 1 year ago

Thank you for this clarification @aranas ! This all sounds great.

Just to clarify re interpolation, there is still a reason we might consider it:

Also, relatedly, I believe the underlying raw HADs data contains NA cells, and therefore there are grid cells in the CPM crops which don't have a value in the corresponding HADs crops. Relevant also for https://github.com/alan-turing-institute/clim-recal/issues/34 and applying subsequent bias correction

(I.e. the issue I thought was due to the shapefile was actually due to the obs data, as documented here: #33) - so we might want to interpolate the NAs - but probably i) this should be done after bias correction and ii) I should start a new issue for this to avoid confusing everyone!

aranas commented 1 year ago

Ah yes, indeed it seems like opening a new issue would be the way to go. Then we can plan it into a sprint.

gmingas commented 1 year ago

@gmingas to sense check the script in https://github.com/alan-turing-institute/clim-recal/tree/bug_calendar

aranas commented 1 year ago

Hi @gmingas @RuthBowyer, I want to close this issue but I cannot seem to run the script all the way to the end 😭. When I ssh into the VM there is no Anaconda distribution, and I cannot install one because there is no space left. I am sure there is a solution to this and I can keep trying to run it from the VM, but I am thinking: since both of you have the set-up, would it be easier if one of you runs the script and writes the data to the folder? I think I keep producing corrupted files because my fileshare connection breaks down but the script keeps running.

I ran it with the following arguments: python resampling_hads.py --input '/Volumes/vmfileshare/ClimateData/Raw/HadsUKgrid/tasmax/day' --output '/Volumes/vmfileshare/ClimateData/Processed/HadsUKgrid/resampled_calendarfix/tasmax' --grid_data '../../data/rcp85_land-cpm_uk_2.2km_grid.nc'

and then I use check_calendar.py to ensure all dates are there as intended

RuthBowyer commented 1 year ago

I can give running it a go via the dyme vm! Will update with how I get on :-)

RuthBowyer commented 1 year ago

Hey @aranas and @gmingas - this seems to have worked and I have run it for tasmax, tasmin and rainfall.

To note:

To get the script to work, I had to install rioxarray. I don't understand why this worked, as it's not called from the script directly (perhaps you folks know?) - is it just updating xarray somehow?

If the data looks OK, I think we can close this issue :)

aranas commented 1 year ago

Thank you Ruth, I should've handed this over earlier and I'm happy that we can close it now :) I'm copying here from our conversation on Slack:

rioxarray is an extension for the xarray package that allows it to work with geospatial data. It seems that extensions don't require you to import them explicitly; as long as the extension is installed, xarray functions can make use of the added "rio" attribute.

Looking at the documentation of rioxarray, we could import it directly. If we did that then we would also run the initialization code for that module, and it might be that this would set up some configs necessary to perform rioxarray-specific functionality. So I think we would need to dive deeper into the package to decide if we'd rather call rioxarray directly. Benefits might be:

I guess the only con is that the syntax might change slightly, creating more work for us at this point. For now, I have noted this down to make sure we mention the package in the documentation, so that users are aware of it being used implicitly. Thank you for pointing this out!
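
For illustration, importing it explicitly would look roughly like this (placeholder file path; a sketch of the idea, not a change to the actual scripts):

```python
import rioxarray  # noqa: F401  (importing registers the ".rio" accessor)
import xarray as xr

ds = xr.open_dataset("tasmax_example.nc")  # placeholder path

# With rioxarray available, geospatial helpers are exposed via ".rio",
# e.g. inspecting the coordinate reference system:
print(ds.rio.crs)
```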

aranas commented 1 year ago

End result: