VeruGHub / easyclimate

Easy access to high-resolution daily climate data for Europe
https://verughub.github.io/easyclimate/
GNU General Public License v3.0

Monthly data (ongoing) #48

Open VeruGHub opened 7 months ago

VeruGHub commented 7 months ago

Pakillo commented 7 months ago

Fine with me. Just one doubt: why are there NAs?

Pakillo commented 3 months ago

Hi, I'm reviewing the monthly data branch and I think it would be best not to provide monthly data when there are days with NA (i.e. set the value for the month to NA too). Otherwise I think there is a high risk of bias and of people misunderstanding/misusing the monthly climate data.

Imagine there is a single day without data in a month, but that day there was quite a lot of rain (which we can't know). The monthly precipitation data will be very biased. People may think just one missing day in a month is fine, but I think there is a high risk of bias, and most importantly, we can't ascertain the bias just by looking at the number of missing days: just one missing day may be enough for inducing significant bias. For example, in places where it rains nearly every day, the monthly sum will be biased every single time (and IMO should not be used as such). Also with temperatures (e.g. missing heat or cold waves).
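To make the point concrete, here is a minimal sketch with made-up daily precipitation values (not real E-OBS data). It assumes that the one missing day happens to be the wettest day of the month, and compares the true monthly total with the total obtained by ignoring the gap.

```r
# Hypothetical daily precipitation (mm) for a 30-day month; values are invented.
prec <- rep(0, 30)
prec[c(3, 11, 17, 24)] <- c(4, 6, 60, 5)  # day 17 is a heavy-rain day

true_total <- sum(prec)                   # total if every day were observed

# Suppose the heavy-rain day is the one without data
prec_obs <- prec
prec_obs[17] <- NA

biased_total  <- sum(prec_obs, na.rm = TRUE)  # silently drops the missing day
default_total <- sum(prec_obs)                # NA: the gap stays visible

true_total
#> [1] 75
biased_total
#> [1] 15
default_total
#> [1] NA
```

In this invented case a single missing day removes 80% of the monthly total, and nothing in the count of missing days (1) hints at it.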

I strongly think monthly data should be NA whenever there are days with missing data (how frequent are these??)

Pakillo commented 3 months ago

Also, setting monthly data to NA when there are missing data will greatly simplify the code and server load (no need to download missingdata rasters) and the output (no need to create missing data rasters either). So better for users and developers IMO

Now the user can't just use plot(output) to get a map because the output is a list of rasters (values + NA layers)

Julenasti commented 3 months ago

I agree with Paco here. As a user, I would prefer fewer but more reliable data over more data that ignore the uncertainty. The number of NA days per month can be a proxy for uncertainty, but it would be a rather dubious proxy because we don't know the values we are missing.

Pakillo commented 3 months ago

Thanks Julen. I apologise because that approach was proposed early and I didn't object. I think I didn't understand the implications then. But now I'm quite convinced we should step back (i.e. set monthly value = NA when missing daily data). Sorry. Of course, if we all agree on this, I can make the required changes if that is helpful so as not to give more work to you.

VeruGHub commented 3 months ago

I think that we need to discuss this a bit more before taking a decision, so here are some thoughts.

I think that giving the coverage of the data (which is not the uncertainty itself but is a component of uncertainty) gives users the opportunity to decide by considering their specific case study or region. Every user could transform the monthly values into NAs where there is at least 1 missing daily value. Missing 1 daily value per month in precipitation can be dangerous, and that depends a lot on the precipitation regime. I agree on that. Missing 1 daily value per month in temperature is going to make hardly any change in the monthly average values. Should we then set monthly data to NA in all situations? We will be losing relevant pieces of information by doing so. As a user, if I find NAs in monthly data with the explanation that some daily data are missing (from 1 to 30 days), but without any further information, I would probably go to the daily data to see what's going on and maybe calculate the average myself. Maybe we need to see how NAs are distributed in space and time (i.e. are they all aggregated into a year or specific months or days?) before deciding. Or maybe knowing the origin of the daily NAs would help.

Other things discussed by email: I don't see how giving the average values plus the coverage is against reproducibility if we are clear about the procedure. I am not so convinced by the option of giving users monthly NAs when the original raster holds a value, because we would give different information depending on the procedure used for downloading the data. So if we finally consider that any NA would seriously bias the average values, then I would recalculate the rasters.

An intermediate option would be to keep the averages + coverage and warn very explicitly when downloading.

I really want to do this update in the best way, even if it means undoing most of the work I did in December to adapt the scripts and tests to include the coverage. But for me, the option of giving monthly NAs whenever there are daily NAs does not seem so straightforward.

Pakillo commented 3 months ago

Thanks Vero

Yes, we would have to recalculate the rasters. What I meant is that we probably don't need to calculate monthly values from daily rasters again, but instead just overlay the monthly rasters with the missingdata rasters to set NA values when the number of missing days > 0. Then in the FTP server, there would be the monthly rasters with NA values when corresponding. And those would be the rasters we would query with easyclimate. No need to host the missingdata rasters IMO, to save FTP space (@cpucher). Also, the output would still be a SpatRaster, not a list, so can be plotted directly, etc.
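The overlay step described above can be sketched in base R as follows. This is only an illustration under stated assumptions: plain matrices stand in for one monthly raster layer and its matching missingdata layer, all values are invented, and the object names are hypothetical. The same element-wise logic should carry over to raster objects.

```r
# Toy 2 x 3 grids standing in for one monthly raster layer and its
# matching missingdata layer (number of missing days per cell).
# All values are invented for illustration.
monthly_prec <- matrix(c(80, 95, 60, 110, 72, 88), nrow = 2)
missing_days <- matrix(c(0, 2, 0, 0, 1, 0), nrow = 2)

# Set the monthly value to NA wherever at least one day is missing
masked <- monthly_prec
masked[missing_days > 0] <- NA

# Two cells had missing days, so two cells are now NA
sum(is.na(masked))
#> [1] 2
```

The result is a single grid of values with NA where coverage was incomplete, so no second layer has to be shipped alongside it.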

I think there is a crucial distinction between a script written ad hoc for a specific analysis and a piece of software used by thousands of people. I'm fine with you deciding, based on your skills and knowledge, that one day of missing data is ok for your analysis in your specific area and time period, and going on. But easyclimate is already (and will be) used by thousands of people, and IMO we have some responsibility for the data and software we provide and the choices we make. Many people will not read the documentation, or any warning we might print in the console. Many people will not look into the NA columns in the data frame, and will use the values straight away. Even if they look into that column, they may think that 1, 2, 3 days missing in a month should be fine, and then use those biased data in their analyses, and publish them. Likewise, many people will use the rasters with the monthly values, without worrying about the accompanying missingdata rasters. Even if they might care, many people will not know how to combine both lists of rasters to set NA values when the raster with missing days is > 0. And so on. So I'm sure too many papers using biased climate data would appear. We should avoid that. We shouldn't open the path for that.

So I think we have some responsibility for providing data that are as correct as possible, and for anticipating decisions by a wide array of users with diverse skills and knowledge. We know that monthly values based on incomplete data are going to be biased very often, too often. And the number of days missing is a poor proxy to evaluate the amount of bias. Even a single missing day may already mean quite a lot of bias in some areas and periods. Of course precipitation is going to be more affected, because it's a sum. But temperature (average) could be affected too. Particularly if data are not missing completely at random.

Think of what sum and mean functions do in R by default when there are missing data:

sum(c(1, 2, NA))
#> [1] NA

mean(c(1, 2, NA))
#> [1] NA

They do this for good reason. Of course you can still ask for the sum or mean ignoring the missing data, but that requires a conscious action by the user. I think we should act the same. IMO the most correct way is to set monthly values as NA when there are missing data, and then the user can go inspect the daily data for that month, as you say, and decide what to do: maybe ignore the missing days, maybe do some missing data imputation, etc. But they must take a conscious decision. That's what typical software does, and that's what we should also do IMO.
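For comparison, this is what the conscious opt-in looks like in base R, using the same toy vector as above. Nothing here is specific to easyclimate.

```r
x <- c(1, 2, NA)

# The user has to ask explicitly for missing values to be dropped
sum(x, na.rm = TRUE)
#> [1] 3

mean(x, na.rm = TRUE)
#> [1] 1.5
```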

By lack of reproducibility I meant that people would take many different decisions about what to do with the info on the number of days missing, i.e. what threshold to use, when to discard the monthly data etc. Some people might only take monthly values if there are no missing days, some people might use a 1-day threshold, 2-day threshold, and so on. And nobody is going to know which decisions the authors took. Remember that only a small minority of papers share their code. And few people would describe in methods which thresholds they used. If we are lucky, they will cite easyclimate and the data source v. 4. So what I meant is that different authors would take different decisions with the data provided, and nobody would know which one, and why everybody gets different results with the same climate data to start with. So it would be more difficult to reproduce the results. This is a minor concern compared to people using and publishing papers based on biased data, but still...

I hope my points are clearer now. The 'good' news is that we are in time to revert this, and it shouldn't be too much work if we recalculate the rasters as above. Reverting the code is also not complicated (I can help with that of course).

Thanks

cpucher commented 2 months ago

Dear all,

I agree with Paco on this one. There is just too much room for "error" or "mishandling" if NAs are removed by default.

Regarding the re-calculation of the rasters, don't worry too much about this, as Paco pointed out, I don't have to go back to the daily data again, but can use the already calculated monthly + missingdata rasters to set the NA values.

I also understand that Vero put some work into the code to allow the retrieval of both the monthly and the missingdata rasters. Another option would be to add an na.rm option, which users would consciously need to set to TRUE in order to get both the monthly and the missingdata rasters. By default (na.rm = FALSE) they would only get the monthly rasters, and if at least one day is missing the value for that month would be NA. However, I'm not sure if this is really needed, as the users themselves can use the daily data to inspect in case they get an NA returned. And then, as Paco pointed out, they have to take a conscious decision on how to deal with it.
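A rough sketch of how such an na.rm switch might behave, written on plain vectors of daily values for a single month. The function `monthly_value()` and its arguments are hypothetical and are not part of easyclimate; the sketch only mirrors the behaviour described above.

```r
# Hypothetical helper, not part of easyclimate: aggregates the daily values
# of one month and only ignores missing days when explicitly asked to.
monthly_value <- function(daily, fun = sum, na.rm = FALSE) {
  n_missing <- sum(is.na(daily))
  value <- fun(daily, na.rm = na.rm)
  if (!na.rm) {
    return(value)  # NA as soon as one day is missing
  }
  # With na.rm = TRUE, also report how many days were dropped
  list(value = value, missing_days = n_missing)
}

daily <- c(2, 0, NA, 5)  # invented values

monthly_value(daily)
#> [1] NA

res <- monthly_value(daily, na.rm = TRUE)
res$value
#> [1] 7
res$missing_days
#> [1] 1
```

Note that in this sketch the shape of the output depends on na.rm (a single value versus a value plus a count of missing days).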

Regarding the coverage: I never looked into it in detail, but the only areas having serious coverage issues are Greece and the southernmost parts of Italy (Sicily). Missing data always comes from missing daily data in the E-Obs data. I can look into it in more detail, as we already have the yearly rasters with number of missing days it shouldn't be too much work. What would be the best way to present the information? Map for each year showing the coverage? Or for 10-year periods?

Best, Christoph

VeruGHub commented 2 months ago

Ok. Let's be cautious then and not open the door to biased publications, especially if the rasters are easy to update.

Then I think that the best option is to simply give NA when there are some missing data. I would not include the na.rm = T/F option that Christoph suggested, because it would produce different outputs depending on what you choose and complicate the coding, and the user can always retrieve the daily data. I think that we need to add information about missing values to the v4 documentation, to prevent users asking about it and to give them some clues about what to do when they get NA in monthly values. I think that a map showing the total missing values in the entire series will be useful, and also 12 other maps (one per month) showing the average number of missing values per year. I find not only the spatial aggregation of missing values important, but also the temporal aggregation. What do you think?

I would like to lead the changes in the package, so I ask you to be patient. The next version of the package will not only give access to monthly averages of Tmax, Tmin and Prec, but also it needs to be adapted to monthly Tavg and yearly values (https://github.com/VeruGHub/easyclimate/issues/54). Please, let me know if you have any comment on this.

Also, the new version might need to be adapted to data version v5 (follow the discussion here).

cpucher commented 2 months ago

Agree with not including the na.rm Option.

"I think that a map showing the total missing values in the entire series will be useful and also 12 other maps (one per month) showing the average number of missing values per year."

The only problem with this is that in case the missing days are clustered (e.g. no data for the first 5 years but then full coverage) the map could be misinterpreted. Agree on the 12 maps (one per month).
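The clustering caveat can be illustrated with a small sketch. The counts below are invented (they are not taken from the E-OBS coverage) and represent the number of missing January days per year, over ten years, at two hypothetical grid cells.

```r
# Invented counts of missing days in January for ten years at two cells.
# Both cells have the same total, but very different temporal patterns.
cell_clustered <- c(31, 31, 0, 0, 0, 0, 0, 0, 0, 0)  # no data in the first 2 years
cell_scattered <- c(6, 7, 6, 6, 6, 7, 6, 6, 6, 6)    # a few days missing every year

# A map of totals (or of yearly averages) cannot tell the two cells apart
sum(cell_clustered)
#> [1] 62
sum(cell_scattered)
#> [1] 62
mean(cell_clustered)
#> [1] 6.2
mean(cell_scattered)
#> [1] 6.2

# Number of years with a complete January differs completely
sum(cell_clustered == 0)
#> [1] 8
sum(cell_scattered == 0)
#> [1] 0
```

Under these assumptions, a summary such as the number of years with full coverage separates the two cases where the total and the average do not.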

So I put these things on my TODO list:

Pakillo commented 2 months ago

Excellent, thank you both!

cpucher commented 2 months ago

Dear all,

here is the update:

Best, Christoph