Closed — kariaust closed this issue 1 year ago
An update to the above points after further discussions:
I've made a start on this here. Based on this preliminary exploration, I'd say the most promising time periods to consider are 1990 to 2016 and 2000 to 2016 (see the plot here).
Regarding your points above:
Do we risk the time series becoming too short if we also apply the five-year criterion to the series starting in 2000?
My code for the Mann-Kendall test (like most other implementations) uses statistical approximations for calculating significance values that are only really valid for n >= ~10. My M-K code will print a warning for n < 10, so we'll be able to see if and where this is a problem.
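For reference, the normal-approximation version of the test can be sketched as below. This is a minimal illustration, not the actual project code: tie corrections are omitted, and the `mann_kendall` function name and its `alpha`-free signature are my own.

```python
import itertools
import math
import warnings


def mann_kendall(values):
    """Two-sided Mann-Kendall trend test using the normal approximation.

    The approximation (tie correction omitted here for brevity) is only
    reliable for n >= ~10, so warn below that threshold.
    """
    n = len(values)
    if n < 10:
        warnings.warn(f"n = {n} < 10: normal approximation may be unreliable")

    # S statistic: sum of signs over all pairwise differences
    s = sum(
        (x_j > x_i) - (x_j < x_i)
        for x_i, x_j in itertools.combinations(values, 2)
    )

    # Variance of S under the null hypothesis (no ties assumed)
    var_s = n * (n - 1) * (2 * n + 5) / 18

    # Continuity-corrected z-score
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0

    # Two-sided p-value from the standard normal CDF
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return s, z, p
```

A monotonically increasing series gives a large positive S and a small p-value, while short series trigger the warning.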
4a. If we have detection limits...
We have some LOD data, but it's patchy. Some Focal Centres don't highlight LOD values at all, and some do, but only sometimes. In several cases, it's pretty obvious from the data that values are at the LOD, but they're not flagged as such. I'll check to see how complete the LOD information is, but I suspect it may be so patchy and inconsistent that it will be difficult to use effectively.
Do you have a response to 4b and 4c as well?
And a new point 5: What should we do about QC of the less-used parameters? Are there any automatic outlier tests that could be used? Or shall we just wait until we discover some strange results...?
Regarding 4b, for the 2019 report, Øyvind and I initially defined much stricter selection criteria to ensure the annual means/medians were broadly representative (see here for full details):
For lakes: aggregate to seasonal frequency and require that fewer than 25% of the seasonal values within the period from 1995 to 2011 are missing
For rivers: aggregate to monthly frequency and require that fewer than 25% of the monthly values within the period from 1995 to 2011 are missing
Then calculate annual means/medians for the selected time series
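As a sketch, the completeness criterion above could look like this in pandas. The function name `complete_enough` and the assumption of a raw-sample Series with a DatetimeIndex are illustrative, not the actual 2019 code.

```python
import pandas as pd


def complete_enough(ser, start="1995-01-01", end="2011-12-31",
                    freq="MS", max_missing_frac=0.25):
    """Check the 'fewer than 25% missing' criterion for one station.

    'ser' is a pandas Series of raw samples with a DatetimeIndex.
    Aggregate to 'freq' (e.g. monthly start 'MS' for rivers, quarterly
    start 'QS' for lakes) and count empty periods within [start, end].
    """
    agg = ser.loc[start:end].resample(freq).mean()
    # Reindex over the full period so periods with no samples at all
    # still count as missing
    full = pd.date_range(start, end, freq=freq)
    agg = agg.reindex(full)
    return agg.isna().mean() < max_missing_frac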
The problem with this is that you end up with data from just a few countries with good monitoring. This reduces our ability to detect spatial patterns and may also be off-putting for Focal Centres (e.g. if they feel their data isn't being used). Unfortunately, I don't think we can do this in a meaningful way unless we're willing to ignore/remove data from many of the Focal Centres.
Regarding 4c, we typically request that Focal Centres only submit "surface" samples, although from the data in the database it seems that wasn't always the case (and, regardless, some FCs add a depth column to the input template and submit everything anyway). The vast majority of ICPW data is entered in the database as depth1 = depth2 = 0 m. For the current preliminary analysis, I'm selecting all samples ("mixed" or otherwise) within 1 m of the surface and then taking annual medians - see the last code cell in Section 2 here.
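In pandas terms, that selection amounts to something like the sketch below. The depth1/depth2 names follow the database description above; `sample_date` and `value` are hypothetical column names for illustration.

```python
import pandas as pd


def annual_medians(df, max_depth_m=1.0):
    """Annual medians from samples within 1 m of the surface.

    Assumes a DataFrame with 'sample_date' (datetime), 'depth1' and
    'depth2' (m, as in the database), and a 'value' column
    (hypothetical names).
    """
    near_surface = df[(df["depth1"] <= max_depth_m) &
                      (df["depth2"] <= max_depth_m)]
    return (near_surface
            .groupby(near_surface["sample_date"].dt.year)["value"]
            .median())
```

Samples deeper than 1 m simply drop out before the annual aggregation.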
Finally, regarding QC, we can run some outlier tests etc. if you like. However, my recommendation initially is to begin with the analysis and then see how things develop. I generally end up producing lots of intermediate plots as "sense checks" anyway (histograms, box plots, time series etc.) and these should help to identify at least the worst outliers. The fact that we're aggregating to annual medians (rather than e.g. trying to fit a seasonal model) will make things more robust too.
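If we do want a quick automatic check at some point, a simple Tukey-fence (IQR) flag is one option. This is just a sketch of a generic technique, not something already in the project; k=1.5 is the conventional fence, not a project decision.

```python
import pandas as pd


def iqr_outliers(ser, k=1.5):
    """Return values outside the Tukey fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = ser.quantile([0.25, 0.75])
    iqr = q3 - q1
    return ser[(ser < q1 - k * iqr) | (ser > q3 + k * iqr)]
```

Running this per station and parameter would flag candidates to inspect by eye, complementing the histogram/box-plot sense checks.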
OK, this sounds reasonable
Not sure if this is ever an issue, but if there is more than one sample within the upper 1 m, I would prefer that one value (the median?) is selected per sampling date before taking the annual median
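That two-step aggregation could be sketched as follows, assuming hypothetical column names `sample_date`, `depth1`/`depth2` and `value` (not the production code):

```python
import pandas as pd


def annual_medians_two_step(df, max_depth_m=1.0):
    """Collapse multiple near-surface samples per date to one value
    (the daily median) first, then take annual medians, so dates with
    several depth samples don't get extra weight in the annual value."""
    near_surface = df[(df["depth1"] <= max_depth_m) &
                      (df["depth2"] <= max_depth_m)]
    daily = near_surface.groupby("sample_date")["value"].median()
    return daily.groupby(daily.index.year).median()
```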