Site selection and summary distributions

JamesSample / icpw2

New decade, new workflows

http://www.icp-waters.no/

0 stars 0 forks source link

Site selection and summary distributions #4

Open JamesSample opened 1 year ago

JamesSample commented 1 year ago

@RolfVogt

I have added a notebook here that performs the following operations:

Gets raw data from the database
Calculates derived parameters
Aggregates to annual medians

Site selection

For each of four time periods (1990-2020, 1990-2004, 1998-2012 and 2008-2020), the code then considers the data for each parameter from each station and assesses whether the series meets the following selection criteria:

Series has at least one annual data point within the first 5 years
Series has at least one annual data point within the last 5 years
Series has data for round(0.65*n_years), where n_years is the duration of the period

This is the selection criteria used by Garmo et al. (2020), but please note that it's slightly different to the criteria defined in your Word document (Bestilling.docx).

Results are in the attached Excel file, where a value of 1 means the series meets the criteria for the specified station-parameter-period combination; and a value of zero means the series does not meet the criteria.

In terms of site selection, which parameters are mandatory for your analysis and which are optional? For example, if you filter the Excel file to the period for 1990 to 2020 and then remove zeros from the Ca column, you can see that 476 stations have time series for Ca alone that meet the selection criteria. However, only 425 have data meeting the criteria for all parameters.

Here's a summary for each period:

Period	N_Stns_Ca_Only	N_Stns_All_Pars
1990-2020	476	425
1990-2004	506	447
1998-2012	513	460
2008-2020	492	463

Are you happy to focus on sites that meet the criteria for all parameters?
Are you happy to consider slightly different site selections for each period (as in the table above), or would you like to focus on a fixed subset of sites that meet the criteria for all periods?

selection_criteria_by_station-par-period.xlsx

Data distributions

Finally, please see the boxplots at the end of the notebook (section 5) showing data distributions per parameter and country. Some countries - most notably Latvia - have quite different distributions compared to the rest. Please let me know if you would like to me to exclude, or further investigate, any country-specific datasets based on these plots.

Thanks!

JamesSample commented 1 year ago

@RolfVogt

A correction to the summary above:

Many of the sites have incomplete records for NH4. In my code for calculating ANC, I assumed that NH4 was zero if not reported (as discussed here). My original code simply filled missing values for NH4 with zero, which is OK for ANC but not OK if you're interested in trends in NH4.

If we just fill missing NH4 values with zeros, the table above is correct. However, if I only assume missing NH4 values are zero for the ANC calculation (but do not fill them elsewhere), then lack of NH4 data severely limits the number of sites with "good" series for all parameters.

For example, for the period from 1990 to 2020, there are 425 sites with data meeting the selection criteria for all parameters except NH4 (as shown in the table above), but this reduces to just 243 sites if we require a "good" NH4 record too.

Do you need to consider trends in NH4? If so, do you want to focus on the (much smaller) subset of sites with "good" NH4 data, or is this unnecessarily restrictive?

RolfVogt commented 1 year ago

@JamesSample

I suggest to change the last period to 2006 - 2020 so that all three periods are for 15 years
We do not need to consider trends in NH4. I am thus in favour for going for the 425 sites with data meeting the selection criteria for all parameters except NH4
I am also in favour of sticking to Garmo's selection criteria.
In regards to other paramters then Ca I wish to stick to the 425 that have data meeting the criteria for all parameters (i.e., not the 476 sites.
I am happy to consider slightly different site selections for each period
The distribution values I find puzzeling. SO4 > 2000 µeq/L, NO3 > 200 µeq/L, Cl > 3000 µeq/L, and TOC > 80 mg C/L strikes me as unrealistic. Perhaps we could run a ione balance including HCO3 and Organic anions and discharge samples where the ion balance is > 20 % off? I need to consult Heleen on this... I have sent her an e-mail with yo uas cc

JamesSample commented 1 year ago

@RolfVogt

I'm not sure we can use the ion balance approach here, can we, because we've already used it to estimate HCO3 (here)?

Or have I misunderstood?

Right now, I'm thinking:

$$IB = Ca + Mg + Na + K + HN4 + H - Cl - SO4 - NO3 - OrgAnions - HCO3$$

But we already have

$$ANC = Ca + Mg + Na + K + HN4 - Cl - SO4 - NO3$$

and

$$HOC3 = ANC + H - OrgAnions$$

Substituting these just gives

$$IB = ANC + H - OrgAnions - HCO3 = HCO3 - HCO3 = 0$$

by definition (except in cases where the original calculation of HCO3 gave a value less than zero, in which case it was set to zero).

Please let me know if I've misunderstood.

Thanks!

RolfVogt commented 1 year ago

@JamesSample You are, of course, entirely correct. This would be a fallacy by the logically incorrect inference that it arises by that the accepted conclusion is not a logical consequence of the premises Hopefully, excluding the sites listed in you last e-mail will improve the appearance of the data distribution significantly. Still, samples where [HCO3] = 0 should also have a charge balance. It would be nice to know if there were many of them that are > 20% off, and if excluding them would make a large dent in the database.

JamesSample commented 1 year ago

@RolfVogt

I have updated the notebook here to:

Filter out stations and countries that we think should not be included in the analysis (section 1.2)
Perform an ion balance and remove samples where the difference between cations and anions is more than 20% (section 4, subject to the caveats discussed above)

The original dataset comprises 97k samples. Of these, 17k do not have enough data to calculate an ion balance, 20k have a difference >20% (and are therefore removed) and 60k remain.

The notebook now includes two sets of distribution plots: the original, unfiltered data (section 3), and the filtered data (section 4). The plots in section 4 look a bit better, although there are still some significant anomalies. Calculating annual medians should help to reduce the influence of large outliers in the subsequent analysis, but let me know if you're still suspicious about the raw data distributions in section 4.

Taking the filtered data and applying the site selection criteria for all parameters except NH4 now gives:

1990-2020:    333 sites.
1990-2004:    364 sites.
1998-2012:    379 sites.
2006-2020:    347 sites.

Please take a look at the notebook and let me know if you have any comments.

Thanks!

JamesSample commented 1 year ago

@RolfVogt

Just to note that the selection criteria implemented as described above currently causes Langtjern to be removed from the dataset for all time periods because the ion balance (calculated as described above) is always highly negative (< -20%).

I therefore suspect there is either a bug in my code or a flaw in my approach. I'll investigate to see if I can find the problem...

JamesSample commented 1 year ago

@RolfVogt

After a bit more investigation, the issue appears to be with the estimation of organic anions using the Hruska method.

It seems there are several different methods for estimating organic anions:

In RESA, there is a parameter named OAN- (here), which is calculated as

$$OAN- = TOC (4.7 - 6.87 exp(-0.322 * TOC))$$

For the ICPW QC app, Øyvind Garmo has previously suggested using A-, which he described as, "an empirical simplification of the Hruska model" (here) $$A- = TOC (1.77 pH - 3)$$
For 2023 Thematic Report, I have used the "full" Hruska model implemented here. I have tested this and confirmed that the Python function gives the same values as Rolf's spreadsheet (here) when site_density=13.5 (as in the Excel file)

Are these three method expected to be broadly compatible?

Taking Langtjern as an example, the first two methods broadly agree, but the third (i.e. the "full" Hruska model) gives much higher estimates for the organic anions. This, in turn, leads to negative estimates for HCO3, which are set back to zero leading to large differences in the ion balance.

I'm not familiar with the theoretical basis for these models, but having been at NIVA for several years, I can no longer comprehend doing any data analysis that doesn't involve Langtjern ;-) Something must have gone wrong somewhere!

RolfVogt commented 1 year ago

@JamesSample Just to recap and summarize. All models for organic anions are empirically fitted. The models used in RESA and the ICPW QC app (i.e. Garmo model) are strictly empirical as they hare not conceptually based. Moreover, RESA does not account for differences in pH. Yes, testing the models on the 1000 lakes data (where HCO3- is based on Alkalinity) show that the RESA and Garmo models are correlated (r2 = 0,91), but the coefficient is 1,4, so they do not give the same result. Moreover, both the RESA and Garmo models give a lower value for the organic anion charge so that the average ion balance is + 10% and 4,7% respectively in the 1000 lake dataset. These models are also correlated to Hruska and Oliver (r2 between 0,930 - 0,997), with the best corr. between Garmo and Hruska, Though here the coefficient is 0,6. I.e., the Garmo provides 40% lower results. The Hruska (used in MAGIC) and Oliver are conceptual based and empirically fitted, and they account for differences in pH. Using the fitted (site density) Hruska and Oliver models on the 1000 lakes data gives a very good charge balance with low standard deviation. For some reason I now get an optimal site density of 16.6 for the Hruska model - sorry). The Hruska and Oliver models are very strongly correlated (r2 = 0,9943) with a coefficient of 1. The standard deviation of the ion balance is slightly lower for the Hruska then for the Oliver. I therefore wish to go for the Hruska model. You have tested these models on the Langtjern data. This is one very dystrophic site so that the average optimized site density is likely not optimal for this particular site. This may cause the Hruska and Oliver models to provide too high values for the organic charge at this site. One would have to find the site density that provides the lowest ion balance discrepancy. The site density will for Langtjern probably be closer to 10,2. It sort of makes sense that the more hydrophobic DOM material at Langtjern has a lower site density.

Hope this made things clearer. PS Pls. use a site density of 16.6 in the Hruska model on the ICPW data

RolfVogt commented 1 year ago

1000 Lakes water data.xlsx Here is the file with all the models applied on the 1000 lakes data

JamesSample commented 1 year ago

Hi @RolfVogt,

Data cleaning and site selection

I have changed the site_density parameter to 16.6 and re-run the notebooks.

As agreed, I have also removed the step that filtered samples based on the ion balance, and added a new step for basic outlier detection (using the "double MAD" method - see here).

With this approach, we now have the following site numbers meeting the selection criteria:

1990-2020:   421 sites.
1990-2004:   439 sites.
1998-2012:   457 sites.
2006-2020:   441 sites.
All periods: 403 sites.

We previously agreed that, for consistency, it would be best to use the same sites if possible (i.e. the 403 sites with data for all periods). I am currently working with these, but please note that this criterion removes all sites in Italy and Switzerland, because they don't report TOC until the early 2000s. This is not a problem for the statistical analysis, but may be undesirable for the project as a whole. If we allow site selections to vary by period, the Swiss and Italian sites would be included for the last period (2006 - 2020).

Rough maps showing the 403 sites are shown below (I'll tidy these up once the final selection is decided):

Note that, in North America, our criteria omit sites from the Appalachians and Blue Ridge Mountains, while in Europe we lose sites from the Alps and Baltic regions.

Regional trends

There is only one region and period that shows statistically "significant" regional increasing trends for Ca: Atlantic Canada during the period from 1990 to 2004. All other regions and periods have either no trend, or significant declining trends for this parameter.

I believe the increasing trend for Atlantic Canada from 1990 to 2004 also needs interpreting with caution, as the Ca data for 2001/2 at many of these sites seems unusually high. Here is a typical example (site 38149; Liberty Lake):

As you can see, the overall trend from 1990 to 2020 is not significant, but for 1990 to 2004 it is significant due to the spike in Ca in 2001 and 2002. This spike is consistent across many sites in Nova Scotia and Newfoundland, and it is not large enough to be easily discarded as an "outlier". Either something strange happened at these sites around 2001/2 or (more likely?) a change in labs/methods produced a temporary high.

Overall, I do not see any unambiguous evidence for increasing regional trends in Ca.

Site-specific trends

If we consider Sen's slope using annual median Ca concentrations for the period from 1990 to 2020, there are:

30 sites with significant increasing trends
129 sites with no significant trend
244 sites with significant decreasing trends

The sites with increasing trends are in Atlantic Canada (Nova Scotia and Newfoundland), as well as Norway, Sweden and Slovakia. At the individual site level, there is fairly convincing evidence for modest increasing trends.

Please note that data for the site with the largest increasing trend (38395, Stor-En, in Sweden) is probably not reliable: there is a sudden step-change in the data that I assume indicate (i) a reporting error, (ii) a change in labs/methods, and/or (iii) liming activities.

Results so far

I have added summary data (CSV and Excel files) and plots to the folder here.

JamesSample commented 1 year ago

@RolfVogt

Update regarding TOC/DOC in Switzerland and Italy

While working back through my notes describing previous analyses, I found the following from Øyvind Garmo:

Hi James, I got the comment below from Sandra. How should we respond? “In Figure 3 I wonder from where the TOC data in the Alps region come from. Are these the Swiss DOC data? In that case I would recommend to omit these data from the Figure. First because these are actually DOC and not TOC data. Second because we started to measure DOC only after 2000 and third because we changed analysis method around 2012, that caused an increase of the values after this year).”

This is perhaps a reasonable justification for omitting the Swiss and Italian sites from this analysis? (In the previous report we omitted these data from some parts of the analysis, but kept them elsewhere).

RolfVogt commented 1 year ago

Dear @JamesSample Looking at the maps I worry that we are getting very etnosentric by restraining us to the same sites for all periods. We are also becoming wery sensitive for wirdoes, such 38395, Stor-En, in Sweden. This is clearly due to liming. I am therefore now thinking that it is better to go for varying number of sites for the different periods. This report is mainly about the trends in Ca and Mg. These are not that much goverend by TOC. Moreover, I do not worry that much about DOC or TOC as the difference will not be large - especially for pristine alp sites, This is more a problem when we start getting anthropogenic POC into the watercourse from sewage and agriculture. What I do not like is that the change in analysis method caused in increase in DOC. That is reghardless a strong cause for omitting those data! Perhaps the best now is discuss the issue of site consistency at the TF meeting next week. I will get back to you when we have reached a concensus, Is it possible to ask for an exel sheet of the cleaned data, although with different number of sistes for the different periods.