Exploring Larger ghi for Breakpoint identification

After looking into this, using 10,000 gives identical results to using 530, so I don’t see any benefit in going through 1,500, at least not at the HWI level. I think we should see if the same thing happens at the HUC6 level (maybe choose a couple contiguous HUCs to look at).

Update on this for @jdkleiner @jenniferRapp - Explored the a GHI = 1500 for the HUC6 dataset to see if that would be the same as HWI region - pretty much no news (couple teensy things of interest but nothing we can really captialize on IMO) -- brief results described below. More important I think, Jen, I am wondering if you could put into words or pictures what you think a correctly identified breakpoint in a contiguous HUC6 should look like. In fact, thinking along those lines, since we only have a handful of contiguous HUC6’s, I think a hand rendering of where we think those BPs are would be a great start (unless you have made an awesome breakthrough in automatically identifying them). I am starting to think that:

The BPs (even eyeballed) will be lower than other studies/theory of RCC have suggested.
If we have low observed breakpoints, perhaps the follow-on questions of "why" are going to be important.
We need to assess how confident we are in getting to the "why", in terms of what we try to demonstrate.

Brief Analysis of GHI I ran the HUC6s with ghi = 1500, 80% quantile, and it returned identical breakpoints as the ghi = 530 run. Suspicious of this I ran it again with ghi = 150, and that did make a difference with the handful of watersheds whose previous breakpoint was identified as being > 150 (Ex: Lower Tennessee). So, the breakpoint/PWIT was at least doing something, but raising the GHI wasn't finding the location of maximum theoretical NT.

Almost by accident I ran the 90% quantile w/ghi = 1500, and while it largely had the same breakpoints as ghi = 500, there was 1 or 2 exceptions, example: the Upper Tennessee river @ ghi=530 & 1500, Q=80% the BP was 50.1. BUT, at Quantile 90% the BP jumped to 587. And a much nicer fit.

I think there's some real value here. First off if our goal is to trace that edge of the ecological limit, then we really ought to be using the 90% quantile any time there is sufficient data. Though I am surprised that going from 80% to 90% would have such an extreme effect on where its finding the breakpoint.

I'm interested to hear some thoughts on methodology from @passeroe when she's back in a week or so, but here's my quick takeaway: These analyses suggest that out current PWIT method is not designed to truly find a breakpoint/inflection in the data, but rather to maximize the fit of the elf line - regardless of where the breakpoint is (and really it doesn't care where the breakpoint is at all as it does this).

So it looks like we need to continue thinking about:

what does a correctly identified breakpoint look like
if our methods are not correctly identifying these breakpoints, how can we improve our methods
how can we explain breakpoints that are lower than expected

OK - I see your point that at least one takeaway is that the 90% line is important. As for "breakpoint" location, I think that we need to clarify our objectives. I propose:

Breakpoint - this should remain defined as the point separating the two regression lines identified by the PWIT method, and this should be considered to be an important item (tells us where changes occur) but it is NOT the point of maximum richness as a function of flow/area. Therefore, I do not think that we need to do your item 3 - "explain [lower] breakpoints" - in fact, lower breakpoints may be desirable, since the most vulnerable areas appear to be in headwaters and we want the ELF to describe those areas with as much accuracy as possible.
"Richness Peak" (or some other name) - this is the point at which richness becomes limited by flow regime/habitat homogeneity as predicted by RCC. This is where richness either plateaus or begins to decline with increasing area/flow, and the areas that we may recommend be priority for locating consumptive withdrawals. I think that in the interest of both time and clarity we need to manually delimit these on the HUC6 level .
Define our expectation of "Richness Peak" -- Also in the interest of clarity we need to define criteria for evaluating an automated method of finding richness peaks. However, with our limited population of HUC6s, I do NOT think we need to pursue automation right now.
Defining BEST ELF criteria - Have we done this yet? We have talked about highest R2 / lowest p but not firmly declared our decision and documented it. We need to do this and then march forward (probably should set a property on the "best available ELF" (TRUE or FALSE) for every spatial unit/analysis pair.

Is this a public forum? that people outside our USGS/DEQ collaboration can read? This seems to be a research discussion issue, not a code related issue.

Last week I was working through the paper that Elaina and Scotty had initially brought to the group to make sure I understood what the Breakpoint analyses do. There are two methods listed and it looks like we used one of them, but there is a 'segmented approach' method that might be useful, IF we think we need it. The major take away I learned and read in other papers is that we should be bracketing the breakpoint closely. 1 or 72 might be too low, depending on the ELF. At the same time, I think the parallel work that Rob has done to the work I did this week/last week shows that we just may not get the inflection point that we think we want. "Piecewise regression comes about when you have ‘breakpoints’, where there are clearly two different linear relationships in the data with a sudden, sharp change in directionality"(R for Ecologists). That is what it says, but that is not really what we see in our data. It seems to be more gradual in most cases, except for the TN or Ohio River basins. The more I read about the Breakpoint analyses, the more I began to think they would get us 'close' to but not exactly at the point at which we begin to see a change in the distribution of biological data from increasing to decreasing. the R for Ecologists blog states this method, " iteratively searches within the range of breakpoints (Ghi to Glo) for the model that has the lowest residual MSE, using that as our criteria for the best model. " When there is more variability in the data, or we reach the upper end of the X-values and the data are not present, then there is not a clear pattern of change. This would be more obvious if we looked at the extracted data subset of the upper80 quantile points alone. That is what is being evaluated.
https://www.r-bloggers.com/r-for-ecologists-putting-together-a-piecewise-regression/

OF course the 80 and 90 Quantiles are going to provide different breakpoints, because they represent different pools of data. I think for our research purposes and consistency we should stick to the 80th quantile. It provides more data and has been the focus of our analyses this year.

I've been exploring the overall distribution of the datasets a bit in to see if we can select out the peak. this may be a good time to explore the way Kinsey Hoffman went from Habitat Suitability Curves to regression lines. Is there a way to 'go back' to identifying the peak of the curve, or describing the envelope? That could be an approach worth exploring.

Question: Do the QuantReg plots appear to provide the information needed for management if they are limited by an upper limit that is what we believe to be the transition point/breakpoint/inflection point where increasing NT changes to Decreasing NT.

This IS a public forum (though we can pay to make it private), and this is not code per se, but I like discussing scientific investigation questions here, especially as this will allow us to have a traceable background to coding decisions. If USGS has an issue with this we can try to work out another solution but I would prefer not to especially since this doesn't necessitate us going back and forwarding a bunch of emails to analysts that are not on the original thread.

I added your question to my list -- can we come up with answers to the below so we can move forward?

Breakpoint - this should remain defined as the point separating the two regression lines identified by the PWIT method, and this should be considered to be an important item (tells us where changes occur) but it is NOT the point of maximum richness as a function of flow/area. Therefore, I do not think that we need to do your item 3 - "explain [lower] breakpoints" - in fact, lower breakpoints may be desirable, since the most vulnerable areas appear to be in headwaters and we want the ELF to describe those areas with as much accuracy as possible.
"Richness Peak" (or some other name) - this is the point at which richness becomes limited by flow regime/habitat homogeneity as predicted by RCC. This is where richness either plateaus or begins to decline with increasing area/flow, and the areas that we may recommend be priority for locating consumptive withdrawals. I think that in the interest of both time and clarity we need to manually delimit these on the HUC6 level .
Define our expectation of "Richness Peak" -- Also in the interest of clarity we need to define criteria for evaluating an automated method of finding richness peaks. However, with our limited population of HUC6s, I do NOT think we need to pursue automation right now.
Defining BEST ELF criteria - Have we done this yet? We have talked about highest R² / lowest p but not firmly declared our decision and documented it. We need to do this and then march forward (probably should set a property on the "best available ELF" (TRUE or FALSE) for every spatial unit/analysis pair.
Question: Do the QuantReg plots appear to provide the information needed for management if they are limited by an upper limit that is what we believe to be the transition point/breakpoint/inflection point where increasing NT changes to Decreasing NT.
- The "Richness Peak" goal is to tell us, by HUC6 where the transition area is, beyond which withdrawals may have a lesser impact -- at the moment, this is NOT on the QuantReg plots (unless the "breakpoint" just happens to coincide with the "richness peak"), but I think that's OK. The Quantreg plots tell us what the slope of change is at some point below the richness peak. That is the management info that we need when evaluating points below the "Richness Peak", the extent to which the breakpoint is lower than the richness peak may be a follow on question for proposed locations in that no-mans land.

HARPgroup / r-dh-ecohydro

Exploring Larger ghi for Breakpoint identification #26