Area discrepancies between our model and Campagna data

After running a bunch of model exports over the last few days, we have realized that our model is currently identifying about 2000 km^2 less mining area over the period 1976 - 2005 than does Campagna. Of note, this figure takes into account about 1000 km^2 of mining that Campagna found occurring between 1976 and 1985 that our model did not find. Campagna found roughly 4600 km^2 by 2005, whereas we are only finding about 2700 km^2 by that date.

While we are very confident about the mining we have currently identified, and our accuracy assessment supports that confidence, we still have a big issue if we're missing so much area compared to a cited dataset. Additionally, an EPA study from a few years ago (that Matt Ross is aware of) found a number similar to the Campagna result, and Matt thinks the annual rate of new mining should be close to 100 km^2 (we are currently getting about 60). We will still likely want to report our current data (it would be bad science to ignore it), but we will want to explain why we are running the analysis to better fit the older data. We can make a good argument for this since the Campagna methodology was different from ours, so we are using prior research to inform our current results and make sure our model is accurately and comprehensively finding mines.

Just by visually comparing the Campagna results to basemap and other imagery, it does appear that in many cases those data are correct in pointing out mines. There are errors (e.g., identifying urban areas near Wise, VA, as mines), but our dataset will have errors too. The Campagna data are also, in most cases, limited by the mine permit boundaries, which means extra error in that dataset is not coming from areas outside of permits. My current best guess is that Campagna probably overestimated (but not by much), and we assuredly underestimated. Ideally, we would find the sweet spot between those two extremes.

So, we have a few courses of action:

1. Relax our thresholds so as to increase the amount of area identified as mine. This will also likely pick up non-mine areas, however.
2. Explore using other spectral indices (SAVI, EVI, etc.) either in combination with or instead of NDVI. Using an index will keep the model automated, but we will need to create some algorithm for using the results of multiple indices. E.g., a mine is an area with an NDVI < [threshold A] AND SAVI < [threshold B].
3. See if we can use the Campagna results as a way to guide our own model. For instance, we might use areas identified as mines in that study to arrive at better thresholds. This might be challenging, however, since we don't know exactly when mining occurred with that dataset.
4. As suggested by @cjthomas730, create a final product that gives levels of confidence about mining. Since we have high confidence about our current data, call those areas "high confidence" or something like that; and then use some of the methods above to identify additional mine land, calling that "medium confidence".
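
To make the multi-index rule and the confidence levels in the list above a bit more concrete, here is a rough sketch in plain Python. It is not our model: every threshold value, every function name, and the 30 m pixel size are placeholder assumptions picked only for illustration. The NDVI and SAVI formulas are the standard ones (SAVI with soil factor L = 0.5).

```python
# Hypothetical thresholds; none of these values come from the actual model.
NDVI_STRICT = 0.25    # assumed stand-in for the current, conservative NDVI threshold
NDVI_RELAXED = 0.35   # assumed relaxed NDVI threshold ("threshold A")
SAVI_RELAXED = 0.30   # assumed SAVI threshold ("threshold B")
PIXEL_AREA_KM2 = 0.0009  # assumed 30 m x 30 m pixel


def ndvi(nir, red):
    """Standard NDVI: (NIR - Red) / (NIR + Red)."""
    total = nir + red
    return (nir - red) / total if total != 0 else 0.0


def savi(nir, red, soil_factor=0.5):
    """Standard SAVI: (NIR - Red) / (NIR + Red + L) * (1 + L)."""
    total = nir + red + soil_factor
    return (nir - red) / total * (1 + soil_factor) if total != 0 else 0.0


def classify_pixel(ndvi_value, savi_value):
    """Return 'high', 'medium', or 'none' for one pixel."""
    # High confidence: passes the strict (current) NDVI rule on its own.
    if ndvi_value < NDVI_STRICT:
        return "high"
    # Medium confidence: passes the relaxed NDVI rule AND the SAVI rule.
    if ndvi_value < NDVI_RELAXED and savi_value < SAVI_RELAXED:
        return "medium"
    return "none"


def classify_grid(ndvi_grid, savi_grid):
    """Classify two equally shaped 2-D lists of index values."""
    return [
        [classify_pixel(n, s) for n, s in zip(ndvi_row, savi_row)]
        for ndvi_row, savi_row in zip(ndvi_grid, savi_grid)
    ]


def area_km2(labels, label, pixel_area_km2=PIXEL_AREA_KM2):
    """Total area of all pixels carrying the given label."""
    return sum(row.count(label) for row in labels) * pixel_area_km2


if __name__ == "__main__":
    ndvi_grid = [[0.10, 0.30], [0.30, 0.50]]
    savi_grid = [[0.50, 0.20], [0.40, 0.10]]
    labels = classify_grid(ndvi_grid, savi_grid)
    for tier in ("high", "medium"):
        print(tier, area_km2(labels, tier))
```

Reporting the area per tier separately would let us keep publishing the conservative "high" number while showing how much extra area the relaxed rule adds.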

I'm guessing at this point that it's not wise to go down the route of #2, given how exploratory it would be at this late stage. And I definitely don't like #3, as my experience with the Campagna dataset was that it had a lot of false positives. I just realized you all may never have seen Ross Geredien's analysis of that and other datasets. Not sure Ross's techniques were the best out there, but this definitely seems worth sharing:

http://ilovemountains.org/reclamation-fail/mining-extent-2009/Assessing_the_Extent_of_Mountaintop_Removal_in_Appalachia.pdf

But I really like Christian's suggestion (#4) because it covers both bases: providing a fairly conservative dataset for use by researchers while also addressing and acknowledging the underestimation of mined areas. Happy to talk more about this this afternoon.

On Tue, Sep 27, 2016 at 11:12 AM, apericak notifications@github.com wrote:


Matthew F. Wasson, Ph.D., Director of Programs Appalachian Voices

589 West King St. Boone, NC 28607 Phone: 828-262-1500 Website: www.appalachianvoices.org

"Nonviolent action, born of the awareness of suffering and nurtured by love, is the most effective way to confront adversity."

Thich Nhat Hanh, Love In Action

SkyTruth / MTR

Area discrepancies between our model and Campagna data #102