SkyTruth / MTR

Mountain Top Removal

Create accuracy assessment sample locations #68

Closed apericak closed 8 years ago

apericak commented 8 years ago

Now that we have created yearly thresholds for Landsat 4 - 8 (see #56 and the table of results), we need to perform yearly accuracy assessments.

Before the assessment, though, we need to establish sampling plots and random sample locations per plot:

1. For each year, establish 10 random plots, making sure each plot contains at least one active mine in that year. These plots can be the same shape/size as the ones Tita used, for convenience (100 km2, if I remember right?). Ideally these should be randomly placed over our study area, but since we've already done a lot of work to classify the original 3 Tita sites, we can probably keep those for those years.

2. Within each sample plot, have GIS/GEE place 1500 random points. For each sample plot, per year, we want 100 points to fall over non-mined areas and 50 points to fall over active mine areas (in other words, we know there is more non-mined area than mined area, so our number of sample points reflects that). Hopefully 1500 random points will give us this distribution; if in practice it doesn't, increase the number of random points created. (See the sketch after this list for one way to set this up.)

3. Starting from the top of the points' attribute table, manually classify the points as mine or non-mine. We don't need to worry about whether our classification script labeled the point as mine or not, since we can write another script to do that for us; we just need to know whether a randomly-located point really is or is not an active mine for that year. To do this manual classification, use NAIP imagery when possible (matching the year of analysis), and otherwise use Landsat imagery. If you are unsure whether a point is or is not an active mine (unclear imagery, right on the border between mined and non-mined, maybe it's already in reclamation, etc.), just skip it. Once you have classified 100 non-mine points and 50 mine points, there is no need to classify any further points.
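A minimal sketch of the point setup in steps 2 and 3, assuming the plots are available as Shapely polygons; the plot geometry, point counts, and random seed here are illustrative placeholders, not project code:

```python
import random
from shapely.geometry import Point, Polygon

def random_points_in_plot(plot: Polygon, n: int, seed: int = 0) -> list:
    """Scatter n uniformly random points inside a plot polygon."""
    rng = random.Random(seed)
    minx, miny, maxx, maxy = plot.bounds
    points = []
    while len(points) < n:
        candidate = Point(rng.uniform(minx, maxx), rng.uniform(miny, maxy))
        if plot.contains(candidate):  # rejection sampling against the plot boundary
            points.append(candidate)
    return points

# Hypothetical 10 km x 10 km square plot (coordinates in meters)
plot = Polygon([(0, 0), (10000, 0), (10000, 10000), (0, 10000)])
candidates = random_points_in_plot(plot, 1500)

# During manual classification, walk the candidate list in order and stop
# once the per-plot, per-year quotas are filled.
quota = {"non-mine": 100, "mine": 50}
```

Using a fixed seed (or exporting the generated points) is what makes it possible to revisit the exact same sample locations later if the thresholds change.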

Remember that this process needs to take place for each year of our analysis. That means we need to manually classify [150 points per plot] * [10 plots per year] * [32 years for Landsat 4 - 8] = 48,000 points. So it will be very important that we keep track of these sample locations per year; in case we have to do another accuracy assessment (for example, if we change our thresholds), we will not want to classify all these points again.

Also note that creating these sample points does not require any output from our primary classification script; we will use that script's outputs later to decide whether the script correctly classified mines.

cjthomas730 commented 8 years ago

@apericak so we'll need to establish a new set of random plots for each year we test?

apericak commented 8 years ago

@cjthomas730 I guess the plots themselves don't have to vary, so long as there is at least one mine in the plot for that year. Having a mine is really the big requirement, since otherwise we won't know how well the classifier found mines. The random points within each plot will have to change each time, to keep it sufficiently random, and to account for the fact that certain areas may turn from non-mine to mine and back again.

cjthomas730 commented 8 years ago

@apericak as far as the plot generation goes, to what degree are we selecting the sites? If we want to make sure they all have at least some amount of active mining, will randomly generating sites give us enough locations? Or will we use a process similar to the classification point generation; that is to say, should we just generate more sample areas until we have 10 that contain mining?

apericak commented 8 years ago

@cjthomas730 First, for reference: I'm offering my best ideas about this process, but I don't know that I'm "right" or whether there are better ways, so I'm open to hearing your ideas as well.

But to your question: the goal here, in addition to making sure a plot has active mining, is that the plots are relatively (visually) well-distributed across the study extent. [We could run spatial stats to make sure the plots aren't spatially autocorrelated, but I think that's overkill here.] However, we want this process to be as random as possible, so I would say: have the GIS randomly locate a plot, then check whether it has a mine (throwing out plots that don't); do that for the first five or so plots. For the next five, also have the GIS randomly locate each plot and check whether it has a mine, but then decide whether or not to keep it based on how close it is to the others and whether it will lead to a good distribution of plots across the study area. So the actual plot locations are still random, but we're deciding whether to keep certain randomly-located plots so that we faithfully sample across our study area, ultimately leading to a more correct accuracy assessment.
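A rough sketch of the reject-if-no-mine loop described above, assuming the study area and active-mine footprints are available as Shapely geometries; the function name, plot radius, and seed are placeholders, and the manual keep/reject pass for spatial distribution would still follow by eye:

```python
import random
from shapely.geometry import Point

def draw_candidate_plots(study_area, mine_footprints, n_plots=10,
                         radius_m=9000, seed=0):
    """Randomly place circular plots inside the study area, keeping only
    those that overlap at least one active-mine footprint for the target year."""
    rng = random.Random(seed)
    minx, miny, maxx, maxy = study_area.bounds
    kept = []
    while len(kept) < n_plots:
        center = Point(rng.uniform(minx, maxx), rng.uniform(miny, maxy))
        plot = center.buffer(radius_m)  # circular plot around the random center
        if not study_area.contains(plot):
            continue  # plot spills outside the study area; redraw
        if not any(plot.intersects(mine) for mine in mine_footprints):
            continue  # no active mining in this plot; throw it out
        kept.append(plot)
    return kept
```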

cjthomas730 commented 8 years ago

@apericak this all looks good to me. I was thinking, though: Tita's study areas were 100 sq. miles in area; might we want to switch to metric units?

apericak commented 8 years ago

Yep sounds good about the metric--for communicating our results to a wider public, we will likely report in miles, but for doing this analysis (especially because we will be getting it published in an academic journal) using metric makes the most sense. Since 10 mi = 16.1 km, should we bump down the plot size to 15 x 15 km? Or was it a circle, in which case we can bump it down to 250 km2?

cjthomas730 commented 8 years ago

It was a circle. 100 square miles is approximately equal to 258 square kilometers. So do you think we should go with 250, 258, or 260 km²?
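For reference, a quick conversion check (100 mi² is roughly 259.0 km²) and the circle radius each candidate area implies; this is just arithmetic, not project code:

```python
import math

SQ_MI_TO_SQ_KM = 2.58999  # 1 square mile in square kilometers
print(100 * SQ_MI_TO_SQ_KM)  # ~258.999 km^2, Tita's original plot size

for area_km2 in (250, 258, 260):
    radius_km = math.sqrt(area_km2 / math.pi)
    print(f"{area_km2} km^2 -> radius ~{radius_km:.2f} km")
# 250 km^2 -> ~8.92 km; 258 km^2 -> ~9.06 km; 260 km^2 -> ~9.10 km
```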

cjthomas730 commented 8 years ago

@apericak I'll send along the sample areas and points shortly. Also, I think you and I should develop some guidelines for how points are classified.

cjthomas730 commented 8 years ago

We have 10 study sites that cover mining activity from 1984-2015; see the screenshot below.

(Screenshot: qgis_2_8_3-wien)

The sample areas and study points (5000 per area) are linked here:
EPSG:5072: Sample-Points-Areas_EPSG-5072.zip
EPSG:3857: Sample-Points-Areas_EPSG-3857.zip

cc / @apericak

cjthomas730 commented 8 years ago

@apericak I'm going to close this now that we've got the process established, but I'll reference it in the ticket I'm creating for conducting the point classification.