cwickham opened 3 years ago
I'm confused because it seems that they used both test and training cities from the competition (Tuia 2017 and Yokoya 2017). This brings in the issue of getting LCZ reference data/"truth" for the test cities, since it isn't publicly available. I could email and try to get it, or we could just focus on using the training cities, since those have the "truth" data we can use to verify?
Hong Kong is a training city, so it's still a good candidate for a scaled-down replication.
Since their input data is just pulled from USGS Earth Explorer, is it an adequate reproduction to just download those files based on the dates they've given? Would it make more sense to use the reference data provided by the competition instead? I'm unclear on why they used their own input data, except maybe that it's more recent.
Charlotte suggests using competition data to avoid extra data pre-processing steps. Yoo et al. might have had reasons (e.g. better prediction performance?) for a different down/up-sampling scheme, but we won't worry about that for now.
How does bilinear resampling work? I see how you actually do it - there are just options in QGIS so that you can match resolutions when combining data - but I don't have a great understanding of what it actually does.
Not as relevant if we don't have to pre-process data. Some reading:
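For intuition in the meantime: bilinear resampling fills each output cell with a distance-weighted average of the four nearest input cell centres (nearest-neighbour just copies the closest value). A minimal sketch with the raster package; the file names here are placeholders, not the actual project files:

```r
library(raster)

target <- raster("lcz_grid.tif")       # placeholder: the grid whose resolution we want to match
band   <- raster("landsat_band4.tif")  # placeholder: the layer to be resampled

# Each output cell becomes a weighted average of the four nearest input cells
band_on_grid <- resample(band, target, method = "bilinear")
```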
Why doesn't the whole area have an LCZ classification?
Probably something to do with how they got the classification from human experts. If you want a better answer, it probably means digging into how the LCZ "truth" was generated.
How do we go from the competition data files to a data frame/tibble that you can feed into a random forest function?
Each row in that data frame is a pixel (Charlotte is pretty sure). One column will contain the response: the LCZ class. The other columns are the explanatory variables: the value for that pixel in each of the Landsat bands. Also keep track of polygon IDs.
In practice, I think the steps will roughly be (this reference from your list looks useful):
Make sure you add/keep columns for LCZ class, and polygon ID.
Probably the raster package will have the tools for that.
You'll have to split the competition training data into your own training/test split, so you can honestly evaluate your models (since you only know LCZ classifications for the training polygons); a rough sketch of a polygon-level split is below. Re-reading this part in Yoo et al. - how did they split?
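A minimal sketch of a polygon-level split, assuming a data frame shaped like the one described above (the names pixel_df, lcz, polygon_id and the 80/20 fraction are placeholders, not Yoo et al.'s choices):

```r
# Hypothetical layout: one row per pixel, with columns lcz (factor),
# polygon_id, and one column per Landsat band (B1, B2, ...).
set.seed(1)
polys       <- unique(pixel_df$polygon_id)
train_polys <- sample(polys, size = round(0.8 * length(polys)))

# Split on whole polygons so pixels from one polygon never land in both sets
train_df <- pixel_df[ pixel_df$polygon_id %in% train_polys, ]
test_df  <- pixel_df[!pixel_df$polygon_id %in% train_polys, ]
```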
I went through the example from Chris Holden and it made me wonder why our LCZ data is in raster files instead of polygon shapefiles, and whether that's advantageous for some reason. It seems like the best way to use randomForest is to convert all the data into a data frame, which we talked about. But I'm not sure which method makes the most sense (or if it really matters):
- Option A: read everything in with brick() and convert it to a data frame
- Option B: use extract() (like in the example)

Start with Option B. Option A is the backup plan if Option B seems too memory intensive.
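A minimal sketch of the Option B route, assuming the Landsat bands are already stacked and aligned to the LCZ grid (file names are placeholders, and polygon IDs still have to be added separately):

```r
library(raster)

bands <- brick("landsat_stack.tif")       # placeholder: Landsat bands on the LCZ grid
gt    <- raster("hong_kong_lcz_GT.tif")   # placeholder: one band, value = LCZ class, 0 = unlabelled

# Option A in one line would be as.data.frame(stack(bands, gt)),
# which holds every pixel in memory at once.
# Option B: extract() values for chosen cells, e.g. only the labelled ones.
labelled <- which(values(gt) != 0)
pixel_df <- as.data.frame(extract(bands, labelled))
pixel_df$lcz <- factor(gt[labelled])
```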
Looking at the data, my understanding is that there are two LCZ tiff files: one is coloured (_col) and the other is black and white, which I think is geometry and topology (_GT). So why are there two LCZ tiff files, and why does it seem like they don't say much (see ___ for histograms)? What do those numbers even mean (i.e. what LCZ classes do they correspond to)?
Charlotte suspects the four bands are: Red, Green, Blue and alpha. Combined these give the LCZ classes.
Ericka says it looks like the _GT file has one band, and the value is the class. Make a histogram of this band in the _GT file and try to confirm it looks like the LCZ classes.
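A quick check along those lines (the file name is a placeholder; the standard LCZ scheme has 17 classes, so the values should fall in 0-17 with 0 meaning unlabelled):

```r
library(raster)

gt <- raster("hong_kong_lcz_GT.tif")   # placeholder name for the _GT file

# One band; the cell values should be the LCZ class codes
table(values(gt))                      # counts per value, often easier to read than a plot
hist(values(gt), main = "_GT band values", xlab = "LCZ class code")
```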
Does it make sense that I would get rid of the zeroes while running the random forest, since they don't have a value for LCZ?
Yes, you won't need the pixels with an LCZ of zero (missing LCZ) to build the models, but you will need them later for prediction. Build the data frame with all LCZ pixels, then subset after.
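In code terms that is just a subset, assuming a full data frame with one row per pixel and an lcz column (names are placeholders):

```r
model_df <- full_df[full_df$lcz != 0, ]   # labelled pixels: used to fit and evaluate models
# keep full_df around: the zero-LCZ pixels get predictions from the fitted model later
```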
What to do with the four Landsat days? Combine? Pick one?
The data isn't exactly clipped to Hong Kong; it includes some of Shenzhen. There's one polygon that overlaps.
Let's not worry about clipping; you might need to talk about "the Hong Kong region".
There's a discrepancy in the numbers of pixels in Table 3 compared to the numbers of pixels I get when I do it.
Yes, this is an important place to verify your replication of their work. You should talk about it in the paper. But make a decision on how you'll do it, document that, and move on.
Does the "class imbalance" here matter?
A useful baseline for overall accuracy is to imagine predicting every pixel to be the most represented class.
How does this influence the model fit? Can you tell if this is a problem from the output? E.g. how does overall accuracy break down across the LCZ classes?
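A minimal sketch of both checks, assuming model_df holds the lcz factor plus band columns only (names and ntree value are placeholders):

```r
library(randomForest)

# Baseline overall accuracy: predict every pixel as the most represented class
counts      <- table(model_df$lcz)
baseline_oa <- max(counts) / sum(counts)

# Does the imbalance show up in the fit? The OOB confusion matrix includes a
# class.error column (1 - per-class accuracy); rare classes with very high
# class.error are being swamped by the common ones.
fit <- randomForest(lcz ~ ., data = model_df, ntree = 500)
fit$confusion
```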
I don't think they really explained how they divided the polygons up. It seems straightforward except that I only have the data in pixels rather than in polygons, and they mention in the paper that using pixels from the same polygon for both training and test artificially inflates accuracy assessments.
Is there a way to identify which polygon a pixel is in, like a polygon ID? If not, you need to assign it. I don't know how to do that off the top of my head: shift from raster to shapefile and do a point-in-polygon type operation, or figure out how to draw a boundary around pixels that are adjacent and of the same LCZ class.
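One possible way to do the second idea directly on the raster: label each contiguous patch of same-class pixels with raster::clump(). A sketch under the assumption that the _GT raster holds the classes with 0 meaning unlabelled (clump() also needs the igraph package installed; the file name is a placeholder):

```r
library(raster)

gt <- raster("hong_kong_lcz_GT.tif")   # placeholder file name

# clump() treats 0/NA as background, so run it one class at a time and offset the IDs
polygon_id   <- raster(gt)
polygon_id[] <- NA
next_id      <- 0
for (cl in setdiff(unique(values(gt)), c(0, NA))) {
  patches    <- clump(gt == cl, directions = 8)        # IDs 1..k for this class's patches
  polygon_id <- cover(polygon_id, patches + next_id)   # fill in cells that are still NA
  next_id    <- next_id + cellStats(patches, "max")
}
```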
Not feeling great about starting methods because I'm not sure how much detail to go into about random forest(s?) specifically vs. what I actually did with the data. The examples I've seen of people's final reports seem to go either way.
What does the output mean?
Err on the side of depth of explanation; we can trim later for the report.
Everything I've read says that Out-Of-Bag estimates are just as accurate as cross-validation, so I don't understand why they used cross-validation.
I think this is a good question, and good observation. It might be interesting to compare test data accuracy to the Out-Of-Bag accuracy estimates.
One thought is that the OOB error is going to be based on random samples of pixels, but pixels within the same polygon are probably similar, and this may overestimate the accuracy for new pixels in new polygons - the test/train split of polygons doesn't have that problem.
Another thought is the OOB error is used for tuning. If you use the data to fit many forests with different parameters and pick the one with the lowest OOB error, that OOB error is no longer an unbiased estimate of the true error, and you need another unseen set of data.
What about the 90/10 split inside the training data?
We are assuming this split is on pixels. Why not use OOB error to choose parameters here? Can you get their OA metric on the OOB samples directly from randomForest output? If you can, I'd just use the OOB estimates for parameter tuning...
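You can, at least for OA: randomForest stores OOB predictions and an OOB confusion matrix on the fitted object. A sketch, assuming train_df has the lcz factor plus band columns only (names and ntree value are placeholders):

```r
library(randomForest)

fit <- randomForest(lcz ~ ., data = train_df, ntree = 700)

# $predicted holds the OOB prediction for each training row, so OOB overall accuracy is:
mean(fit$predicted == train_df$lcz)

# The same information is in the OOB confusion matrix (its last column is class.error):
conf <- fit$confusion[, levels(train_df$lcz)]
sum(diag(conf)) / sum(conf)
```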
Once a variable has been used at a node, can it be reused on a different branch? Feels like it has to be? But never on the same branch? I'm having a hard time conceptualizing what is in the pile of variables that are randomly selected at each node.
I don't think my preprocessing actually matches up with theirs... (but does it even matter because it'd be a rabbit hole?)
Replicate "as close as practical", point differences where you think they exist.
Conceptual is fine for now; I'll point out if you need to firm it up when I review.
nodesize, maxnodes

Here as .Rmd and here as .pdf. These don't actually say anything different, but the formulas are legible and I just wanted to see it all together.

First
If time:
Charlotte:
A function that takes the parameters (e.g. mtry = 51 or 1, ntree = 700) and returns OA, OAurb, OAnat, and F-1. Get the return() part of the function set up in the way that makes the most sense.

map_drf() with ... to get the name of the parameter varied into the tibble (a rough sketch of this kind of helper is below).
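A minimal sketch of what that helper might look like. The function name, data frame names, urban/natural grouping, and parameter values are all placeholders rather than Yoo et al.'s choices, and F-1 is left out to keep it short:

```r
library(randomForest)
library(purrr)
library(dplyr)
library(tibble)

# Hypothetical helper: fit one forest with the supplied parameters (passed via ...)
# and return accuracy metrics. train_df/test_df: lcz factor plus band columns;
# urban_classes: the LCZ levels treated as "urban" (an assumption here).
fit_one_rf <- function(train_df, test_df, urban_classes, ...) {
  fit  <- randomForest(lcz ~ ., data = train_df, ...)
  pred <- predict(fit, newdata = test_df)
  urb  <- test_df$lcz %in% urban_classes
  tibble(
    OA    = mean(pred == test_df$lcz),
    OAurb = mean(pred[urb]  == test_df$lcz[urb]),
    OAnat = mean(pred[!urb] == test_df$lcz[!urb])
  )
}

# Vary one parameter and collect the results, keeping the varied value as a column
ntrees  <- c(100, 400, 700)
results <- map_dfr(ntrees, function(nt) {
  mutate(fit_one_rf(train_df, test_df, urban_classes, ntree = nt), ntree = nt)
})
```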
A place to keep a record of what we talk about.
2020-12-09
Updates
First three tasks from #1 complete.
Questions
How should Ericka send Charlotte questions/updates before meeting?
Use these issues.
Was McNemar's test appropriate here?
Charlotte doesn't know. Let's keep this in the back of our minds:
Action Items
From todo.md: