Version #1 of the RF code in R completed today. I will continue to refine it over the next few days, then post it for you to try.
Preliminary code for the Random Forest models and results comparison (number of variables vs. error rate) completed.
Based on the Kelso dataset I have (which has errors in 3 variables; I am waiting on a corrected dataset), the RF model performs best (lowest error rate = 17.87%) with 25 of the 57 total variables. Most of these are VV variables.
I think at this point it's more important to test that the code works, rather than to see the best possible model results.
NEXT STEPS:
1. I need a corrected Kelso dataset.
2. I need a pixel-level dataset for Kelso, so I can start working on the Neural Network model code.
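For context, a minimal R sketch of the variable-count comparison described above (not the actual v1 code), assuming a data frame `kelso` with an `LCGROUP` label column and the 57 predictors:

```r
library(randomForest)

set.seed(42)

# Fit on all predictors first, to rank variables by importance
rf_full <- randomForest(LCGROUP ~ ., data = kelso, ntree = 500, importance = TRUE)
imp     <- importance(rf_full, type = 1)  # mean decrease in accuracy
ranked  <- rownames(imp)[order(imp[, 1], decreasing = TRUE)]

# Refit with the top-k variables and record the OOB error rate for each k
ks  <- c(5, 10, 15, 20, 25, 30, 40, 57)
oob <- sapply(ks, function(k) {
  rf_k <- randomForest(x = kelso[, ranked[seq_len(k)]], y = kelso$LCGROUP,
                       ntree = 500)
  tail(rf_k$err.rate[, "OOB"], 1)  # OOB error after the final tree
})
plot(ks, oob, type = "b", xlab = "number of variables", ylab = "OOB error rate")
```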
Corrected Kelso dataset received - Thank you Chrissy!
Continuing to improve the model and validation methods. Latest code v13 uploaded.
Beata - for (2) in your next steps, what do you mean by pixel-level dataset?
Is this, for a given field at a given timestamp, ALL of the pixels (for all image bands) that fall within the field?
William, to start working on the neural network, Beata needs as input to R, for each field, all of the pixels that fall within that field, rather than the summary/zonal stats. This sounds like what you implemented in #33?
Here's an image :) of what I am looking for: instead of a mean/variance/range for each field, I need each field's image pixel info.
A field image is made of pixels in 3 channels (red, green, blue). So we need to generate a CSV file containing all fields with the following variables each: FID_ID, LCTYPE, LCGROUP, AREA, pixels 1 to 12288 (assuming each image is 64x64 pixels: 64 x 64 x 3 = 12288 values).
Alternatively, you could also give me photos of the 413 Kelso-area fields in JPG format.
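A minimal R sketch of building the CSV described above from one clipped GeoTIFF per field, resampling each clip to a common grid so every field yields the same number of pixel columns. The folder name, the 64x64 grid, and the `fields` lookup data frame (FID_ID/LCTYPE/LCGROUP/AREA keyed by GID) are all assumptions, and note that the S1 clips carry 2 radar bands rather than 3 RGB channels:

```r
library(raster)

# Flatten one field's clipped image into a single row of pixel values
field_row <- function(f, n = 64) {
  b        <- brick(f)
  template <- raster(extent(b), nrows = n, ncols = n, crs = crs(b))
  v        <- as.vector(values(resample(b, template)))  # all bands, band by band
  gid      <- sub(".*_(\\d+)\\.tif$", "\\1", basename(f))
  c(GID = gid, setNames(v, paste0("pix", seq_along(v))))
}

tifs <- list.files("field_clips", pattern = "\\.tif$", full.names = TRUE)
pix  <- as.data.frame(do.call(rbind, lapply(tifs, field_row)),
                      stringsAsFactors = FALSE)

# Join on the per-field attributes and write out
write.csv(merge(fields, pix, by = "GID"), "kelso_pixels.csv", row.names = FALSE)
```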
Are we talking about actual pictures of the fields, or can VV/VH be converted that way? Or are we talking about Sentinel-2 data? I'm not sure if what I've done is helpful or not, but I was working with VV/VH and just creating per-timestamp tif files of each individual field. It should technically work with Sentinel-2 data too...
Tried a 70 train / 30 test vs. a 50 train / 50 test dataset split.
The difference between train and test accuracy is smaller in the 50/50 split.
Tomorrow I will work on the top-variables models.
Kelso_Random_Forest_50train-50test_split.pdf Kelso_Random_Forest_70train-30test_split.pdf
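For reference, a minimal sketch of this kind of split comparison in R (not necessarily what the v13 code does); `kelso` and the label column are illustrative:

```r
library(randomForest)

set.seed(42)

split_accuracy <- function(df, train_frac) {
  idx   <- sample(nrow(df), size = round(train_frac * nrow(df)))
  train <- df[idx, ]
  test  <- df[-idx, ]
  rf    <- randomForest(LCGROUP ~ ., data = train, ntree = 500)
  c(train_acc = mean(predict(rf, train) == train$LCGROUP),
    test_acc  = mean(predict(rf, test)  == test$LCGROUP))
}

# Compare the gap between train and test accuracy for the two splits
sapply(c(`70/30` = 0.7, `50/50` = 0.5), split_accuracy, df = kelso)
```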
Wrote a Python script https://github.com/cropmapteam/Scotland-crop-map/commit/83e5490aed0486b4ecc799a3235e36169bf4b62d that calls the GDAL gdalwarp command-line tool to crop a raster image to a shapefile, giving an output image something like this:
i.e. the red outline is the field boundary that the pixels have been output for
The cropped image has the same properties as the input image, i.e. has the 2 bands plus the georeferencing. Running the script in the background to produce data for each of the 413 fields across all images took ~104 mins and produced 30149 individual tiff files.
It needs to be decided whether an interior buffer should be applied when sampling the pixels in the field. I guess we could try two sets of data - one not buffered and one buffered (possibly at different distances) - to see what effect buffering or not buffering has.
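The script itself is Python (commit linked above), but for the record, a minimal R sketch of the same gdalwarp call with an optional interior buffer applied via a negative sf::st_buffer distance; the function name and the buffer default are illustrative:

```r
library(sf)

clip_field <- function(image, field_shp, out_tif, buffer_m = 0) {
  cutline <- field_shp
  if (buffer_m > 0) {
    # Interior buffer: shrink the field polygon by buffer_m metres
    # (a negative buffer distance works because OSGB units are metres)
    poly    <- st_buffer(st_read(field_shp, quiet = TRUE), dist = -buffer_m)
    cutline <- tempfile(fileext = ".shp")
    st_write(poly, cutline, quiet = TRUE)
  }
  # Crop the raster to the (possibly shrunk) field boundary
  system2("gdalwarp", c("-cutline", cutline, "-crop_to_cutline", image, out_tif))
}
```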
A zipfile with S1 data clipped to field boundaries is available as S1_data_clipped_to_GT_Polys.zip from here:
The S1_data_clipped_to_GT_Polys.zip has the following contents:
- A GTFieldPolys subfolder containing:
  - _ground_truth_v5_2018_inspection_kelso_250619c.shp, which is the 413-feature Kelso GT shapefile disaggregated in FME (giving 478 records) so that there is 1 polygon geometry per record rather than the mix of single-part/multi-part polygons present in the original shapefile. A new unique GID column has also been added to uniquely identify each field feature.
  - an _indvpolys folder, which contains 1 shapefile for every record in _ground_truth_v5_2018_inspection_kelso_250619c.shp, with which to clip the S1 data.
- A Valid sub-folder containing 23201 S1 image clips, with filenames like:
  S1B_20180922_30_asc_175817_175842_DV_Gamma-0_GB_OSGB_RCTK_SpkRL_9.tif
  i.e. the original S1 image name plus a _GID suffix (here _9) identifying the GID of the field polygon in _ground_truth_v5_2018_inspection_kelso_250619c.shp to which the S1 image has been clipped.
- A NotValid sub-folder containing 11795 S1 image clips which validation deemed not valid because all S1 radar pixels were null, probably because the part of the image that the field boundary intersected was an area of NoData.
- LUT.csv, a lookup table mapping gid to the ground truth lcgroup/lctype labels.
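For anyone consuming the zip, a minimal R sketch of matching the clips back to their labels via the _GID filename suffix and LUT.csv (the exact column names in LUT.csv are assumed):

```r
# Pull the trailing _GID out of each clip's filename
tifs  <- list.files("Valid", pattern = "\\.tif$", full.names = TRUE)
clips <- data.frame(file = tifs,
                    gid  = as.integer(sub(".*_(\\d+)\\.tif$", "\\1",
                                          basename(tifs))))

# Attach the ground-truth labels from the lookup table
lut      <- read.csv("LUT.csv")
labelled <- merge(clips, lut, by = "gid")
```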
Random Forest best model selected. Results presentation sent to the team for comments. The model will be reviewed and tested further once we have more data.
Opened new issue #39 for the Neural Net model, so we don't mix two different dataset requirements. This issue is now completed. Random Forest R code uploaded.
Removing the 5 GRS1-5 classes gives much better accuracy: it went from 84% to 89%.
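A minimal sketch of that filtering step, assuming the grass classes appear as GRS1..GRS5 in the LCTYPE column of the `kelso` data frame:

```r
# Drop the five grass classes and remove their now-empty factor levels
keep   <- !(kelso$LCTYPE %in% paste0("GRS", 1:5))
kelso2 <- droplevels(kelso[keep, ])
```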
Got the final Kelso labelled dataset from James this morning. Re-ran the Random Forest model. Accuracy is now 91.8%.
@geojamesc For future datasets, please use the following variable names (basically the same, but in capital letters): Id, FID_1, LCTYPE, LCGROUP, AREA.
Thank you!
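Until the datasets come through with those headers, a one-off rename on load works as a stopgap; a minimal sketch assuming the current files use the lower-case forms of the same five names (the file name is illustrative):

```r
kelso <- read.csv("kelso_zonal_stats.csv")

# Map the current headers onto the agreed names (assumes all five are present)
old <- c("id", "fid_1", "lctype", "lcgroup", "area")
new <- c("Id", "FID_1", "LCTYPE", "LCGROUP", "AREA")
names(kelso)[match(old, names(kelso))] <- new
```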
Done as https://github.com/cropmapteam/Scotland-crop-map/commit/df36b3462f5c7a393d47833442d0895a1c0e2eee
Output CSV is like this:
@geojamesc It's perfect.
Code completed, including generating output CSV files from each of the 3 RF models.
Example of the output file:
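For reference, a minimal sketch of writing one such per-model output file in R; the object and column names are illustrative:

```r
# Per-field ground truth alongside the model's prediction, one CSV per model
out <- data.frame(GID       = test$GID,
                  actual    = test$LCGROUP,
                  predicted = predict(rf_model, test))
write.csv(out, "rf_model_predictions.csv", row.names = FALSE)
```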
Start to write Random Forest code to test the approach with the current ground truth. James had started looking at this.