Version #1 of the RF code in R completed today. I will continue to refine it over the next few days, then post it for you to try.
Preliminary code for the Random Forest models and results comparison (number of variables vs. error rate) completed.
Based on the Kelso dataset I have (which has errors in 3 variables; I am waiting on a corrected dataset), the RF model performs best (lowest error rate = 17.87%) with 25 of the 57 total variables. Most of these are VV variables.
I think at this point it's more important to test that the code works, rather than to see the best possible model results.
NEXT STEPS:
1. I need a corrected Kelso dataset.
2. I need a pixel-level dataset for Kelso, so I can start working on the Neural Network model code.
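For context, a minimal R sketch of the variable-count comparison described above (not the actual v1 code), assuming a data frame `kelso` with an `LCGROUP` label column and the 57 predictors:

```r
library(randomForest)

set.seed(42)

# Fit on all predictors first, to rank variables by importance
rf_full <- randomForest(LCGROUP ~ ., data = kelso, ntree = 500, importance = TRUE)
imp     <- importance(rf_full, type = 1)  # mean decrease in accuracy
ranked  <- rownames(imp)[order(imp[, 1], decreasing = TRUE)]

# Refit with the top-k variables and record the OOB error rate for each k
ks  <- c(5, 10, 15, 20, 25, 30, 40, 57)
oob <- sapply(ks, function(k) {
  rf_k <- randomForest(x = kelso[, ranked[seq_len(k)]], y = kelso$LCGROUP,
                       ntree = 500)
  tail(rf_k$err.rate[, "OOB"], 1)  # OOB error after the final tree
})
plot(ks, oob, type = "b", xlab = "number of variables", ylab = "OOB error rate")
```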
Corrected Kelso dataset received - Thank you Chrissy!
Continuing to improve the model and validation methods. Latest code v13 uploaded.
Beata - for (2) in your next steps, what do you mean by pixel-level dataset?
Is this, for a given field at a given timestamp, ALL of the pixels (for all image bands) that fall within the field?
William, to start working on the neural network, Beata needs as input to R, for each field, all of the pixels that fall within that field, rather than the summary/zonal stats. This sounds like what you implemented in #33?
Here's an image :) of what I am looking for: instead of a mean/variance/range for each field, I need each field's image pixel info.
A field image is made of pixels in 3 channels (red, green, blue). So we need to generate a CSV file containing all fields with the following variables each: FID_ID, LCTYPE, LCGROUP, AREA, pixels 1 to 12288 (assuming each image is 64x64 pixels: 64 x 64 x 3 = 12288 values).
Alternatively, you could also give me photos of the 413 Kelso-area fields in JPG format.
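A minimal R sketch of building the CSV described above from one clipped GeoTIFF per field, resampling each clip to a common grid so every field yields the same number of pixel columns. The folder name, the 64x64 grid, and the `fields` lookup data frame (FID_ID/LCTYPE/LCGROUP/AREA keyed by GID) are all assumptions, and note that the S1 clips carry 2 radar bands rather than 3 RGB channels:

```r
library(raster)

# Flatten one field's clipped image into a single row of pixel values
field_row <- function(f, n = 64) {
  b        <- brick(f)
  template <- raster(extent(b), nrows = n, ncols = n, crs = crs(b))
  v        <- as.vector(values(resample(b, template)))  # all bands, band by band
  gid      <- sub(".*_(\\d+)\\.tif$", "\\1", basename(f))
  c(GID = gid, setNames(v, paste0("pix", seq_along(v))))
}

tifs <- list.files("field_clips", pattern = "\\.tif$", full.names = TRUE)
pix  <- as.data.frame(do.call(rbind, lapply(tifs, field_row)),
                      stringsAsFactors = FALSE)

# Join on the per-field attributes and write out
write.csv(merge(fields, pix, by = "GID"), "kelso_pixels.csv", row.names = FALSE)
```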
Are we talking about actual pictures of the fields, or can VV/VH be converted that way? Or are we talking about Sentinel-2 data? I'm not sure if what I've done is helpful or not, but I was working with VV/VH and just creating per-timestamp tif files of each individual field. It should technically work with Sentinel-2 data too...
Tried a 70 train / 30 test vs. a 50 train / 50 test dataset split.
The difference between train and test accuracy is smaller in the 50/50 split.
Tomorrow I will work on the top-variables models.
Kelso_Random_Forest_50train-50test_split.pdf Kelso_Random_Forest_70train-30test_split.pdf
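For reference, a minimal sketch of this kind of split comparison in R (not necessarily what the v13 code does); `kelso` and the label column are illustrative:

```r
library(randomForest)

set.seed(42)

split_accuracy <- function(df, train_frac) {
  idx   <- sample(nrow(df), size = round(train_frac * nrow(df)))
  train <- df[idx, ]
  test  <- df[-idx, ]
  rf    <- randomForest(LCGROUP ~ ., data = train, ntree = 500)
  c(train_acc = mean(predict(rf, train) == train$LCGROUP),
    test_acc  = mean(predict(rf, test)  == test$LCGROUP))
}

# Compare the gap between train and test accuracy for the two splits
sapply(c(`70/30` = 0.7, `50/50` = 0.5), split_accuracy, df = kelso)
```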
Wrote a Python script https://github.com/cropmapteam/Scotland-crop-map/commit/83e5490aed0486b4ecc799a3235e36169bf4b62d that calls the GDAL gdalwarp command-line tool to crop a raster image to a shapefile, giving an output image something like this:
i.e. the red outline is the field boundary that the pixels have been output for
The cropped image has the same properties as the input image, i.e. has the 2 bands plus the georeferencing. Running the script in the background to produce data for each of the 413 fields across all images took ~104 mins and produced 30149 individual tiff files.
It needs to be decided whether an interior buffer should be applied when sampling the pixels in the field. I guess we could try two sets of data - one not buffered and one buffered (possibly at different distances) - to see what effect buffering or not buffering has.
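The script itself is Python (commit linked above), but for the record, a minimal R sketch of the same gdalwarp call with an optional interior buffer applied via a negative sf::st_buffer distance; the function name and the buffer default are illustrative:

```r
library(sf)

clip_field <- function(image, field_shp, out_tif, buffer_m = 0) {
  cutline <- field_shp
  if (buffer_m > 0) {
    # Interior buffer: shrink the field polygon by buffer_m metres
    # (a negative buffer distance works because OSGB units are metres)
    poly    <- st_buffer(st_read(field_shp, quiet = TRUE), dist = -buffer_m)
    cutline <- tempfile(fileext = ".shp")
    st_write(poly, cutline, quiet = TRUE)
  }
  # Crop the raster to the (possibly shrunk) field boundary
  system2("gdalwarp", c("-cutline", cutline, "-crop_to_cutline", image, out_tif))
}
```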
A zipfile with S1 data clipped to field boundaries is available as S1_data_clipped_to_GT_Polys.zip from here:
The S1_data_clipped_to_GT_Polys.zip has the following contents:
- A GTFieldPolys subfolder containing:
  - _ground_truth_v5_2018_inspection_kelso_250619c.shp, which is the 413-feature Kelso GT shapefile disaggregated in FME (giving 478 records) so that there is 1 polygon geometry per record rather than the mix of single-part/multi-part polygons present in the original shapefile. A new unique GID column has also been added to uniquely identify each field feature.
  - an _indvpolys folder, which contains 1 shapefile for every record in _ground_truth_v5_2018_inspection_kelso_250619c.shp, with which to clip the S1 data.
- A Valid sub-folder containing 23201 S1 image clips, with filenames like:
  S1B_20180922_30_asc_175817_175842_DV_Gamma-0_GB_OSGB_RCTK_SpkRL_9.tif
  i.e. the original S1 image name plus a _GID suffix (here _9) identifying the GID of the field polygon in _ground_truth_v5_2018_inspection_kelso_250619c.shp to which the S1 image has been clipped.
- A NotValid sub-folder containing 11795 S1 image clips which validation deemed not valid because all S1 radar pixels were null, probably because the part of the image that the field boundary intersected was an area of NoData.
- LUT.csv, a lookup table mapping gid to the ground truth lcgroup/lctype labels.
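For anyone consuming the zip, a minimal R sketch of matching the clips back to their labels via the _GID filename suffix and LUT.csv (the exact column names in LUT.csv are assumed):

```r
# Pull the trailing _GID out of each clip's filename
tifs  <- list.files("Valid", pattern = "\\.tif$", full.names = TRUE)
clips <- data.frame(file = tifs,
                    gid  = as.integer(sub(".*_(\\d+)\\.tif$", "\\1",
                                          basename(tifs))))

# Attach the ground-truth labels from the lookup table
lut      <- read.csv("LUT.csv")
labelled <- merge(clips, lut, by = "gid")
```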
Random Forest best model selected. Results presentation sent to the team for comments. The model will be reviewed and tested further once we have more data.
Opened new issue #39 for the Neural Net model, so we don't mix two different dataset requirements. This issue is now completed. Random Forest R code uploaded.
Removing the 5 GRS1-5 classes gives much better accuracy: it went from 84% to 89%.
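A minimal sketch of that filtering step, assuming the grass classes appear as GRS1..GRS5 in the LCTYPE column of the `kelso` data frame:

```r
# Drop the five grass classes and remove their now-empty factor levels
keep   <- !(kelso$LCTYPE %in% paste0("GRS", 1:5))
kelso2 <- droplevels(kelso[keep, ])
```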
Got the final Kelso labelled dataset from James this morning. Re-ran the Random Forest model. Accuracy is now 91.8%.
@geojamesc For future datasets, please use the following variable names (basically the same, but in capital letters): Id, FID_1, LCTYPE, LCGROUP, AREA.
Thank you!
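Until the datasets come through with those headers, a one-off rename on load works as a stopgap; a minimal sketch assuming the current files use the lower-case forms of the same five names (the file name is illustrative):

```r
kelso <- read.csv("kelso_zonal_stats.csv")

# Map the current headers onto the agreed names (assumes all five are present)
old <- c("id", "fid_1", "lctype", "lcgroup", "area")
new <- c("Id", "FID_1", "LCTYPE", "LCGROUP", "AREA")
names(kelso)[match(old, names(kelso))] <- new
```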
Done as https://github.com/cropmapteam/Scotland-crop-map/commit/df36b3462f5c7a393d47833442d0895a1c0e2eee
Output CSV is like this:
@geojamesc It's perfect.
Code completed, including generating output CSV files from each of the 3 RF models.
Example of the output file:
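For reference, a minimal sketch of writing one such per-model output file in R; the object and column names are illustrative:

```r
# Per-field ground truth alongside the model's prediction, one CSV per model
out <- data.frame(GID       = test$GID,
                  actual    = test$LCGROUP,
                  predicted = predict(rf_model, test))
write.csv(out, "rf_model_predictions.csv", row.names = FALSE)
```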
Start to write Random Forest code to test the approach with the current ground truth. James had started looking at this.