OpenDroneMap / ODMSemantic3D

An open photogrammetry dataset of classified 3D point clouds for automated semantic segmentation. CC BY-SA 4.0

Evaluation Metrics & Process #47

Open Ty4Code opened 7 months ago

Ty4Code commented 7 months ago

NOTE: This isn't really an issue, more of a discussion topic / idea that I wanted to raise and get some feedback on to see if there is interest or value in it before I implement something for it. Also a huge thanks to everyone that's contributed to this project & OpenPointClass, lots of amazing work already done so kudos!

Idea: What if we add an additional set of sub-task evaluation metrics that evaluate how accurately DTMs can be produced from the point cloud classification?

My current understanding is that the evaluation metrics used so far focus on the classification metrics for the point cloud. For example, the metrics found on this PR (https://github.com/OpenDroneMap/ODMSemantic3D/pull/46).

So models are currently evaluated on how accurately they are able to classify points, which makes a lot of sense. The only question I then have is: how accurate are the DTMs generated using the point cloud classification?

For example, I imagine that we could have two models, M1 and M2. It's quite possible that M1 might have worse point classification precision/recall/accuracy scores compared to M2, but could produce higher quality/more accurate DTMs from the classified point clouds.

For that reason, I thought it might be a good idea to add in a new 'subtask' evaluation routine that is run as follows:

  1. Use the ground-truth classified point clouds to produce a 'ground truth DTM' file for each ground truth cloud using the pc2dem.py script.
  2. Using the current model under evaluation, run the point cloud classification routine as normal, then run the outputs through the pc2dem.py script to produce a 'predicted DTM' file.
  3. Finally, we take the 'ground truth DTM' and the 'predicted DTM' outputs from steps 1 & 2 and perform some type of evaluation routine to compare the predictions to the ground truth (a rough sketch of this comparison is below).
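
Here's a rough sketch of what step 3 could look like (purely illustrative: it assumes the two DTMs from steps 1 & 2 were written out as GeoTIFFs on the same grid by pc2dem.py and that numpy/rasterio are installed; the file names are placeholders):

```python
# Compare a predicted DTM against a ground-truth DTM (both GeoTIFFs on the
# same grid). File names are placeholders, not actual dataset files.
import numpy as np
import rasterio

def dtm_error_metrics(gt_dtm_path, pred_dtm_path):
    with rasterio.open(gt_dtm_path) as gt, rasterio.open(pred_dtm_path) as pred:
        gt_z = gt.read(1, masked=True)
        pred_z = pred.read(1, masked=True)

    # Only score cells that are valid (non-nodata) in both rasters.
    valid = ~(np.ma.getmaskarray(gt_z) | np.ma.getmaskarray(pred_z))
    abs_err = np.abs(np.asarray(gt_z)[valid] - np.asarray(pred_z)[valid])

    return {
        "cell_count": int(valid.sum()),
        "mae": float(abs_err.mean()),
        "rmse": float(np.sqrt((abs_err ** 2).mean())),
        "max_error": float(abs_err.max()),
    }

print(dtm_error_metrics("ground_truth_dtm.tif", "predicted_dtm.tif"))
```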

This would produce a new set of 'DTM estimation metrics' that would be complementary to the current set of 'point cloud classification metrics'. I would like to hear what others think, does this seem like a useful addition that could be pulled/merged in, or does it not align with the current goals of the project & dataset?

pierotofy commented 7 months ago

It's an interesting idea, but how would that differ from comparing the metrics for the "ground" class?

Ty4Code commented 7 months ago

> It's an interesting idea, but how would that differ from comparing the metrics for the "ground" class?

Well for example, imagine you have two models, M1 and M2 that you have trained to classify point clouds.

Maybe you calculate the precision/recall of M1 for classifying 'ground' points, and it has 90% precision and 90% recall.

Also let's say you calculate the precision/recall of M2 for classifying 'ground' points and it has 80% precision and 80% recall.

So for the purpose of 'ground' point classification, M1 is clearly a better model than M2.

However, let's say we use both of these models to classify point clouds and then generate a DTM from each. It is quite possible that M1 classifies more points correctly as 'ground', BUT the points that it misclassifies as 'ground' might have a huge elevation delta that would result in large DTM errors.

Example: Imagine you have a point cloud with 1000 points, where 500 points are from the ground and 500 points are from a building on top of the ground, where most of the points lie along the walls & roof surface of the building.

Now imagine model M1 is able to classify 450 of the ground points as 'ground' and classifies 50 points from the building roof surface as 'ground'. So the precision is 90% and the recall is 90%.

Maybe M2 is only able to classify 400 of the ground points as 'ground' and classifies 100 points from the building walls close to the ground as 'ground'. So the precision is 80% and the recall is 80% for classifying ground points.

So when it comes to classifying ground points, model M1 is clearly better than M2. However, if our goal is to generate a DTM, then M2's DTM will clearly be much more accurate, because the building points it mis-classified were close to the ground, so the DTM gap-fill will produce a surface close to the elevation of the real terrain.

On the other hand, the DTM produced from M1's predictions will be much worse, because the roof elevations it classified as ground will be treated as terrain surface and introduce large errors into the DTM.

So that's just a simple example to show how a model can have better ground point classification metrics but still produce worse DTMs.
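
Just to put toy numbers on it (the elevation offsets here are completely made up for illustration: roughly 10m for roof points and 0.5m for wall points near the ground):

```python
# Toy numbers from the example above: true ground points recovered (tp),
# non-ground points mislabelled as ground (fp), missed ground points (fn),
# and an assumed elevation offset of the mislabelled points above the terrain.
models = {
    "M1": {"tp": 450, "fp": 50, "fn": 50, "fp_offset_m": 10.0},   # roof points
    "M2": {"tp": 400, "fp": 100, "fn": 100, "fp_offset_m": 0.5},  # wall-base points
}

for name, m in models.items():
    precision = m["tp"] / (m["tp"] + m["fp"])
    recall = m["tp"] / (m["tp"] + m["fn"])
    # Crude proxy for DTM damage: average elevation error contributed by the
    # misclassified points that end up in the terrain surface.
    dtm_error_proxy = m["fp"] * m["fp_offset_m"] / (m["tp"] + m["fp"])
    print(f"{name}: precision={precision:.0%} recall={recall:.0%} "
          f"DTM error from false positives ~{dtm_error_proxy:.2f}m")

# M1: precision=90% recall=90%, but ~1.00m of terrain error from roof points.
# M2: precision=80% recall=80%, but only ~0.10m of error from wall-base points.
```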

That's the general motivation behind adding the sub-task metrics: so we can directly understand how good the DTMs are, and try to optimise for models that have better DTM metrics even if they have worse 'ground' classification accuracy. I'm curious to hear your thoughts on whether this makes sense or aligns with the dataset goals.

Ty4Code commented 7 months ago

I think my example might have been a bit hard to follow with text, I wish I had some visualisations.

Just to summarise shortly: The current metrics for "ground" class treat all points the same. So if you mis-classify a point as 'ground' when it was a rooftop or a treetop or a small bucket on the ground, then that has the same 'error' in the metric.

But when we care about generating a DTM, it's much 'worse' (has a larger cost/error) to mis-classify a treetop as ground than it is to mis-classify a small bucket on the ground. But if we only look at the current metrics for the "ground" class, they would say there is no difference.

The current metrics are useful and should be kept, but there's an old saying that what is not measured cannot be improved. In this case, if we are not measuring the final DTM accuracy, then who's to say that any new models trained/released are actually improving the DTM? Maybe a new model has better ground class metrics but is actually producing worse DTMs for ODM, and we would never know unless we measure the DTM metrics.

Ty4Code commented 7 months ago

Actually, I just had another idea in a similar vein @pierotofy: LightGBM has an option to provide sample weightings during training.

So you could easily add in a weighting that is calculated from the ground truth DEM, so that during training the model will be able to learn that it is worse to mis-classify points with large elevation deltas from the terrain/ground. So the model would be able to better learn those patterns and might be able to generate higher quality DTM models without adding any new training data at all.
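
As a rough sketch of the idea (note this is only conceptual Python, not OpenPointClass's actual training code, which goes through the LightGBM C API; `X`, `y`, `point_z`, `terrain_height` and `num_classes` are placeholders):

```python
# Weight each training point by how far it sits above the ground-truth terrain,
# so mislabelling a treetop as 'ground' costs more than mislabelling a low
# object near the ground. All input arrays below are placeholders.
import numpy as np
import lightgbm as lgb

elevation_delta = np.abs(point_z - terrain_height)  # height above the true DTM
weights = 1.0 + elevation_delta                     # simple linear weighting

train_set = lgb.Dataset(X, label=y, weight=weights)
params = {"objective": "multiclass", "num_class": num_classes}
booster = lgb.train(params, train_set, num_boost_round=100)
```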

I'm also curious, have you experimented at all with hyperparameter tuning? I noticed that the current learning rate and 'num_leaves' parameters are hard-coded, and I wonder if there could be some easy gains in performance from a search over those parameters to find the best ones. Not sure if you've already done this.
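
For reference, a small search over those two parameters could be as simple as something like this (a sketch using LightGBM's scikit-learn wrapper; `X`/`y` are placeholder features/labels and the candidate values are arbitrary):

```python
# Toy grid search over the two currently hard-coded parameters.
from lightgbm import LGBMClassifier
from sklearn.model_selection import GridSearchCV

search = GridSearchCV(
    LGBMClassifier(n_estimators=200),
    param_grid={
        "learning_rate": [0.05, 0.1, 0.2],  # arbitrary candidate values
        "num_leaves": [31, 63, 127],
    },
    cv=3,
    scoring="f1_macro",
)
search.fit(X, y)  # X, y: placeholder point features and class labels
print(search.best_params_, search.best_score_)
```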

pierotofy commented 7 months ago

That makes sense, thanks for the explanation.

It could be an interesting addition.

I have not played with hyperparameters much. We'd welcome improvements in this area as well.

Ty4Code commented 7 months ago

Quick update! I put together a python script that can be run with something like:

python3 evaluate_opc_dtm.py --input_point_cloud /data/ground_truth_pc.laz --input_opc_model /data/opc-v1.3_model.bin

It will:

  1. Load the input point cloud and generate a DTM and a DSM using pc2dem.py
  2. Run pcclassify with the input model on the point cloud and then generate a DTM from the re-classified cloud using pc2dem.py
  3. Run a bunch of evaluation metrics to compare the 'predicted DTM' to the 'ground truth DTM'.

As output, it can save a stats file with JSON 'DTM evaluation metrics' and can also save some graphs, which are helpful for debugging the DEMs and the errors your model is getting.

Questions:

Adding some extra info below on the evaluation metrics I came up with, for anyone who might be interested and wants to discuss or provide suggestions. These were just my best initial guesses at metrics that would measure how 'good' or 'useful' a predicted DTM is compared to a ground truth DTM.

Evaluating Predicted DTMs

NOTE: Skip this section if you're not interested in the evaluation metric definitions.

To compare the 'ground truth DTM' with the 'predicted DTM', the script first aligns and/or expands the rasters so that both DEMs share the same grid and shape.
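
A minimal sketch of that alignment step (assuming both DEMs are GeoTIFFs in the same CRS; the predicted DTM is resampled onto the ground-truth grid with rasterio, and the file names are placeholders):

```python
import numpy as np
import rasterio
from rasterio.warp import Resampling, reproject

with rasterio.open("ground_truth_dtm.tif") as gt, rasterio.open("predicted_dtm.tif") as pred:
    gt_z = gt.read(1)
    # Resample the predicted DTM onto the ground-truth grid so both arrays
    # end up with the same shape and cell alignment.
    pred_on_gt_grid = np.full(gt_z.shape, np.nan, dtype="float32")
    reproject(
        source=rasterio.band(pred, 1),
        destination=pred_on_gt_grid,
        dst_transform=gt.transform,
        dst_crs=gt.crs,
        resampling=Resampling.bilinear,
    )
```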

The evaluation metrics produced include the mean absolute error (MAE), root mean squared error (RMSE), maximum error, and the 95th/99th percentile absolute errors (q95ae, q99ae).

Finally, for each evaluation metric, I also re-computed it using the DSM as a baseline. For example, for MAE we treat the DSM as if it were the 'predicted DTM' and calculate its MAE, let's say 4.5m; if our actual predicted DTM has an MAE of 0.9m, then we can consider our 'MAE_relative-dsm' to be 80%.

On this 'relative DSM' scale, every metric is compared against the same metric computed for the DSM using (1 - pred_metric / dsm_metric), giving a scale where 100% means our model is perfect and 0% means our model is no better than just using the DSM.

This relative-to-dsm metric seems helpful because it lets us compare across different point clouds which might have different scales or levels of difficulty.
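
As a tiny sketch of that scaling (using the MAE numbers from the example above):

```python
def relative_to_dsm(pred_metric, dsm_metric):
    # 100% -> perfect DTM, 0% -> no better than just using the DSM as the DTM
    return 1.0 - pred_metric / dsm_metric

print(f"{relative_to_dsm(0.9, 4.5):.0%}")  # 80%, matching the MAE example above
```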

Example metrics for OPC V1.3 run on odm_data_toledo.laz:

DTM Prediction Metrics

DEM Cell Count: 23.0M
Mean Absolute Error: 0.153016m
Root Mean Squared Error: 0.41m
Maximum Error: 13.35m

Prediction relative 'mae' is: 88.52% (model error 0.15m compared to DSM error 1.33m)
Prediction relative 'rmse' is: 89.37% (model error 0.41m compared to DSM error 3.86m)
Prediction relative 'max_error' is: 34.92% (model error 13.35m compared to DSM error 20.52m)
Prediction relative 'q95ae' is: 93.29% (model error 0.76m compared to DSM error 11.29m)
Prediction relative 'q99ae' is: 88.49% (model error 1.91m compared to DSM error 16.60m)

pierotofy commented 7 months ago

I think it might make sense for this to live as a separate effort (at least initially), due to the ODM dependency.

I would recommend publishing the script in a separate repo, then adding instructions on how to run the method to the README here.