drivendataorg / cyfi

Estimate cyanobacteria density based on Sentinel-2 satellite imagery
https://cyfi.drivendata.org/
MIT License

Do not predict where there is no data in bounding box #99

Closed ejm714 closed 1 year ago

ejm714 commented 1 year ago

For some point/item combinations, we have a satellite tile, but the bounding box contains only no-data pixels. We should:

This is in line with not using or predicting on samples for which there is no imagery.

We can identify these as rows in the satellite data where every band value is 0.
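A minimal sketch of that check, assuming the satellite data is a pandas DataFrame with one column per band. The column names and the `drop_all_nodata_rows` helper are hypothetical and not taken from the repo:

```python
import pandas as pd

# Hypothetical band columns; the real feature table may use different names.
BAND_COLS = ["B02", "B03", "B04"]


def drop_all_nodata_rows(satellite_df, band_cols=BAND_COLS):
    """Return only the rows that have at least one non-zero band value."""
    # A row counts as "no data" only if every band column is exactly 0.
    all_zero = (satellite_df[band_cols] == 0).all(axis=1)
    return satellite_df.loc[~all_zero]


example = pd.DataFrame(
    {
        "sample_id": ["a", "b", "c"],
        "B02": [0, 120, 0],
        "B03": [0, 340, 15],
        "B04": [0, 560, 0],
    }
)

# Sample "a" is dropped; "c" is kept because one band is non-zero.
kept = drop_all_nodata_rows(example)
```

Rows that are only partially zero are kept here, since the issue only targets bounding boxes with no valid pixels at all.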

ejm714 commented 1 year ago

This also means regenerating the test-set predictions that we use for performance metrics (which can happen as part of the "experiment"). We should just remove the predictions/competition_near_water_550m folder, since it will become outdated.

ejm714 commented 1 year ago

Relatedly, it looks like we should also mask out zeros (which mean no data) before we calculate features. Right now, ranges are artificially inflated whenever there are no-data pixels, since those pixels force the minimum to zero. We can separately track the number or percent of no-data pixels so the model can weigh that information accordingly.

This would get implemented here: https://github.com/drivendataorg/cyanobacteria-prediction/blob/a1b028f297044ba44288113ad8bdff5e79afe865/cyano/data/features.py#L99

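import numpy as np
import numpy.ma as ma

# Mask zero-valued (no data) pixels so they are ignored when features are computed
band_arrays[band] = ma.masked_equal(np.load(sample_item_dir / f"{band}.npy"), 0)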

https://numpy.org/doc/stable/reference/generated/numpy.ma.masked_equal.html#numpy.ma.masked_equal
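To illustrate why masking matters for range-style features, here is a small sketch on a made-up 3x3 band array. The pixel values are invented for illustration and the feature names are not the ones used in features.py:

```python
import numpy as np
import numpy.ma as ma

# Made-up band values; 0 marks no-data pixels.
band = np.array(
    [
        [0, 0, 410],
        [0, 395, 430],
        [402, 418, 425],
    ]
)

# Mask the zeros so that min/max only consider valid pixels.
masked = ma.masked_equal(band, 0)

# Without masking, the no-data zeros become the minimum.
raw_range = int(band.max() - band.min())

# With masking, the minimum is the smallest valid pixel (395).
masked_range = int(masked.max() - masked.min())
```

Here `raw_range` is 430 while `masked_range` is 35. If every pixel is masked, the masked reductions return `numpy.ma.masked` rather than a number, so fully empty bounding boxes would still need to be filtered out beforehand.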

It seems that if one band has no data, neither do any of the others. So we could calculate the percent of no-data pixels from just the first band in config.use_sentinel_bands.
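A rough sketch of that calculation, assuming `band_arrays` maps band names to raw (unmasked) NumPy arrays. The `percent_no_data` helper and the `use_sentinel_bands` list below are hypothetical stand-ins for the repo's config:

```python
import numpy as np


def percent_no_data(band_array):
    """Percent of pixels equal to 0, which are treated as no data."""
    return 100.0 * np.count_nonzero(band_array == 0) / band_array.size


# Stand-in for config.use_sentinel_bands; the real list may differ.
use_sentinel_bands = ["B02", "B03"]

# Made-up arrays where the same pixels are no data in both bands.
band_arrays = {
    "B02": np.array([[0, 0], [120, 130]]),
    "B03": np.array([[0, 0], [340, 350]]),
}

# Only the first band is inspected, per the observation above.
pct = percent_no_data(band_arrays[use_sentinel_bands[0]])
```

This relies on the observation that no-data pixels coincide across bands. If that ever stops holding, the percent would need to be computed per band instead.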