Do not predict where there is no data in bounding box

ejm714 commented 1 year ago

For some point/item combos, we have a satellite tile but the bounding box contains only no-data pixels. We should:

[x] drop these rows in training
[x] drop these rows in prediction
[x] have the prediction be NaN for a point if none of its items has data in the bounding box

This is in line with not training on or predicting for samples that have no imagery.

We can identify these as rows in the satellite data where every satellite value is 0.
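A rough sketch of what that filter could look like, assuming the satellite features live in a pandas DataFrame with one row per sample/item pair. The column names (`sample_id`, `item_id`, `B02_mean`, `B03_mean`) and the stand-in "model" are hypothetical and only there for illustration, not taken from the repo:

```python
import numpy as np
import pandas as pd

# Hypothetical satellite feature table: one row per (sample_id, item_id)
features = pd.DataFrame(
    {
        "sample_id": ["a", "a", "b", "c"],
        "item_id": ["i1", "i2", "i3", "i4"],
        "B02_mean": [0.0, 12.5, 0.0, 8.0],
        "B03_mean": [0.0, 20.1, 0.0, 9.5],
    }
)
satellite_cols = ["B02_mean", "B03_mean"]

# A row counts as "no data" when every satellite value is exactly 0
no_data = (features[satellite_cols] == 0).all(axis=1)
valid = features.loc[~no_data]

# Stand-in for model output: one value per sample that still has valid rows
preds = valid.groupby("sample_id")["B02_mean"].mean().rename("prediction")

# Reindex to every requested sample so points with no valid rows get NaN
all_samples = pd.Index(features["sample_id"].unique(), name="sample_id")
preds = preds.reindex(all_samples)
```

Here sample `b` only has an all-zero row, so it is dropped from `valid` and ends up with a NaN prediction after the reindex.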

ejm714 commented 1 year ago

This also means regenerating predictions for the test set that we can use in performance metrics (which can happen as part of the "experiment"). We should just remove the predictions/competition_near_water_550m folder since that will become outdated.

ejm714 commented 1 year ago

Relatedly, it looks like we should also be masking out 0s (since 0 means no data) before we calculate features. Right now, the ranges are artificially inflated whenever there are no-data pixels, since those force the minimum to zero. We can separately keep track of the number or percent of no-data pixels so the model can weigh that info accordingly.

This would get implemented here: https://github.com/drivendataorg/cyanobacteria-prediction/blob/a1b028f297044ba44288113ad8bdff5e79afe865/cyano/data/features.py#L99

import numpy as np
import numpy.ma as ma

# Mask zeros (no data) so they are excluded from the feature statistics
band_arrays[band] = ma.masked_equal(np.load(sample_item_dir / f"{band}.npy"), 0)

https://numpy.org/doc/stable/reference/generated/numpy.ma.masked_equal.html#numpy.ma.masked_equal
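To illustrate the range inflation, here is a small sketch on a made-up band array (the pixel values are invented, and 0 is assumed to be the no-data value):

```python
import numpy as np
import numpy.ma as ma

# Toy band array; 0 is assumed to mean "no data"
band = np.array([[0, 0, 120], [110, 130, 0]])

# Unmasked: the zeros force the minimum to 0 and inflate the range
raw_range = band.max() - band.min()

# Masked: reductions skip the masked (no-data) pixels
masked = ma.masked_equal(band, 0)
masked_range = masked.max() - masked.min()

# A tile with only no-data pixels ends up fully masked
empty = ma.masked_equal(np.zeros((2, 3)), 0)
all_masked = bool(ma.getmaskarray(empty).all())
```

One thing to watch for: reductions on a fully masked array return `ma.masked` rather than a number, so all-zero tiles probably still need to be dropped or handled explicitly before features are calculated.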

It seems like if one band has no data, neither do any of the others. So it seems like we could calculate the percent null from just the first band in config.use_sentinel_bands
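A minimal sketch of that percent calculation, using in-memory toy arrays in place of the saved `.npy` files. `use_sentinel_bands` stands in for `config.use_sentinel_bands`, and `percent_no_data` is a hypothetical helper name:

```python
import numpy as np

# Stand-in for config.use_sentinel_bands
use_sentinel_bands = ["B02", "B03"]

# Toy arrays in place of the per-band .npy files; 0 is assumed to mean no data
band_arrays = {
    "B02": np.array([[0, 0, 120], [110, 130, 0]]),
    "B03": np.array([[0, 0, 90], [95, 80, 0]]),
}


def percent_no_data(arr: np.ndarray) -> float:
    """Percent of pixels equal to 0, treated here as no data."""
    return float((arr == 0).mean() * 100)


first_band = band_arrays[use_sentinel_bands[0]]
pct_null = percent_no_data(first_band)

# Optional check of the assumption that all bands share the same no-data pixels
same_mask = all(
    np.array_equal(band_arrays[b] == 0, first_band == 0)
    for b in use_sentinel_bands
)
```

The `same_mask` check is only there to verify the "if one band has no data, neither do the others" assumption on real data before relying on the first band alone.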

drivendataorg / cyfi

Do not predict where there is no data in bounding box #99