Closed ejm714 closed 1 year ago
This also means regenerating predictions for the test set that we can use in performance metrics (which can happen as part of the "experiment"). We should just remove the predictions/competition_near_water_550m folder, since it will become outdated.
Relatedly, it looks like we should also be masking out 0s (since 0 means no data) before we calculate features. Right now, the ranges are artificially inflated whenever a bounding box contains no-data pixels, because those pixels force the minimum to zero. We can separately keep track of the number or percent of no-data pixels so the model can weigh that info accordingly.
This would get implemented here: https://github.com/drivendataorg/cyanobacteria-prediction/blob/a1b028f297044ba44288113ad8bdff5e79afe865/cyano/data/features.py#L99
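import numpy as np
import numpy.ma as ma
# Mask 0-valued (no-data) pixels so they are excluded from the feature statistics
band_arrays[band] = ma.masked_equal(np.load(sample_item_dir / f"{band}.npy"), 0)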
https://numpy.org/doc/stable/reference/generated/numpy.ma.masked_equal.html#numpy.ma.masked_equal
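To illustrate why the masking matters, here is a minimal sketch using a made-up 3x3 band array (the values are hypothetical, not real Sentinel data). It compares the range computed with and without masking zeros.

```python
import numpy as np
import numpy.ma as ma

# Hypothetical band values; 0 marks no-data pixels
band = np.array(
    [
        [0, 0, 1200],
        [1100, 1300, 0],
        [1250, 1150, 1400],
    ]
)

# Without masking, the no-data zeros become the minimum
unmasked_range = band.max() - band.min()

# With masking, statistics only consider the valid pixels
masked = ma.masked_equal(band, 0)
masked_range = masked.max() - masked.min()

print(unmasked_range, masked_range)
```

One thing to watch for: if every pixel is masked, reductions like `masked.max()` return the `ma.masked` constant rather than a number, so fully no-data arrays likely need separate handling.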
It seems like if one band has no data for a pixel, none of the other bands do either. So we could calculate the percent of no-data pixels from just the first band in config.use_sentinel_bands.
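A rough sketch of the percent no-data calculation, assuming the band arrays have already been masked with `ma.masked_equal` as above. The band names, array values, and the `no_data_fraction` helper are hypothetical stand-ins, not names from the repo.

```python
import numpy as np
import numpy.ma as ma


def no_data_fraction(masked_band: ma.MaskedArray) -> float:
    """Return the fraction of pixels that are masked (no data)."""
    # getmaskarray always returns a full boolean array, even if nothing is masked
    return float(ma.getmaskarray(masked_band).mean())


# Hypothetical stand-ins for config.use_sentinel_bands and the loaded arrays
use_sentinel_bands = ["B02", "B03"]
band_arrays = {
    "B02": ma.masked_equal(np.array([[0, 0], [500, 600]]), 0),
    "B03": ma.masked_equal(np.array([[0, 0], [700, 800]]), 0),
}

# Assumes no-data pixels coincide across bands, so the first band is enough
first_band = use_sentinel_bands[0]
pct_no_data = no_data_fraction(band_arrays[first_band])
print(pct_no_data)
```

If the assumption that bands share the same no-data pixels turns out not to hold, the same helper could be applied per band instead.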
For some point/item combos, we have a satellite tile, but the bounding box contains only no-data pixels. We should:
This is in line with not using or predicting on samples for which there is no imagery.
We can identify these as rows in the satellite data where every satellite value is 0.
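One possible way to flag those rows, sketched with pandas. The DataFrame layout, column names, and sample IDs are hypothetical; the actual satellite feature table may be structured differently.

```python
import pandas as pd

# Hypothetical satellite feature table, one row per sample
satellite_features = pd.DataFrame(
    {
        "B02_mean": [0.0, 512.3, 0.0],
        "B03_mean": [0.0, 640.1, 10.5],
    },
    index=["sample_a", "sample_b", "sample_c"],
)

# A row is treated as "no imagery" only if every satellite value is 0
all_zero = (satellite_features == 0).all(axis=1)

no_data_samples = satellite_features.index[all_zero].tolist()
usable_features = satellite_features[~all_zero]
print(no_data_samples)
```

Here `sample_c` is kept because only one of its values is 0; only rows that are 0 across all satellite columns get dropped.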