Closed: yellowcap closed this issue 1 month ago
Love this idea. I believe this also restricts us to using COG data on S3 (for speed and windowed reads), but that's an OK cost to bear IMO. We should just make sure we can always fall back to static pre-made batches (important to keep that functionality for fine-tuning and for users who want to continue training, potentially offline).
Thanks for the feedback @brunosan
The fallback would be to run this pipeline but store all tiles permanently. That should be fine for smaller datasets, but then this mechanism may be overkill and the user is better off simply creating chips with their own scripts. In any case, the proposed mechanism could be used to permanently store chips as well.
@yellowcap Thanks for doing this write-up 🚀 , I think it is a good summary of some of the discussions we had. I'll try to carve out some time this week to add tickets for
A few questions/comments about the approach you've described. On the STAC side
For the required STAC metadata I'd suggest including the projection extension. Many published Items already support this, and the proj:shape field would provide an easy mechanism to calculate chipping without accessing the COG's header directly.
Generally, most assets/bands in a STAC Item have the same shape, but this is not a requirement. You'll likely want to handle this case in your chip generation.
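To make the proj:shape idea concrete, here is a minimal sketch of deriving a chip grid from the projection extension alone, without touching the COG header. The chip size and the example item are assumptions, not from the discussion above; per-asset shape differences would be handled by running this per asset rather than per Item.

```python
# Sketch: derive a chip grid from proj:shape (chip size of 256 px is an
# assumption for illustration).
CHIP_SIZE = 256

def chip_grid(proj_shape, chip_size=CHIP_SIZE):
    """Yield (row_off, col_off) for every full chip that fits in the asset."""
    height, width = proj_shape
    for row in range(0, height - chip_size + 1, chip_size):
        for col in range(0, width - chip_size + 1, chip_size):
            yield (row, col)

# Example: a 10980x10980 Sentinel-2 asset advertised via proj:shape.
item_properties = {"proj:shape": [10980, 10980]}
chips = list(chip_grid(item_properties["proj:shape"]))
```

Because proj:shape can differ per asset, the grid would need to be computed against a reference asset (or the coarsest band) when assets disagree.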
Would a batch consist of Items from only a single sensor or a mix of sensors? If the batch is composed of multiple sensors, how will the DataLoader handle mixed asset/band options? Will it need sensor-specific information to determine which assets to combine into a chip?
On the index table side, for optimized performance, we may wish to consider in-lining all of the relevant STAC Item information into columns in a Parquet dataset/files (with the assets dictionary stored in a serialized column) to avoid having to load and parse a very large JSON index file. This may only be valuable when our index size grows to > 1M rows.
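As a rough illustration of the in-lining idea, each index row could carry the frequently-filtered STAC fields as flat columns and keep the assets dict as one serialized column. The item below and the field selection are hypothetical; an actual writer would hand these rows to a Parquet library such as pyarrow.

```python
import json

# Sketch: in-line key STAC Item fields as flat columns and keep the assets
# dictionary as a single serialized JSON column. Rows in this shape could be
# written out with e.g. pyarrow.Table.from_pylist + pyarrow.parquet.write_table.
def index_row(item):
    return {
        "id": item["id"],
        "datetime": item["properties"]["datetime"],
        "bbox": item["bbox"],
        "assets": json.dumps(item["assets"]),  # parsed lazily, per row
    }

# Hypothetical STAC Item for illustration.
item = {
    "id": "S2A_T32UNE_20230601",
    "bbox": [8.9, 53.8, 10.1, 54.9],
    "properties": {"datetime": "2023-06-01T10:30:00Z"},
    "assets": {"B04": {"href": "s3://bucket/B04.tif"}},
}
row = index_row(item)
```

The payoff is that filters on datetime or bbox can use Parquet column statistics, while the assets JSON is only parsed for the rows that survive the filter.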
On the temporary data loading side, the depth of my PyTorch knowledge is fairly limited, but we will also need to include hooks within the training loop to publish an external notification when a batch is completed, so that our infrastructure can remove the assets associated with that batch from fast temporary storage.
We will likely also need some mechanism to "peek" at the next batch for the DataLoader's iterator so that it can be preloaded to disk. (This would achieve our temporary storage "streaming" goals and allow us to use our fast disk capacity efficiently.) Hopefully once I get a bit more understanding on the PyTorch side I can provide some more concrete ideas on this 😄
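The peek-and-evict idea can be sketched without PyTorch specifics: a background thread stages the next batch to fast storage while the current one trains, and a callback fires when a batch has been consumed so the infrastructure can evict it. All names here are hypothetical, not actual DataLoader hooks.

```python
import queue
import threading

# Sketch: one-batch lookahead with an eviction callback. `stage` would
# download a batch's assets to fast local storage; `on_done` would publish
# the "batch consumed" notification so temporary storage can be reclaimed.
class PrefetchingLoader:
    def __init__(self, batches, stage, on_done):
        self.batches = iter(batches)
        self.stage = stage
        self.on_done = on_done
        self.queue = queue.Queue(maxsize=1)  # holds exactly one pre-staged batch
        threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self):
        for batch in self.batches:
            self.queue.put(self.stage(batch))  # blocks until consumer catches up
        self.queue.put(None)  # sentinel: no more batches

    def __iter__(self):
        while (batch := self.queue.get()) is not None:
            yield batch
            self.on_done(batch)  # safe to evict this batch's assets now

done = []
loader = PrefetchingLoader([1, 2, 3], stage=lambda b: b * 10, on_done=done.append)
result = list(loader)
```

The bounded queue is what gives the "peek one ahead" behavior: staging of batch N+1 overlaps with training on batch N, but never runs further ahead than the fast disk can hold.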
In a very timely post, @jhamman et al. at Earthmover seem to have managed to stream Zarr data directly to the GPU. https://earthmover.io/blog/cloud-native-dataloader/
Here is a proposal for where to get / store the metadata fields that are required by the Clay v1 model in STAC. Everything except the normalization parameters is either native STAC or from the eo and raster extensions.
Each of these values would be required for each band. The assumption is that each band eventually becomes its own "token" for the model. Some things like bbox or date could be stored at the item level instead, and broadcast to the bands at chip extraction time.
| Property | STAC Item Reference | Extension |
|---|---|---|
| Bounding box | bbox | - |
| Date | properties:datetime | - |
| Ground Sampling Distance | raster:bands:spatial_resolution | raster |
| Wavelength | eo:bands:center_wavelength | eo |
| Bandwidth | eo:bands:full_width_half_max | eo |
| Normalization Average | properties:normalization_average | custom |
| Normalization Standard Deviation | properties:normalization_standard_deviation | custom |
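The table above can be turned into per-band records by broadcasting the item-level fields (bbox, date) to each band at chip extraction time. The item below is hypothetical, and the eo/raster band lists are collapsed into a single list for brevity; in real STAC metadata they are separate arrays that would be zipped together.

```python
# Sketch: broadcast item-level fields to per-band records, following the
# metadata table above. Band values here are made up for illustration.
def band_records(item):
    shared = {
        "bbox": item["bbox"],
        "datetime": item["properties"]["datetime"],
    }
    return [
        {
            **shared,
            "wavelength": band["center_wavelength"],
            "bandwidth": band["full_width_half_max"],
            "gsd": band["spatial_resolution"],
            "mean": band["normalization_average"],
            "std": band["normalization_standard_deviation"],
        }
        for band in item["properties"]["bands"]
    ]

item = {
    "bbox": [8.9, 53.8, 10.1, 54.9],
    "properties": {
        "datetime": "2023-06-01T10:30:00Z",
        "bands": [
            {
                "center_wavelength": 0.665,
                "full_width_half_max": 0.038,
                "spatial_resolution": 10,
                "normalization_average": 0.13,
                "normalization_standard_deviation": 0.09,
            },
        ],
    },
}
records = band_records(item)
```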
@yellowcap I'm writing up a slightly modified version of this approach based on the recent demonstration from the EarthMover team https://github.com/earth-mover/dataloader-demo. I hadn't even considered Dask's in-memory opportunistic caching as a fast intermediate store for pre-fetched chips to maintain high GPU saturation but this is actually very well aligned with our requirements.
One question, which will drive the partitioning/row-group structure of our chip index Parquet file and how we perform sampling for shuffling, is: what level of reproducibility do we need for batches?
With this approach we'll use Dask DataFrame's sample method. We have some options for ensuring that our batch partitioning and shuffling are fully reproducible, but I'll need to consider this in our pipeline architecture. Do we need fully reproducible batch generation in this case?
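The reproducibility question reduces to whether sampling is seeded: Dask's DataFrame.sample accepts a random_state for exactly this reason. A dependency-free sketch of the same idea (names and sizes are made up):

```python
import random

# Sketch: seeded sampling makes batch composition identical across runs;
# an unseeded sampler would not be reproducible.
def sample_batch(chip_ids, batch_size, seed):
    return random.Random(seed).sample(chip_ids, batch_size)

chips = list(range(1000))
a = sample_batch(chips, 32, seed=42)
b = sample_batch(chips, 32, seed=42)
```

For distributed training the seed would additionally need to be a function of (epoch, node rank) so that nodes draw disjoint but reproducible samples.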
Hmm, not 100% sure here. It depends on how we distribute batches across GPU nodes. Maybe we need to reproduce the sample on each node and then select from it, or the sample is created centrally and then pushed to the nodes; in that case reproducibility is less important. But I'm not sure about this, maybe @srmsoumya knows more...
There are two more issues that we have to take into account:
I have been working more on the v1 pipeline and realized that we will need cloud and nodata tracking at the chip level. I played around with creating classes where platform-specific cloud and nodata filters can be added through class-method overrides.
The current code is in the PR linked above, but here are the specific pointers:
https://github.com/Clay-foundation/model/blob/clay-v1-pipeline/scripts/pipeline_v1/chip_indexer.py
The script to try them out with the data attached below is
The latest idea is to make copies of the scenes we want to use into a single bucket that can be hooked directly into FSx. Storing 100 TB permanently would cost about $14k per month, which is still acceptable given the credits we have.
If we put the scene files, the STAC items, and a central index in a bucket, then FSx can sync that stuff and we can expose it to our training nodes.
The data loader could then do dynamic chipping from the scenes based on the index. This should be quite fast.
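As a sketch of how dynamic chipping from the index could work, an index row would map grid coordinates to a pixel window for a windowed read. Field names and the chip size are assumptions; with rasterio the window tuple would become a Window passed to src.read.

```python
# Sketch: turn a chip index row into a windowed-read request
# (rasterio-style window shown as a plain tuple; row fields hypothetical).
CHIP_SIZE = 256

def chip_window(row):
    """Map (chip_x, chip_y) grid coordinates to (col_off, row_off, width, height)."""
    col_off = row["chip_x"] * CHIP_SIZE
    row_off = row["chip_y"] * CHIP_SIZE
    return (col_off, row_off, CHIP_SIZE, CHIP_SIZE)

row = {"scene": "s3://bucket/scene.tif", "chip_x": 3, "chip_y": 7}
window = chip_window(row)

# With rasterio this would become something like:
# with rasterio.open(row["scene"]) as src:
#     data = src.read(window=Window(*window))
```

Because the scenes live on FSx-backed storage, these windowed reads hit fast local disk rather than S3, which is what should make on-the-fly chipping quick.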
If we move forward with this approach now, we gain time to do the "megabatch" part where we handle batches of scenes. I feel like if we do dynamic chipping and indexing of the chips, that should integrate well with a more batched streaming approach. What do you think @sharkinsspatial?
Splitting the indexing and chipping functionality into a separate repo, as this can be done quite universally I think. https://github.com/Clay-foundation/stacchip
Got a first full version of the new v1 pipeline for Sentinel-2, see
https://github.com/Clay-foundation/stacchip/blob/main/scripts/sentinel-2-processor.py
This will
If we let this run through all 2500 MGRS tiles we have previously sampled, we should have all the S2 data we want for v1. We can test this at scale this week and then replicate it for other sources!
Results from a single MGRS tile and one date are here
Ok, this is ready for a first serious test drive tomorrow. The index file is written properly to GeoParquet, and both the STAC item and the assets are created as desired. I'll push this to Batch tomorrow; if things go smoothly we will produce 2500 MGRS tiles for 3 years with 4 seasons each, i.e. 30,000 Sentinel-2 scenes. That should be a good start!
Screenshot of a visualization of the GeoParquet index using lonboard
Closing this as we have moved forward as documented in https://clay-foundation.github.io/model/data_sampling.html
This is a proposal for a streaming approach that relies on static STAC Indexes and dynamic tiling.
tl;dr
Tracking scenes
The first step is creating a list of scenes from multiple sources, like MODIS, Landsat, Sentinel-2, NAIP, and drones, handling each source separately, one by one.
Track chips in each scene
Track how many chips are in each scene
The index table
We would then store all the static STAC items as JSON files in a central location and index all the chips into a single table.
The pipeline would then do random sampling into batches in two steps:
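The two steps aren't enumerated in this summary, so the following is an assumed, illustrative scheme rather than the proposal's actual design: first sample scenes from the index, then sample chips within each selected scene.

```python
import random

# Hypothetical sketch of two-step batch sampling: 1) pick scenes,
# 2) pick chips within each picked scene. All names are illustrative.
def make_batch(index, scenes_per_batch, chips_per_scene, seed):
    rng = random.Random(seed)
    scenes = rng.sample(sorted(index), scenes_per_batch)
    batch = []
    for scene in scenes:
        batch.extend((scene, chip) for chip in rng.sample(index[scene], chips_per_scene))
    return batch

# Toy index: scene id -> list of chip ids.
index = {
    "sceneA": list(range(100)),
    "sceneB": list(range(100)),
    "sceneC": list(range(100)),
}
batch = make_batch(index, scenes_per_batch=2, chips_per_scene=4, seed=0)
```

Sampling scenes first keeps reads clustered per scene, which matters when chips are cut dynamically from large files on shared storage.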
Considerations:
Data Flow
Considerations