Clay-foundation / model

The Clay Foundation Model (in development)
https://clay-foundation.github.io/model/
Apache License 2.0

Evaluate use of mosaicml-streaming for data pipeline #165

Closed by yellowcap 4 months ago

yellowcap commented 5 months ago

Streaming is a library for very large-scale, multi-node data pipelines that is fully integrated with PyTorch.

We should evaluate this library: for v1, the scale of the data will no longer allow the previous approach of downloading all training data to block storage.

yellowcap commented 5 months ago

I was able to generate mosaicml-streaming MDS files from a sample of our current data. I think I understand how the library works, and I think we can update the pipeline to output MDS files instead of TIFF files.

We can generate one set of MDS shards with an index for each MGRS tile. Then we can use the merge_index function to combine those into one main index that the dataloader can use.
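To make the per-tile idea concrete, here is a minimal sketch of what combining the tile indexes amounts to. It uses only the standard library and is not the library's own `merge_index` implementation. The function name `merge_tile_indexes`, the directory layout (one subdirectory per MGRS tile, each with its own `index.json`), and the assumed index structure (a `shards` list whose entries reference files through a `basename` field) are assumptions for illustration.

```python
import json
from pathlib import Path

# Keys under which a shard entry is assumed to reference its files.
FILE_KEYS = ("raw_data", "zip_data", "raw_meta", "zip_meta")


def merge_tile_indexes(root: Path) -> Path:
    """Combine every <root>/<tile>/index.json into one <root>/index.json.

    Hypothetical helper: assumes each per-tile index is a JSON object with
    a "shards" list, as described in the lead-in.
    """
    merged = []
    for index_path in sorted(root.glob("*/index.json")):
        tile_dir = index_path.parent.name
        index = json.loads(index_path.read_text())
        for shard in index["shards"]:
            for key in FILE_KEYS:
                if shard.get(key):
                    # Rewrite the path so it is relative to the merged index.
                    basename = shard[key]["basename"]
                    shard[key]["basename"] = f"{tile_dir}/{basename}"
            merged.append(shard)
    out = root / "index.json"
    out.write_text(json.dumps({"version": 2, "shards": merged}))
    return out
```

In the real pipeline the library's merge_index function would presumably take the place of this helper. The sketch only shows why per-tile shard sets can be written independently and joined afterwards.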

So I propose to go ahead and use this for the v0.2 run as a testbed for v1.

yellowcap commented 4 months ago

Initial tests have not resulted in speed improvements. Also, we are no longer planning to create prefabricated tiles, but will adopt a streaming approach instead. A prefabricated MDS dataset of this kind is not usable in the dynamic chipping scenario.