azavea / raster-vision

An open source library and framework for deep learning on satellite and aerial imagery.
https://docs.rastervision.io

Train from STAC DatasetConfig #962

Closed echeipesh closed 1 year ago

echeipesh commented 4 years ago

In order to train, a user must provide a DatasetConfig, for example:

https://github.com/azavea/raster-vision/blob/c63b2ce21d20b8055fed4cd6595ab6a807523fdf/rastervision_pytorch_backend/rastervision/pytorch_backend/examples/tiny_spacenet.py#L52-L59

This specifies:

In addition, a ClassConfig is required, for example:

https://github.com/azavea/raster-vision/blob/c63b2ce21d20b8055fed4cd6595ab6a807523fdf/rastervision_pytorch_backend/rastervision/pytorch_backend/examples/tiny_spacenet.py#L23-L24

One possible source of the above information is a STAC catalog that uses the label extension.
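As a rough illustration of how a label-extension item could supply this information, here is a minimal sketch. It assumes the item is already loaded as a plain dict; the item itself is hand-written and incomplete (the id, hrefs and class names are made up for the example), and the helper names `class_names` and `source_hrefs` are hypothetical, not part of Raster Vision or any STAC library.

```python
from typing import List

# A hand-written, deliberately incomplete label item (illustrative values only).
label_item = {
    "type": "Feature",
    "id": "label-1",
    "properties": {
        "label:type": "vector",
        "label:tasks": ["segmentation"],
        # Each entry names a property and the class values it can take.
        "label:classes": [{"name": "class", "classes": ["building", "background"]}],
    },
    # "source" links point from a label item to the imagery it annotates.
    "links": [{"rel": "source", "href": "../image/image-1.json"}],
    "assets": {"labels": {"href": "./label-1.geojson"}},
}


def class_names(item: dict) -> List[str]:
    """Collect class names from label:classes, preserving first-seen order."""
    names: List[str] = []
    for entry in item["properties"].get("label:classes", []):
        for name in entry["classes"]:
            if name not in names:
                names.append(name)
    return names


def source_hrefs(item: dict) -> List[str]:
    """Return the hrefs of the imagery items this label item refers to."""
    return [link["href"] for link in item["links"] if link["rel"] == "source"]


print(class_names(label_item))
print(source_hrefs(label_item))
```

The class names would feed something like a ClassConfig, and the source links would pair each label with its imagery.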

There should be a library-level feature that conveniently provides a way to train a model from such a catalog with minimum configuration. Perhaps a StacDatasetConfig.

Proposed scheme:

lewfish commented 4 years ago
  • Specify which STAC collection to use for training data

To clarify, you mean that there will be a STAC collection for the training scenes, and a separate STAC collection for validation scenes?

lewfish commented 4 years ago

As I mentioned the other day, a simple way to implement this might be to have a to_dataset_config() method that parses the underlying STACs and returns a standard DatasetConfig that links to standard SceneConfig objects. After that conversion happens, everything else in Raster Vision can work the same way it currently does.
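A minimal sketch of that conversion idea, under loud assumptions: `SceneSketch` and `DatasetSketch` are stand-in dataclasses, not the real SceneConfig and DatasetConfig, and `StacDatasetSketch` is a hypothetical name. The label items are assumed to be already parsed into dicts, each with one "source" link and a "labels" asset.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SceneSketch:
    """Stand-in for a SceneConfig; field names are illustrative."""
    id: str
    image_item_href: str  # href of the source imagery item, not the raster itself
    label_uri: str


@dataclass
class DatasetSketch:
    """Stand-in for a DatasetConfig."""
    class_names: List[str]
    train_scenes: List[SceneSketch]
    validation_scenes: List[SceneSketch]


@dataclass
class StacDatasetSketch:
    """Hypothetical STAC-backed config holding already-parsed label items."""
    train_items: List[dict]
    validation_items: List[dict]

    def _to_scene(self, item: dict) -> SceneSketch:
        # Assumes exactly one "source" link and a "labels" asset per item.
        source = next(l["href"] for l in item["links"] if l["rel"] == "source")
        return SceneSketch(item["id"], source, item["assets"]["labels"]["href"])

    def to_dataset_config(self) -> DatasetSketch:
        names: List[str] = []
        for item in self.train_items + self.validation_items:
            for entry in item["properties"].get("label:classes", []):
                for name in entry["classes"]:
                    if name not in names:
                        names.append(name)
        return DatasetSketch(
            class_names=names,
            train_scenes=[self._to_scene(i) for i in self.train_items],
            validation_scenes=[self._to_scene(i) for i in self.validation_items],
        )


def make_item(n: int) -> dict:
    """Build a tiny made-up label item for the example."""
    return {
        "id": f"label-{n}",
        "properties": {"label:classes": [{"name": "class", "classes": ["building"]}]},
        "links": [{"rel": "source", "href": f"image/image-{n}.json"}],
        "assets": {"labels": {"href": f"label/label-{n}.geojson"}},
    }


dataset = StacDatasetSketch([make_item(1), make_item(2)], [make_item(3)]).to_dataset_config()
print(dataset.class_names, len(dataset.train_scenes), len(dataset.validation_scenes))
```

A real version would also need to fetch each source item to find the actual imagery asset; that step is left out here.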

echeipesh commented 4 years ago

I started by creating a sample STAC catalog we can train from to think through the data formatting issues.

See https://gist.github.com/echeipesh/2d2a18b59d634ecbfd97b7d32bba6164

Note that the image- and label- prefixes in the file names indicate that the files would be in image/ and label/ subfolders if gists supported subfolders. All the links are built as if those folders exist.

Separating the images and labels into their own collections is good and very uncontroversial. There is some discussion on how to achieve the training and testing split.

I wanted to avoid using a property that tags each item as belonging to the "testing" or "training" split, since that would imply there can be only a single split unless the catalog is re-created. So far I'm exploring sub-catalogs that point to the items of either set. Raster Vision could discover the items by being pointed to the appropriate sub-catalog and crawling it. Of course, each item still refers to its collection.

[Image: tiny-spacenet-stac-split]

As you can see, this ends up in a situation where there are two ways to reach each item from the root of the catalog: one path goes through the collection and the other through the training split catalog. I believe this is "OK", but I'm seeking feedback on the idea as a whole (@lewfish).
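To make the crawling idea concrete, here is a small sketch of discovering items from a split sub-catalog. It assumes catalogs link to their members with "child" and "item" links, that hrefs are already resolved to keys usable by a `fetch` callable, and that items carry `"type": "Feature"`. The `crawl_items` name and the in-memory `DOCS` store are made up for the example.

```python
from typing import Callable, List


def crawl_items(href: str, fetch: Callable[[str], dict]) -> List[dict]:
    """Return every item reachable from href via "child" and "item" links."""
    doc = fetch(href)
    if doc.get("type") == "Feature":
        return [doc]
    items: List[dict] = []
    for link in doc.get("links", []):
        # "collection", "parent" and "root" links are ignored, so the second
        # path to each item (through its collection) does not cause a loop.
        if link["rel"] in ("child", "item"):
            items.extend(crawl_items(link["href"], fetch))
    return items


# In-memory stand-in for a catalog on disk or over HTTP (illustrative hrefs).
DOCS = {
    "split/training/catalog.json": {
        "id": "training",
        "links": [
            {"rel": "item", "href": "label/label-1.json"},
            {"rel": "item", "href": "label/label-2.json"},
        ],
    },
    "label/label-1.json": {
        "type": "Feature",
        "id": "label-1",
        "links": [{"rel": "collection", "href": "label/collection.json"}],
    },
    "label/label-2.json": {
        "type": "Feature",
        "id": "label-2",
        "links": [{"rel": "collection", "href": "label/collection.json"}],
    },
}

training_ids = [item["id"] for item in crawl_items("split/training/catalog.json", DOCS.__getitem__)]
print(training_ids)
```

Because only "child" and "item" links are followed, each item is reached once from the split catalog even though it still points back to its collection.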

Edit: After a conversation with Lewis, we decided it would be better if the training/testing catalog split was a parallel top-level catalog that referenced items in the source catalog. This would avoid creating a convention to track multiple splits and would further insulate the source catalog from such changes. Something like:

[Image: top_level_split]
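One way such a parallel split catalog could be generated is sketched below. This is only an assumption about the layout: the function name `build_split_catalogs`, the catalog ids, the relative hrefs and the seeded random split are all illustrative, and the dicts omit fields a valid STAC catalog would need (such as the STAC version).

```python
import random
from typing import Dict, Iterable, List


def build_split_catalogs(
    item_hrefs: Iterable[str], train_fraction: float = 0.8, seed: int = 0
) -> Dict[str, dict]:
    """Build a root catalog plus training/testing catalogs that only link to items."""
    hrefs = sorted(item_hrefs)
    random.Random(seed).shuffle(hrefs)  # seeded so the split is reproducible
    cut = int(len(hrefs) * train_fraction)

    def catalog(cat_id: str, members: List[str]) -> dict:
        return {
            "id": cat_id,
            "description": f"{cat_id} split referencing items in the source catalog",
            "links": [{"rel": "item", "href": h} for h in members],
        }

    root = {
        "id": "split",
        "description": "Parallel catalog holding one training/testing split",
        "links": [
            {"rel": "child", "href": "./training/catalog.json"},
            {"rel": "child", "href": "./testing/catalog.json"},
        ],
    }
    return {
        "root": root,
        "training": catalog("training", hrefs[:cut]),
        "testing": catalog("testing", hrefs[cut:]),
    }


source_items = [f"../source/label/label-{n}.json" for n in range(1, 5)]
split = build_split_catalogs(source_items, train_fraction=0.5, seed=42)
print(len(split["training"]["links"]), len(split["testing"]["links"]))
```

Since the source catalog is never modified, a different split is just another catalog generated with a different seed or fraction.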

I think the convention and logic for generating this split still need some figuring out. I'll spend some time exploring those options.