Train from STAC DatasetConfig

echeipesh commented 4 years ago

To train a model, a user must provide a DatasetConfig, for example:

https://github.com/azavea/raster-vision/blob/c63b2ce21d20b8055fed4cd6595ab6a807523fdf/rastervision_pytorch_backend/rastervision/pytorch_backend/examples/tiny_spacenet.py#L52-L59

This specifies:

Source of training scenes
Source of validation scenes
Per-scene label file location
Per-scene raster location

In addition, a ClassConfig is required, for example:

https://github.com/azavea/raster-vision/blob/c63b2ce21d20b8055fed4cd6595ab6a807523fdf/rastervision_pytorch_backend/rastervision/pytorch_backend/examples/tiny_spacenet.py#L23-L24

One possible source of the above information is a STAC catalog that uses the label extension.

There should be a library-level feature that conveniently provides a way to train a model from such a catalog with minimum configuration. Perhaps a StacDatasetConfig.

Proposed scheme:

Use STAC collections to group multiple STAC label items
Specify which STAC collection to use for training data
Read ClassConfig from STAC collection JSON
Use each label item to create label source
- Each label item links to either GeoJSON labels or raster labels
Use each label item to create raster source
- Each label item links to source imagery to which labels apply
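
As a rough illustration of the last few steps, here is a minimal sketch that reads one label item as a plain dict. It is not Raster Vision code. The function names are hypothetical, and it assumes the label asset sits under the key the label extension suggests ("labels" for vector, "raster" for raster) and that source imagery is linked with rel="source". Where the `label:classes` list lives in the collection JSON is left open, so the helper just takes the list itself.

```python
from typing import Any, Dict, List


def class_names(label_classes: List[Dict[str, Any]]) -> List[str]:
    """Flatten a `label:classes` list into unique class names, in order."""
    names: List[str] = []
    for entry in label_classes:
        for name in entry.get("classes", []):
            if name not in names:
                names.append(name)
    return names


def scene_from_label_item(item: Dict[str, Any]) -> Dict[str, Any]:
    """Pull out what a scene needs from one STAC label item (hypothetical helper)."""
    label_type = item["properties"].get("label:type")  # "vector" or "raster"
    # Assumption: asset keys follow the label extension's suggested names.
    asset_key = "labels" if label_type == "vector" else "raster"
    label_uri = item["assets"][asset_key]["href"]
    # Links with rel="source" point at the imagery items the labels apply to.
    image_item_uris = [
        link["href"] for link in item.get("links", [])
        if link.get("rel") == "source"
    ]
    return {
        "id": item["id"],
        "label_type": label_type,
        "label_uri": label_uri,
        "image_item_uris": image_item_uris,
    }


# Made-up item, shaped like a label-extension item.
example_item = {
    "id": "label-1",
    "properties": {
        "label:type": "vector",
        "label:classes": [{"name": "class", "classes": ["building", "background"]}],
    },
    "assets": {"labels": {"href": "label/label-1.geojson"}},
    "links": [
        {"rel": "source", "href": "../image/image-1.json"},
        {"rel": "collection", "href": "collection.json"},
    ],
}

scene = scene_from_label_item(example_item)
names = class_names(example_item["properties"]["label:classes"])
```

Note that the source link resolves to an imagery item, not a raster file, so one more hop (reading that item's assets) would be needed to get the raster location.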

lewfish commented 4 years ago

"Specify which STAC collection to use for training data"

To clarify, you mean that there will be a STAC collection for the training scenes, and a separate STAC collection for validation scenes?

lewfish commented 4 years ago

As I mentioned the other day, a simple way to implement this might be to have a to_dataset_config() method that parses the underlying STAC data and returns a standard DatasetConfig that links to standard SceneConfig objects. After that conversion, everything else in Raster Vision can work the same way it currently does.
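
To make the shape of that idea concrete, here is a minimal sketch. The three dataclasses are stand-ins invented for illustration, not the real Raster Vision DatasetConfig or SceneConfig, and it assumes each STAC item has already been resolved to a dict with `id`, `raster_uri`, and `label_uri` keys.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class SceneSketch:
    """Stand-in for a scene config (hypothetical, not the real SceneConfig)."""
    id: str
    raster_uri: str
    label_uri: str


@dataclass
class DatasetSketch:
    """Stand-in for a dataset config (hypothetical, not the real DatasetConfig)."""
    class_names: List[str]
    train_scenes: List[SceneSketch]
    validation_scenes: List[SceneSketch]


@dataclass
class StacDatasetSketch:
    """Holds STAC-derived inputs and converts them to the standard shape."""
    class_names: List[str]
    train_items: List[Dict[str, Any]]
    validation_items: List[Dict[str, Any]]

    @staticmethod
    def _to_scene(item: Dict[str, Any]) -> SceneSketch:
        # Assumption: the item dict was already resolved to concrete URIs.
        return SceneSketch(
            id=item["id"],
            raster_uri=item["raster_uri"],
            label_uri=item["label_uri"],
        )

    def to_dataset_config(self) -> DatasetSketch:
        # All STAC-specific handling ends here; downstream code only ever
        # sees the standard dataset and scene objects.
        return DatasetSketch(
            class_names=list(self.class_names),
            train_scenes=[self._to_scene(i) for i in self.train_items],
            validation_scenes=[self._to_scene(i) for i in self.validation_items],
        )


stac_dataset = StacDatasetSketch(
    class_names=["building", "background"],
    train_items=[{"id": "s1", "raster_uri": "image/1.tif", "label_uri": "label/1.geojson"}],
    validation_items=[{"id": "s2", "raster_uri": "image/2.tif", "label_uri": "label/2.geojson"}],
)
dataset = stac_dataset.to_dataset_config()
```

The point of the design is that the conversion is the only STAC-aware step, so nothing after it has to change.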

echeipesh commented 4 years ago

I started by creating a sample STAC catalog we can train from to think through the data formatting issues.

See https://gist.github.com/echeipesh/2d2a18b59d634ecbfd97b7d32bba6164

Note that the image- and label- prefixes in the file names indicate that the files would live in image/ and label/ subfolders if gists supported subfolders. All the links are built as if those folders exist.

Separating the images and labels into their own collections is good and uncontroversial. How to achieve the training and testing split is still under discussion.

I wanted to avoid using a property that tags each item as belonging to the "testing" or "training" split, since that would imply that only a single split could exist without re-creating the catalog. So far I'm exploring sub-catalogs that point to the items of each set. Raster Vision could discover the items by being pointed to the appropriate sub-catalog and crawling it. Of course, each item still refers to its collection.
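
A minimal sketch of that crawl, using only the standard library. The `fetch` callable, the example URLs, and the in-memory store are all made up for illustration. It assumes catalogs list their members with rel="item" or rel="child" links and that items are GeoJSON Features.

```python
from typing import Any, Callable, Dict, List
from urllib.parse import urljoin


def crawl_items(catalog_href: str,
                fetch: Callable[[str], Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Collect every item reachable from a catalog via child/item links."""
    items: List[Dict[str, Any]] = []
    seen = set()
    stack = [catalog_href]
    while stack:
        href = stack.pop()
        if href in seen:
            continue
        seen.add(href)
        doc = fetch(href)
        if doc.get("type") == "Feature":
            # STAC items are GeoJSON Features; do not follow their links,
            # so the crawl never wanders back up into the collection.
            items.append(doc)
            continue
        for link in doc.get("links", []):
            if link.get("rel") in ("child", "item"):
                # Resolve relative hrefs against the document that holds them.
                stack.append(urljoin(href, link["href"]))
    return items


# Made-up in-memory "catalog" standing in for real HTTP or file reads.
TRAIN = "https://example.com/catalog/train/catalog.json"
store = {
    TRAIN: {
        "type": "Catalog",
        "id": "train",
        "links": [
            {"rel": "item", "href": "../label/a.json"},
            {"rel": "item", "href": "../label/b.json"},
        ],
    },
    "https://example.com/catalog/label/a.json": {
        "type": "Feature", "id": "a",
        "links": [{"rel": "collection", "href": "collection.json"}],
    },
    "https://example.com/catalog/label/b.json": {
        "type": "Feature", "id": "b",
        "links": [{"rel": "collection", "href": "collection.json"}],
    },
}

train_ids = sorted(item["id"] for item in crawl_items(TRAIN, store.__getitem__))
```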

[Image: tiny-spacenet-stac-split]

As you can see, this ends up in a situation where there are two ways to reach each item from the root of the catalog: one path goes through the collection and the other through the training split catalog. I believe this is "OK", but I'm seeking feedback on the idea as a whole (@lewfish).

Edit: After a conversation with Lewis, we decided it would be better if the training/testing catalog split was a parallel top-level catalog that referenced items in the source catalog. This would avoid creating a convention to track multiple splits and would further insulate the source catalog from such changes. Something like:

[Image: top_level_split]

I think the convention and logic for generating this split still need some work. I'll spend some time exploring the options.
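
One possible way to generate such a parallel split, as a sketch only. The function name, the catalog ids, the seeded random shuffle, and the `stac_version` value are all assumptions, not a settled convention. The split catalogs only hold links to items in the source catalog, which stays untouched.

```python
import random
from typing import Any, Dict, List


def make_split_catalogs(item_hrefs: List[str],
                        validation_fraction: float = 0.25,
                        seed: int = 0) -> Dict[str, Dict[str, Any]]:
    """Build train/validation catalogs that link to existing source items."""
    hrefs = sorted(item_hrefs)            # stable starting order
    random.Random(seed).shuffle(hrefs)    # reproducible for a given seed
    n_val = int(len(hrefs) * validation_fraction)

    def catalog(catalog_id: str, members: List[str]) -> Dict[str, Any]:
        return {
            "type": "Catalog",
            "id": catalog_id,
            "stac_version": "1.0.0",  # assumption, adjust to the catalog in use
            "description": f"{catalog_id} split referencing the source catalog",
            "links": [{"rel": "item", "href": h} for h in members],
        }

    return {
        "validation": catalog("validation", hrefs[:n_val]),
        "train": catalog("train", hrefs[n_val:]),
    }


# Made-up hrefs pointing into a source catalog.
source_items = [f"https://example.com/catalog/label/{n}.json" for n in "abcd"]
split = make_split_catalogs(source_items, validation_fraction=0.25, seed=0)
train_hrefs = [link["href"] for link in split["train"]["links"]]
val_hrefs = [link["href"] for link in split["validation"]["links"]]
```

Because the split lives outside the source catalog, a different seed or fraction produces a new split without re-creating or editing the source.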

azavea / raster-vision

Train from STAC DatasetConfig #962