Closed echeipesh closed 1 year ago
- Specify which STAC collection to use for training data
To clarify, you mean that there will be a STAC collection for the training scenes, and a separate STAC collection for validation scenes?
As I mentioned the other day, a simple way to implement this might be to have a `to_dataset_config()` method that parses the underlying STACs and returns a standard `DatasetConfig` that links to standard `SceneConfig` objects. After that conversion happens, everything else in Raster Vision can be implemented the same way as it currently is.
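To make the conversion idea concrete, here is a minimal sketch. `SceneSketch` and `DatasetSketch` are hypothetical stand-ins, not Raster Vision's real `SceneConfig` and `DatasetConfig`, and the asset keys `image` and `labels` are assumptions about how the catalog names its assets. It only relies on the STAC convention that an item stores file locations under `assets.<key>.href`.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SceneSketch:
    """Hypothetical stand-in for a Raster Vision SceneConfig."""
    id: str
    image_uri: str
    label_uri: str


@dataclass
class DatasetSketch:
    """Hypothetical stand-in for a Raster Vision DatasetConfig."""
    train_scenes: List[SceneSketch] = field(default_factory=list)
    validation_scenes: List[SceneSketch] = field(default_factory=list)


def scene_from_pair(image_item: dict, label_item: dict,
                    image_asset: str, label_asset: str) -> SceneSketch:
    # STAC items keep file locations under assets.<key>.href.
    return SceneSketch(
        id=image_item["id"],
        image_uri=image_item["assets"][image_asset]["href"],
        label_uri=label_item["assets"][label_asset]["href"],
    )


def to_dataset_config(train_pairs: List[Tuple[dict, dict]],
                      validation_pairs: List[Tuple[dict, dict]],
                      image_asset: str = "image",    # assumed asset key
                      label_asset: str = "labels",   # assumed asset key
                      ) -> DatasetSketch:
    """Turn (image item, label item) pairs into a plain dataset description."""
    return DatasetSketch(
        train_scenes=[scene_from_pair(i, l, image_asset, label_asset)
                      for i, l in train_pairs],
        validation_scenes=[scene_from_pair(i, l, image_asset, label_asset)
                           for i, l in validation_pairs],
    )
```

Once the STAC items are reduced to plain URIs like this, nothing downstream needs to know that the data came from a catalog.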
I started by creating a sample STAC catalog we can train from to think through the data formatting issues.
See https://gist.github.com/echeipesh/2d2a18b59d634ecbfd97b7d32bba6164
Note that the `image-` and `label-` prefixes in file names indicate that they would be in `image/` and `label/` subfolders if gist supported sub-folders. All the links are built as if those folders exist.
Separating the images and labels into their own collections is good and very uncontroversial. There is some discussion on how to achieve the training and testing split.
I wanted to avoid using a property that would tag each item as belonging to the "testing" or "training" split, since that would imply there can only be a single split unless the catalog is re-created. So far I'm exploring having sub-catalogs that point to items of either set. Raster Vision could discover the items by being pointed to the appropriate sub-catalog and crawling it. Of course, each item still refers to its collection.
As you can see, that ends up with a situation where there are two ways to reach each item from the root of the catalog: one path through the collection and another through the training split catalog. I believe this is "OK", but I'm seeking feedback on the idea as a whole (@lewfish).
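A rough sketch of what crawling could look like, including the fact that an item may be reached by more than one path. The `read_json` callable is a hypothetical loader (file, HTTP, or in-memory), and hrefs are assumed to be already absolute, which a real catalog with relative links would not guarantee.

```python
from typing import Callable, Dict, List


def crawl_items(root_href: str, read_json: Callable[[str], dict]) -> List[dict]:
    """Follow "child" and "item" links from root_href and return each item once."""
    visited = set()               # hrefs already fetched, guards against loops
    items: Dict[str, dict] = {}   # keyed by item id, so duplicates collapse
    stack = [root_href]
    while stack:
        href = stack.pop()
        if href in visited:
            continue
        visited.add(href)
        doc = read_json(href)
        # STAC items are GeoJSON Features; catalogs and collections are not.
        if doc.get("type") == "Feature":
            items[doc["id"]] = doc
            continue
        for link in doc.get("links", []):
            if link.get("rel") in ("child", "item"):
                stack.append(link["href"])
    return list(items.values())
```

Pointing this at the root visits both the collection and the split sub-catalog but still yields each item once; pointing it at a split sub-catalog yields only that split's items.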
Edit: After a conversation with Lewis, we decided it would be better if the training/testing catalog split were a parallel top-level catalog that references items in the source catalog. This would avoid creating a convention to track multiple splits and would further insulate the source catalog from such changes. Something like:
I think the convention and logic for generating this split still need some figuring out; I'll spend some time exploring those options.
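As one option to explore (not something decided in this thread), the split could be derived deterministically by hashing item ids, so that regenerating the split catalogs never requires touching the source catalog. Everything in this sketch is an assumption: the catalog ids `train` and `validation`, the `0.8` fraction, the placeholder `stac_version`, and the `item_hrefs` mapping from item id to its href in the source catalog.

```python
import hashlib
from typing import Dict


def assign_split(item_id: str, train_fraction: float = 0.8) -> str:
    """Map an item id to a split in a stable, order-independent way."""
    digest = hashlib.sha256(item_id.encode("utf-8")).digest()
    # First 4 bytes as a number in [0, 1); the same id always gives the same value.
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return "train" if bucket < train_fraction else "validation"


def build_split_catalogs(item_hrefs: Dict[str, str],
                         train_fraction: float = 0.8) -> Dict[str, dict]:
    """Build two catalog dicts whose "item" links point into the source catalog."""
    catalogs = {
        name: {
            "type": "Catalog",
            "stac_version": "1.0.0",  # placeholder, match the source catalog
            "id": name,
            "description": f"{name} split referencing items in the source catalog",
            "links": [],
        }
        for name in ("train", "validation")
    }
    for item_id in sorted(item_hrefs):
        split = assign_split(item_id, train_fraction)
        catalogs[split]["links"].append({"rel": "item", "href": item_hrefs[item_id]})
    return catalogs
```

A different split is then just a different pair of generated catalogs; the source items carry no split information at all.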
In order to train, a user must provide a `DatasetConfig`, ex: https://github.com/azavea/raster-vision/blob/c63b2ce21d20b8055fed4cd6595ab6a807523fdf/rastervision_pytorch_backend/rastervision/pytorch_backend/examples/tiny_spacenet.py#L52-L59
This specifies:
In addition, a `ClassConfig` is required, ex: https://github.com/azavea/raster-vision/blob/c63b2ce21d20b8055fed4cd6595ab6a807523fdf/rastervision_pytorch_backend/rastervision/pytorch_backend/examples/tiny_spacenet.py#L23-L24
One possible source of the above information is a STAC catalog using the label extension.
There should be a library-level feature that conveniently provides a way to train a model from such a catalog with minimum configuration. Perhaps a `StacDatasetConfig`.
Proposed scheme:
- `ClassConfig` from STAC collection JSON
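A minimal sketch of how class names might be pulled out of STAC JSON, assuming the label extension's `label:classes` field (a list of objects, each with a `classes` array). Where that field lives is an assumption here: the function checks the document's top level, its `properties` (items), and its `summaries` (collections), and takes the first one that has it.

```python
from typing import List


def class_names_from_stac(doc: dict) -> List[str]:
    """Collect unique class names from a label:classes field, preserving order."""
    # Assumed locations of label:classes; the first non-empty one is used.
    candidates = [doc, doc.get("properties", {}), doc.get("summaries", {})]
    for source in candidates:
        label_classes = source.get("label:classes")
        if not label_classes:
            continue
        names: List[str] = []
        for entry in label_classes:
            for cls in entry.get("classes", []):
                name = str(cls)  # classes may be strings or numbers
                if name not in names:
                    names.append(name)
        return names
    return []
```

The resulting list could then feed the `names` of a `ClassConfig`; colors are not part of `label:classes`, so they would still need a default or a user-supplied value.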