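What I am changing
Renaming the output file from embeddings_0.gpq to a format like {MGRS}_v{VERSION}.gpq, as suggested at https://github.com/Clay-foundation/model/issues/35#issuecomment-1841585520
Storing a URL to the source GeoTIFF file used to create the embedding, e.g. s3://.../.../claytile_32VLM_20221119_v02_0200.tif, for better provenance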
How I did it
In the LightningDataModule's datapipe, return a source_url for each GeoTIFF file being loaded
In the LightningModule's predict_step, create a source_url column in the geopandas.GeoDataFrame (in addition to the previous three columns done at #73). A sample table would look like this:
| source_url | date | embeddings | geometry |
|---|---|---|---|
| s3://.../.../claytile_*.tif | 2021-01-01 | [0.1, 0.4, ... x768] | POLYGON(...) |
| s3://.../.../claytile_*.tif | 2021-06-30 | [0.2, 0.5, ... x768] | POLYGON(...) |
| s3://.../.../claytile_*.tif | 2021-12-31 | [0.3, 0.6, ... x768] | POLYGON(...) |
The source_url column is stored in the string[pyarrow] format (which will be the default in Pandas 3.0 per PDEP10)
Each row stores the embedding for a single 512x512 chip, and the entire table could realistically hold N rows covering an entire MGRS tile (10000x10000) across different dates.
TODO in this PR:
[x] Save source_url column to GeoParquet file
[x] Rename embeddings file to a format like {MGRS}_{VERSION}.gpq
[x] Refactor to allow multiple workers instead of 1 worker
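As a rough illustration of the renaming step, the output filename could be derived from a source URL like this. The helper name `embeddings_filename`, the regular expression, the `version` argument and the `bucket/path` placeholder URL are all hypothetical; the tile naming scheme is only inferred from the example filename at the top of this description.

```python
import re
from pathlib import PurePosixPath

# Assumed naming scheme: claytile_{MGRS}_{YYYYMMDD}_v{NN}_{NNNN}.tif
TILE_PATTERN = re.compile(
    r"claytile_(?P<mgrs>\d{2}[A-Z]{3})_(?P<date>\d{8})_v\d+_\d+\.tif$"
)


def embeddings_filename(source_url: str, version: str = "v01") -> str:
    """Build an output name like {MGRS}_{VERSION}.gpq from a tile URL."""
    name = PurePosixPath(source_url).name  # keep only the file name
    match = TILE_PATTERN.match(name)
    if match is None:
        raise ValueError(f"Unexpected tile filename: {name}")
    return f"{match['mgrs']}_{version}.gpq"


# Placeholder URL; only the file name part matters here.
print(embeddings_filename("s3://bucket/path/claytile_32VLM_20221119_v02_0200.tif"))
```

Note that `version` here is meant as the version of the embeddings file, which is separate from the `v02` in the tile filename.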
TODO in the future:
[ ] Sort by ascending date, and remove extra index column?
[ ] Improve the logic of the LightningModule's prediction loop to enable appending to an existing MGRS geoparquet file?
How you can test it
Set up credentials to access the AWS S3 bucket at s3://clay-tiles-02/02/
Run the following commands (ideally on an AWS EC2 instance in us-east-1, where the GeoTIFF files are stored):
The output should be a 32VLM_v01.gpq file under the data/embeddings/ folder.
More options can be found using python trainer.py predict --help
To load the embeddings from the geoparquet file:
Related Issues
Follow-up to #73, addresses https://github.com/Clay-foundation/model/issues/35#issuecomment-1841585520