Rename embeddings file to include MGRS code and store GeoTIFF source_url

What I am changing

Improve usability of the GeoParquet embeddings file by:
1. Renaming the file from a generic embeddings_0.gpq to a format like {MGRS}_v{VERSION}.gpq as suggested at https://github.com/Clay-foundation/model/issues/35#issuecomment-1841585520
2. Storing a URL to the source GeoTIFF file used to create the embedding, e.g. s3://.../.../claytile_32VLM_20221119_v02_0200.tif, for better provenance
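
As a rough illustration of the naming scheme, the output filename can be derived from a GeoTIFF URL. This is a minimal sketch, not the code in this PR: the regular expression, the helper name embeddings_filename, and the assumption that the embeddings version is passed in separately from the tile version in the GeoTIFF name are all hypothetical.

```python
import re

# Assumed filename layout: claytile_{MGRS}_{YYYYMMDD}_v{NN}_{chip}.tif
# (inferred from the single example filename above; hypothetical pattern)
CLAYTILE_PATTERN = re.compile(
    r"claytile_(?P<mgrs>\d{2}[A-Z]{3})_(?P<date>\d{8})_v(?P<tilever>\d{2})_(?P<chip>\d{4})\.tif$"
)


def embeddings_filename(source_url: str, version: str = "01") -> str:
    """Build an {MGRS}_v{VERSION}.gpq filename from a GeoTIFF URL."""
    basename = source_url.rsplit("/", 1)[-1]  # keep only the filename part
    match = CLAYTILE_PATTERN.search(basename)
    if match is None:
        raise ValueError(f"Unrecognized GeoTIFF filename: {basename}")
    return f"{match['mgrs']}_v{version}.gpq"


print(embeddings_filename("s3://.../.../claytile_32VLM_20221119_v02_0200.tif"))
```

Note that the version in the output name (v01 in the test run below) is treated here as independent of the v02 in the GeoTIFF filename.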

How I did it

In the LightningDataModule's datapipe, return a source_url for each GeoTIFF file being loaded

In the LightningModule's predict_step, create a source_url column in the geopandas.GeoDataFrame (in addition to the three columns added in #73). A sample table would look like this:

source_url                   date        embeddings            geometry
s3://.../.../claytile_*.tif  2021-01-01  [0.1, 0.4, ... x768]  POLYGON(...)
s3://.../.../claytile_*.tif  2021-06-30  [0.2, 0.5, ... x768]  POLYGON(...)
s3://.../.../claytile_*.tif  2021-12-31  [0.3, 0.6, ... x768]  POLYGON(...)

The source_url column is stored with the string[pyarrow] dtype (which will be the default string dtype in pandas 3.0 per PDEP-10).
Each row stores the embeddings for a single 512x512 chip, and the entire table could realistically store N rows for an entire MGRS tile (10000x10000) across different dates.

TODO in this PR:

[x] Save source_url column to GeoParquet file
[x] Rename embeddings file to a format like {MGRS}_v{VERSION}.gpq
[x] Refactor to allow multiple workers instead of 1 worker

TODO in the future:

[ ] Sort by ascending date, and remove extra index column?
[ ] Improve the logic of the LightningModule's prediction loop to enable appending to an existing MGRS geoparquet file?

How you can test it

Set up credentials to access the AWS S3 bucket at s3://clay-tiles-02/02/

Run the following commands (ideally on an AWS EC2 instance in us-east-1, where the GeoTIFF files are stored):

# Train the model
python trainer.py fit --trainer.max_epochs=10 \
                      --trainer.precision=bf16-mixed \
                      --data.data_path=s3://clay-tiles-02/02/32VLM \
                      --data.num_workers=8
# Generate embeddings GeoParquet file
python trainer.py predict --ckpt_path=checkpoints/last.ckpt \
                          --trainer.precision=bf16-mixed \
                          --data.batch_size=1024 \
                          --data.data_path=s3://clay-tiles-02/02/32VLM \
                          --data.num_workers=0

This should produce a geoparquet file named 32VLM_v01.gpq under the data/embeddings/ folder.
Sample file (unzip first; about 2.9 MB uncompressed): 32VLM_v01.gpq.zip

Extra configuration options can be found using python trainer.py predict --help

To load the embeddings from the geoparquet file:

import geopandas as gpd

geodataframe: gpd.GeoDataFrame = gpd.read_parquet(path="32VLM_v01.gpq")
assert geodataframe.shape == (823, 4)  # 823 rows, 4 columns
print(geodataframe)

    source_url                                          date            embeddings                                          geometry
0   s3://clay-tiles-02/02/32VLM/2017-05-19/claytil...   2017-05-19  [-1.0804343, -1.1861055, 0.2579711, -1.1242834...   POLYGON ((5.46822 60.34364, 5.46324 60.38953, ...
1   s3://clay-tiles-02/02/32VLM/2017-05-19/claytil...   2017-05-19  [-1.081955, -1.1901798, 0.2592258, -1.1241777,...   POLYGON ((5.56081 60.34607, 5.55596 60.39196, ...
2   s3://clay-tiles-02/02/32VLM/2017-05-19/claytil...   2017-05-19  [-1.0853468, -1.1995519, 0.26269174, -1.127272...   POLYGON ((5.65341 60.34844, 5.64870 60.39433, ...
3   s3://clay-tiles-02/02/32VLM/2017-05-19/claytil...   2017-05-19  [-1.0773537, -1.1837404, 0.25767463, -1.119480...   POLYGON ((5.74603 60.35074, 5.74145 60.39663, ...
4   s3://clay-tiles-02/02/32VLM/2017-05-19/claytil...   2017-05-19  [-1.0771247, -1.187013, 0.26040226, -1.124507,...   POLYGON ((5.83867 60.35297, 5.83421 60.39887, ...
... ... ... ... ...
818 s3://clay-tiles-02/02/32VLM/2019-08-27/claytil...   2019-08-27  [-1.0937738, -1.1862404, 0.26832822, -1.123034...   POLYGON ((7.18770 59.45848, 7.18524 59.50443, ...
819 s3://clay-tiles-02/02/32VLM/2019-08-27/claytil...   2019-08-27  [-1.0931807, -1.1811237, 0.26974052, -1.117826...   POLYGON ((7.27798 59.45970, 7.27564 59.50566, ...
820 s3://clay-tiles-02/02/32VLM/2019-08-27/claytil...   2019-08-27  [-1.0908315, -1.1857345, 0.26635545, -1.121208...   POLYGON ((7.36827 59.46086, 7.36605 59.50682, ...
821 s3://clay-tiles-02/02/32VLM/2022-11-19/claytil...   2022-11-19  [-1.0904396, -1.2076643, 0.26954767, -1.134142...   POLYGON ((7.24451 60.10306, 7.24206 60.14901, ...
822 s3://clay-tiles-02/02/32VLM/2022-11-19/claytil...   2022-11-19  [-1.0872881, -1.2177591, 0.27005005, -1.140369...   POLYGON ((6.42729 59.95170, 6.42372 59.99763, ...

823 rows × 4 columns
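
Once loaded, the embeddings column holds one vector per row, so it can be stacked into a single 2-D array for downstream use. The sketch below uses a small stand-in table with random values rather than the real file; stack_embeddings is a hypothetical helper, and the same calls should work on the geodataframe loaded above since a GeoDataFrame is a pandas DataFrame subclass.

```python
import numpy as np
import pandas as pd


def stack_embeddings(table: pd.DataFrame) -> np.ndarray:
    """Stack the per-row embedding vectors into one (n_rows, n_dims) array."""
    return np.stack(list(table["embeddings"]))


# Stand-in table with the same column names as the GeoParquet file (random values)
rng = np.random.default_rng(seed=42)
table = pd.DataFrame(
    {
        "date": pd.to_datetime(["2017-05-19", "2017-05-19", "2019-08-27"]),
        "embeddings": list(rng.normal(size=(3, 768)).astype("float32")),
    }
)

matrix = stack_embeddings(table)
rows_per_date = table.groupby("date").size()  # number of chips per acquisition date

print(matrix.shape)
print(rows_per_date)
```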

Clay-foundation / model

Rename embeddings file to include MGRS code and store GeoTIFF source_url #86
