Save embeddings with spatiotemporal metadata to GeoParquet

What I am changing

Store the vector embeddings alongside their spatial bounding box and datetime information in a tabular GeoParquet file, instead of an npy file!

How I did it

In the LightningModule's predict_step, use geopandas to create a GeoDataFrame with three columns: date, embeddings, and geometry. A sample table would look like this:

date        embeddings            geometry
2021-01-01  [0.1, 0.4, ... x768]  POLYGON(...)
2021-06-30  [0.2, 0.5, ... x768]  POLYGON(...)
2021-12-31  [0.3, 0.6, ... x768]  POLYGON(...)

The date is stored in Arrow's date32 format, embeddings are in FixedShapeTensorArray (TODO), and geometry is in WKB (well-known binary).
Each row stores the embedding for a single 256x256 chip, and the entire table could realistically store N rows for an entire MGRS tile (10000x10000) across different dates.

TODO in this PR:

[x] Save embeddings to GeoParquet
[x] Improve docstring

TODO in the future:

[ ] Ensure embeddings are saved as FixedShapeTensorArray? (see https://github.com/Clay-foundation/model/pull/73#discussion_r1419757638)

How you can test it

Locally, download some GeoTIFF data into the data/ folder, and then run:

python trainer.py fit --trainer.max_epochs=10 --trainer.precision=bf16-mixed --data.data_path=data/46REU --data.num_workers=4  # train the model
python trainer.py predict --ckpt_path=checkpoints/last.ckpt --data.batch_size=1024 --trainer.precision=bf16-mixed --data.num_workers=0  # generate embeddings

This should produce an embeddings_0.gpq file under the data/embeddings/ folder.
Sample file (need to unzip, about 3.0MB uncompressed): embeddings_0.gpq.zip

Extra configuration options can be found using python trainer.py predict --help

To load the embeddings from the GeoParquet file:

import geopandas as gpd

geodataframe: gpd.GeoDataFrame = gpd.read_parquet(path="embeddings_0.gpq")
assert geodataframe.shape == (755, 3)
print(geodataframe)

        date            embeddings                                          geometry
0   2022-12-12  [-1.1094263, 1.0212796, -0.58915687, -1.144523...   POLYGON ((93.02647 30.71001, 93.02648 30.73311...
1   2022-12-12  [-1.1253564, 1.0260286, -0.5860151, -1.1528502...   POLYGON ((93.34729 30.70955, 93.34738 30.73265...
2   2022-12-12  [-1.1190275, 1.0268829, -0.59865385, -1.147052...   POLYGON ((93.74777 30.63856, 93.74794 30.66166...
3   2022-12-12  [-1.1115837, 1.0286477, -0.60599935, -1.143061...   POLYGON ((93.80119 30.63824, 93.80138 30.66134...
4   2022-12-12  [-1.1172316, 1.0246403, -0.59833527, -1.143900...   POLYGON ((93.82790 30.63808, 93.82810 30.66118...
... ... ... ...
750 2022-12-12  [-1.11294, 1.0265714, -0.6015097, -1.1443343, ...   POLYGON ((93.40048 30.64010, 93.40057 30.66320...
751 2022-12-12  [-1.1207774, 1.029693, -0.5964609, -1.1490294,...   POLYGON ((93.45391 30.63992, 93.45402 30.66302...
752 2022-12-12  [-1.1309807, 1.0274287, -0.57653224, -1.162805...   POLYGON ((93.58748 30.63939, 93.58762 30.66249...
753 2022-12-12  [-1.1268965, 1.0305986, -0.59025705, -1.154876...   POLYGON ((93.61420 30.63926, 93.61434 30.66236...
754 2022-12-12  [-1.1171025, 1.0268872, -0.60177326, -1.146309...   POLYGON ((93.69434 30.63886, 93.69450 30.66196...

755 rows × 3 columns
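
Once loaded, the table can be handled like any other GeoDataFrame. The sketch below shows two common follow-ups: stacking the per-row embeddings into one 2D array, and selecting chips by bounding box with the .cx indexer. The three-row GeoDataFrame is a made-up stand-in for the result of gpd.read_parquet, and its coordinates and 4-dimensional embeddings are assumptions for illustration.

```python
import geopandas as gpd
import numpy as np
from shapely.geometry import box

# Stand-in for: geodataframe = gpd.read_parquet(path="embeddings_0.gpq")
geodataframe = gpd.GeoDataFrame(
    data={
        "date": ["2022-12-12"] * 3,
        "embeddings": [np.full(4, fill_value=i, dtype="float32") for i in range(3)],
    },
    geometry=[
        box(93.0, 30.6, 93.1, 30.7),
        box(93.3, 30.6, 93.4, 30.7),
        box(93.6, 30.6, 93.7, 30.7),
    ],
    crs="EPSG:4326",  # assumed CRS
)

# Stack the per-row vectors into a (n_chips, embedding_dim) matrix
matrix = np.stack(list(geodataframe["embeddings"]))

# Keep only chips intersecting the box xmin=93.2, xmax=93.5, ymin=30.5, ymax=30.8
subset = geodataframe.cx[93.2:93.5, 30.5:30.8]
print(matrix.shape, len(subset))
```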

With a recent version of QGIS, it's also possible to load the GeoParquet file directly. The screenshot below shows the bounding box locations of the 755 embeddings (1 embedding for each 256x256 chip):

Related Issues

Extends #56, continuation of #66.
