cloudnativegeo / geo-embeddings-survey

A survey of use cases and current data schemas for vector embeddings in GeoParquet
Apache License 2.0

Additional Questions on Embeddings #6

Open jasongilman opened 2 months ago

jasongilman commented 2 months ago

This issue explores questions beyond which specific field to use within GeoParquet. The focus is on standardization, so that consumers can rely on consistent expectations and common tools can be built for operations like search or classification.

Chipping

Questions about breaking down input imagery into smaller areas or "chips" for AI/ML processing:

  1. How do you determine the ideal chip size for a given resolution?
     a. Consider the trade-off between relevance (not too big) and discernibility (not too small).
     b. How does chip size impact row count and data size?

  2. What tiling approach should be used for the chips?

    • Options:
      a. Chopping every NxN pixels based on the chip size
      b. Using a defined grid (e.g., H3 or Major Tom)
  3. How do you deal with the Modifiable Areal Unit Problem (MAUP)?
     a. How should edges within chips that cut across areas be handled?
     b. Would it make sense to have chips overlap?
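
To make the chip-size and overlap questions concrete, here is a minimal sketch of how chip size and overlap drive row count and raw embedding size. The function names, the 768-dimension float32 embedding, and the 10980 x 10980 pixel scene are illustrative assumptions, not values from this survey. The estimate covers raw embedding values only and ignores geometry, metadata, and Parquet compression.

```python
import math


def chips_per_axis(length_px: int, chip_px: int, overlap_px: int = 0) -> int:
    """Chips needed to cover one axis; the last chip is assumed to be padded."""
    if not 0 <= overlap_px < chip_px:
        raise ValueError("overlap_px must be in [0, chip_px)")
    if length_px <= chip_px:
        return 1
    stride = chip_px - overlap_px  # distance between chip origins
    return math.ceil((length_px - chip_px) / stride) + 1


def estimate_rows_and_bytes(width_px, height_px, chip_px, overlap_px=0,
                            embedding_dim=768, bytes_per_value=4):
    """Return (row count, raw embedding bytes) for one scene.

    embedding_dim and bytes_per_value (float32) are assumed defaults.
    """
    rows = (chips_per_axis(width_px, chip_px, overlap_px)
            * chips_per_axis(height_px, chip_px, overlap_px))
    return rows, rows * embedding_dim * bytes_per_value


# Hypothetical scene size; compare chip sizes and a 50% overlap.
for chip, overlap in [(256, 0), (256, 128), (512, 0)]:
    rows, nbytes = estimate_rows_and_bytes(10980, 10980, chip, overlap)
    print(f"chip={chip} overlap={overlap}: {rows} rows, {nbytes / 1e6:.1f} MB")
```

Under these assumptions, halving the chip size roughly quadruples the row count, and a 50% overlap does the same at a fixed chip size, which is one cost to weigh against the MAUP benefits of overlapping chips.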

Storage and Distribution

  1. What should the guidance be for file size?
     a. What is the ideal granularity of original scenes to GeoParquet files (containing rows of chips)?

    • The simplest approach is 1:1, where each scene is translated to a single GeoParquet file.
    • Alternatives that produce larger or smaller GeoParquet files may have benefits.
  2. Is there an ideal partitioning scheme within object storage?

  3. Alternatives to GeoParquet:
     a. Is GeoParquet preferred over PyTorch- or NumPy-style files?
     b. Has anyone considered putting embeddings in Zarr?
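
As one possible answer to the partitioning question, the sketch below builds a Hive-style object key from a spatial cell and a year. Everything here is an assumption for illustration: the Web Mercator quadkey as the spatial cell, the zoom level, the `quadkey=`/`year=` key layout, and the `scene_id` naming are hypothetical and not a scheme proposed in this issue. An H3 or Major Tom cell id could be substituted for the quadkey.

```python
import math


def lonlat_to_tile(lon: float, lat: float, zoom: int) -> tuple[int, int]:
    """Web Mercator (XYZ) tile indices containing a lon/lat point."""
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n)
    # Clamp so points on the eastern/southern edge stay inside the grid.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)


def tile_to_quadkey(x: int, y: int, zoom: int) -> str:
    """Interleave the bits of x and y into a base-4 quadkey string."""
    digits = []
    for i in range(zoom, 0, -1):
        mask = 1 << (i - 1)
        digit = (1 if x & mask else 0) + (2 if y & mask else 0)
        digits.append(str(digit))
    return "".join(digits)


def partition_key(lon, lat, year, scene_id, zoom=6):
    """Hypothetical object-store key: spatial cell first, then year."""
    qk = tile_to_quadkey(*lonlat_to_tile(lon, lat, zoom), zoom)
    return f"quadkey={qk}/year={year}/{scene_id}.parquet"


# Example with a made-up scene id.
print(partition_key(-122.3, 47.6, 2024, "scene-0001"))
```

Putting the spatial cell first lets a reader prune by area with a key prefix. Whether space or time should lead depends on the dominant query pattern, which is part of the open question.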

Searching and Analysis

  1. Searching directly from GeoParquet:
     a. Are users considering searching directly across GeoParquet stored in object stores, or do they prefer copying into a vector database?
     b. Would capturing vector indexes in the GeoParquet enable efficient searching?

  2. Are there any implications on the end analysis that would impact how embeddings are stored in GeoParquet?
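
For context on the direct-search question, here is a minimal sketch of what searching without an index amounts to: a full scan that scores every row against the query. The `(chip_id, embedding)` row shape and the function names are assumptions for illustration. A real reader would stream rows from GeoParquet files instead of using an in-memory list.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, rows, k=3):
    """Brute-force scan: score every (chip_id, embedding) row, keep the best k."""
    scored = ((cosine_similarity(query, emb), chip_id) for chip_id, emb in rows)
    return heapq.nlargest(k, scored)


# Toy rows standing in for chips read from a file.
rows = [("chip-a", [1.0, 0.0]), ("chip-b", [0.0, 1.0]), ("chip-c", [1.0, 1.0])]
for score, chip_id in top_k([1.0, 0.0], rows, k=2):
    print(chip_id, round(score, 3))
```

The cost of this scan grows linearly with row count. A vector database, or an index stored alongside the GeoParquet data, would aim to replace it with an approximate nearest-neighbor lookup.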