Clay-foundation / earth-text

Adding language to Clay

Clarifications for use of text-to-earth work #10

Closed: lauracchen closed this issue 5 months ago

lauracchen commented 5 months ago

Hi @yellowcap @rramosp @alkalait !

I want to summarize my understanding of this workstream here so that you can correct it as needed, to make sure our app team is on the same page about how to integrate these outputs. Thank you for your help in advance!

The basic text-to-earth model will be trained with OSM tags as inputs and will output embeddings. The contract includes training via a frozen encoder (text-to-earth model as a downstream task) and an unfrozen encoder method (text-to-earth and Clay vision models trained in parallel), with the same outputs (embeddings). I'm not sure - will Clay's current embeddings also be an input in the frozen encoder case? Does the text-to-earth model (decoder) actually update the embeddings, or does it just train a model to associate tags with a set of existing embeddings?

So in terms of how we'd translate this for app use once the model is trained: in theory, an app user could input a tag, and we'd pass that tag through the text-to-earth model (decoder). They would then get one or a few embeddings back, and we would perform a similarity search to get the rest of the results. I'm also assuming this is the same whether it's trained as a downstream task or the text-to-earth model is trained in parallel.

Does this sound right?

This was included in the text-to-earth scope (my summary above assumes it remains the same). A user could then perform a search using the following steps:

  1. Select a set of tags to search for
  2. Based on the tags, find images that contain these tags
  3. Manually select the images that best match the intended content
  4. For the matched images, find similar images based on embeddings

We could also see a combined text search where one could look for image chips that contain one tag while also having, within a search radius, images with a second tag. This would allow us to find two phenomena based on geographic proximity.
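Purely to illustrate what I have in mind (none of this is code you've committed to; the GeoDataFrame schema, column names, and radius below are my assumptions), the proximity combination could look something like:

```python
import geopandas as gpd

def proximity_search(chips: gpd.GeoDataFrame, tag_a: str, tag_b: str,
                     radius_m: float = 5000.0) -> gpd.GeoDataFrame:
    """Chips predicted to contain tag_a that lie within radius_m of a chip
    predicted to contain tag_b.

    Assumes (hypothetically) that `chips` has a geometry column in a metric
    CRS and one boolean column per OSM tag, e.g. chips["river"].
    """
    with_a = chips[chips[tag_a]]
    with_b = chips[chips[tag_b]].copy()
    # grow the tag_b chips by the search radius, then spatially join
    with_b["geometry"] = with_b.geometry.buffer(radius_m)
    return gpd.sjoin(with_a, with_b[["geometry"]], how="inner", predicate="intersects")
```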

cc @stephen-downs @agsafchuk @danhammer @MaceGrim

alkalait commented 5 months ago

Hello @lauracchen!

The approach we're following is slightly different to what you've described. You are right in that we're extracting appropriate OSM tags for each chip (in California for now), and from that we derive ground-truth labels about the presence of the tags.

First I'll describe the multi-label classification approach.

It's a multi-label classification task because each of the 90-or-so labels/OSM tags can be present in a chip, and each chip can have multiple labels present. Therefore, each chip's ground truth is a vector of 1's and 0's.

We'll use a frozen Clay v0.2 model to encode all image chips. At the same time, we'll train a separate "head" whose task is to take an image-chip embedding as input and predict the ground truth of each chip – the aforementioned binary vector. Assuming we do this successfully, every image chip would have an associated set of predicted OSM labels. Then, against those, the user can search for the chips (and associated locations) that match their chosen set of tags, as described in step 1.
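As a rough, illustrative sketch (not the actual earth-text code), such a head could look like the snippet below; the embedding size, hidden layer, and training details are all assumptions on my part:

```python
import torch
from torch import nn

EMBED_DIM = 768  # assumed Clay v0.2 embedding size
N_TAGS = 90      # roughly 90 OSM tags/labels

class MultiLabelHead(nn.Module):
    """Predicts, from a frozen image-chip embedding, which OSM tags are present."""

    def __init__(self, embed_dim: int = EMBED_DIM, n_tags: int = N_TAGS):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, 256),
            nn.ReLU(),
            nn.Linear(256, n_tags),  # one logit per tag
        )

    def forward(self, chip_embedding: torch.Tensor) -> torch.Tensor:
        return self.net(chip_embedding)

head = MultiLabelHead()
criterion = nn.BCEWithLogitsLoss()  # multi-label: an independent sigmoid per tag

# one training step on a batch of precomputed (frozen) Clay embeddings
embeddings = torch.randn(32, EMBED_DIM)              # stand-in for Clay v0.2 outputs
targets = torch.randint(0, 2, (32, N_TAGS)).float()  # ground-truth 0/1 tag vectors
loss = criterion(head(embeddings), targets)
loss.backward()
```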

We also plan to try another approach, based on embedding similarity.

Also with a frozen Clay vX.Y model, we'll train a separate encoder whose task is to take the aforementioned binary vector of OSM tags as input and output a text embedding that resembles the image embedding of the associated image chip.
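A minimal sketch of that encoder, again with assumed dimensions and an assumed cosine objective (the real training setup may well differ):

```python
import torch
from torch import nn
import torch.nn.functional as F

N_TAGS, EMBED_DIM = 90, 768  # assumed sizes

class TagEncoder(nn.Module):
    """Maps a binary OSM-tag vector into Clay's image-embedding space."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(N_TAGS, 256), nn.ReLU(), nn.Linear(256, EMBED_DIM)
        )

    def forward(self, tag_vector: torch.Tensor) -> torch.Tensor:
        return self.net(tag_vector)

encoder = TagEncoder()

# one training step: pull the "text" embedding towards the chip's frozen image embedding
tag_vectors = torch.randint(0, 2, (32, N_TAGS)).float()
image_embeddings = torch.randn(32, EMBED_DIM)  # precomputed by the frozen Clay model
loss = 1 - F.cosine_similarity(encoder(tag_vectors), image_embeddings).mean()
loss.backward()
```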

How would this translate to the app?

The user would choose their search tags, which would then be converted to an appropriate binary vector. The binary vector would be transformed into a text embedding via the aforementioned encoder. Then we would run a similarity-based search for the best image-chip embeddings.
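Purely as an illustration of that flow (the function and variable names below are made up, not the interface we'll actually deliver):

```python
import torch
import torch.nn.functional as F

def search(user_tags: list, all_tags: list, encoder, chip_embeddings: torch.Tensor, k: int = 20):
    """Rank chips by cosine similarity between the query's text embedding and
    the precomputed image-chip embeddings (illustrative only)."""
    # 1. user's tags -> binary vector over the ~90 known OSM tags
    query = torch.tensor([[1.0 if t in user_tags else 0.0 for t in all_tags]])
    # 2. binary vector -> text embedding via the trained tag encoder
    with torch.no_grad():
        text_embedding = encoder(query)
    # 3. similarity search against all image-chip embeddings
    scores = F.cosine_similarity(text_embedding, chip_embeddings)
    return scores.topk(k).indices  # indices of the best-matching chips
```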

Let me know if you have any questions :)

rramosp commented 5 months ago

thanks Laura, Freddy ... just to complement what Freddy wrote: the reason we are doing this two-step approach (multilabel classification and then embedding similarity) is that we are not sure how meaningful each OSM tag really is with respect to the actual satellite imagery.

For instance, there are many tags for small objects (as compared with a Sentinel-2 pixel), such as houses, certain tracks or highways, etc. However, there are many chips that contain many of those small objects (such as in a city), so although they are not individually identifiable, they do form patterns in the chip.

We are not sure how a model might react to such cases, and our intuition is that with the multilabel stage we should be able to judge which specific tags work better or worse with S2. If we went straight to the embedding similarity, we would not be able to distinguish whether a certain set of tag embeddings doesn't work well because the embeddings are poor or because the objects cannot be identified well in the satellite imagery.

hope that helps. let us know if there's anything else you might need!

agsafchuk commented 5 months ago

This is really helpful context. Thank you so much.

@rramosp @alkalait Is it possible to get a list of those 90-ish tags you are working with so we can do some UI concepting around them?

brunosan commented 5 months ago

I think these are the ones:

https://github.com/Clay-foundation/earth-text/blob/main/src/earthtext/osm/multilabel.py#L35-L56

lauracchen commented 5 months ago

Thank you both!! This is super helpful! So for a second take at clarifying the process that the app would use, is this right?

For the first version:

  1. We produce binary vectors for each chip
  2. User chooses search tags
  3. Convert to binary vector
  4. We surface any matching vectors/chips

If this is right, will you be providing the vectors set in 1 and the code in 3 and 4?

For the second version:

  1. we produce text embedding versions for every chip over a certain region
  2. User chooses search tags
  3. Convert to binary vector
  4. encode binary vector as text embedding
  5. Similarity search using the entry text embedding. (Do you think this should be searching a space that combines the image and text embeddings? Or just searching a text embedding DB?)

If this is right, will you be providing the code that performs 3, 4, and 5? And the embeddings mentioned in 1?

agsafchuk commented 5 months ago

Another question...Will we be able to identify where inside a chip we are seeing a binary "true" for a specific tag or do we just get a chip returned with a similarity value?

Example: Can we visually highlight the boundaries of an area labeled as "river" on a satellite image or can we just say this chip has a river in it somewhere?

MaceGrim commented 5 months ago

OSM also has the polygons for these things. Here's a little lake near me in South Dakota. If the data exists in OSM, it seems right that we could show those polygons.

[screenshot: OSM polygon of a small lake in South Dakota]

alkalait commented 5 months ago

For the first version:

  1. We produce binary vectors for each chip
  2. User chooses search tags
  3. Convert to binary vector
  4. We surface any matching vectors/chips

Correct!

If this is right, will you be providing the vectors set in 1 and the code in 3 and 4?

Yes.
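For illustration only, the matching in step 4 of the first version could be as simple as the sketch below (not the code we'll deliver; the array layout is an assumption):

```python
import numpy as np

def match_chips(chip_labels: np.ndarray, query_tag_idx: list) -> np.ndarray:
    """Indices of chips whose predicted 0/1 tag vector contains every query tag.

    chip_labels: (n_chips, n_tags) matrix of predicted OSM tags (step 1);
    query_tag_idx: column indices of the user's chosen tags (steps 2-3).
    """
    return np.flatnonzero(chip_labels[:, query_tag_idx].all(axis=1))
```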

For the second version:

  1. we produce text embedding versions for every chip over a certain region
  2. User chooses search tags
  3. Convert to binary vector
  4. encode binary vector as text embedding
  5. Similarity search using the entry text embedding. (Do you think this should be searching a space that combines the image and text embeddings? Or just searching a text embedding DB?)

Correct. On 5, this would be searching a space where the text embeddings have been trained/fit to match the visual embeddings. So yes, in effect it would be searching a space that combines image and text embeddings. In principle, any future OSM tag-based search query would lie somewhere in this joint embedding space. And the name of the game would be to find the "best" chips, e.g. those whose embeddings are closest to the embedding of the user's input.

If this is right, will you be providing the code that performs 3, 4, and 5? And the embeddings mentioned in 1?

Yes.

alkalait commented 5 months ago

Another question...Will we be able to identify where inside a chip we are seeing a binary "true" for a specific tag or do we just get a chip returned with a similarity value?

Example: Can we visually highlight the boundaries of an area labeled as "river" on a satellite image or can we just say this chip has a river in it somewhere?

I can't say with confidence this will be possible, even though it's not out of the question. The setups we've discussed are not explicitly designed for localising the detected tag and segmenting it out for the user. This ML task is known as segmentation, and it requires a more laborious ground truth in which (you've guessed it) each object of interest is segmented out in a binary 2D map (e.g. each pixel is river-or-not), which fortunately is already present in the OSM polygons, as @MaceGrim points out.

However, since each chip is further patched out onto a 16 × 16 grid, we might be able to devise an ad-hoc mechanism in which we ask: which of the 256 (= 16²) patches of the chip excites the "river" output of the neural network the most? Also, a transformer naturally lends itself to this ad-hoc mechanism. So there is some hope.
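Purely to illustrate that idea, and assuming we can get per-patch embeddings out of the frozen Clay encoder and reuse the trained multi-label head on them (both of which are assumptions), the mechanism could look like:

```python
import torch

def localise_tag(patch_embeddings: torch.Tensor, head, tag_index: int):
    """Score each of the 16 x 16 patches of one chip with the trained multi-label
    head and return the grid position that excites the given tag's logit most.

    patch_embeddings: (256, embed_dim) per-patch embeddings from the frozen Clay
    encoder (assumed to be accessible; that is the ad-hoc part).
    """
    with torch.no_grad():
        logits = head(patch_embeddings)  # (256, n_tags)
    scores = logits[:, tag_index]        # e.g. the "river" output
    best = int(scores.argmax())
    return divmod(best, 16)              # (row, col) in the 16 x 16 grid
```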

That said, we recommend focusing on the easier task of plain multi-label classification first.

rramosp commented 5 months ago

hi all,

to answer @lauracchen's questions:

  1. there is already an embedding space, which is the one given by Clay model v0.2 on chip images
  2. each chip has a set of OSM tags associated with it
  3. we will produce a text version of those tags (maybe using some prompt to an existing LLM or something)
  4. we will train a text model that, given the text tags for a chip, will attempt to predict the embeddings produced by Clay model v0.2 on the image chip. It is like attempting to map the text tags of a chip into the already existing embedding space induced by Clay model v0.2 (see the sketch after this list)
  5. we produce all embeddings of image chips using Clay model v0.3
  6. then, given a user's selection of tags, we use the text model to obtain its embedding and retrieve the image chips whose image embeddings are most similar
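purely as an illustration of steps 3 and 4, with an off-the-shelf sentence encoder standing in for the text step (the real pipeline may use an LLM prompt instead; the model name and sizes below are placeholders):

```python
import torch
from torch import nn
from sentence_transformers import SentenceTransformer

text_encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder text model
projector = nn.Linear(384, 768)  # MiniLM dim -> assumed Clay v0.2 embedding dim

def tags_to_text(tags):
    # step 3: a naive "text version" of a chip's OSM tags (an LLM prompt could do better)
    return "A satellite image chip containing: " + ", ".join(tags)

# step 4: one training step pulling the projected text embedding towards the
# chip's Clay image embedding (random stand-in targets here)
texts = [tags_to_text(["river", "farmland"]), tags_to_text(["residential", "highway"])]
text_emb = torch.tensor(text_encoder.encode(texts))  # (2, 384)
clay_emb = torch.randn(2, 768)                       # precomputed Clay embeddings
loss = nn.functional.mse_loss(projector(text_emb), clay_emb)
loss.backward()
```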

we will get as far as we can for this second version

to answer @agsafchuk:

hope this helps, let me know if you need further clarification

lauracchen commented 5 months ago

Thank you @alkalait @rramosp for these clarifications!!