Haidra-Org / AI-Horde

A crowdsourced distributed cluster for AI art and text generation

New alchemy forms - clip image feature extraction, clip text encode #356

Open · tazlin opened this issue 6 months ago

tazlin commented 6 months ago

There are use cases for client-side manipulation of the intermediate results of the CLIP interrogation process.

To compare an image to text via CLIP, the following happens (see the sketch after this list):

  1. The text is encoded into features. open_clip uses clip_model.encode_text(text_tokens). This returns a tensor.
  2. The image features are extracted using the CLIP model. open_clip uses clip_model.encode_image(...). This returns a tensor.
  3. The tensors are normalized.
  4. The image features and the text features are compared.
  5. A similarity score is assigned and returned.
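
For reference, the whole flow looks roughly like this with open_clip (the model name, pretrained tag, prompts, and image path below are illustrative, not necessarily what the horde uses):

```python
import open_clip
import torch
from PIL import Image

# Load a CLIP model, its preprocessing transform, and its tokenizer.
# "ViT-L-14"/"openai" is just one example of a supported model.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-L-14", pretrained="openai"
)
tokenizer = open_clip.get_tokenizer("ViT-L-14")

image = preprocess(Image.open("example.png")).unsqueeze(0)
text_tokens = tokenizer(["a photo of a cat", "a photo of a dog"])

with torch.no_grad():
    # Steps 1 + 2: encode text and image into feature tensors.
    text_features = model.encode_text(text_tokens)
    image_features = model.encode_image(image)

    # Step 3: normalize both tensors.
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)

    # Steps 4 + 5: cosine similarity via dot product, one score per prompt.
    similarity = image_features @ text_features.T
```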

This feature request would allow the results of steps 1 and 2 to be returned independently, either as part of a regular interrogate request or on their own. Clients could then perform the math pertinent to their use case without needing to load a CLIP model locally, even in slow or RAM-limited environments. Certain kinds of image-searching/database schemes could benefit from this.

I propose the following forms be added (a usage sketch follows the list):

  1. encode_text

    • Accepts a list of strings and the value of a supported CLIP model.
    • For each string returns a .safetensors file containing the encoded text tensor and which model was used to encode it.
  2. encode_image

    • Accepts a source_image and the value of a supported CLIP model.
    • Returns a .safetensors file containing the encoded image features and which model was used to encode it.
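
A minimal sketch of how a form's output could be written and then consumed client-side with the safetensors library; the tensor key, metadata layout, and file names are assumptions for illustration, not a spec:

```python
import torch
from safetensors.torch import save_file, load_file

# Worker side (hypothetical): persist the encoded features along with a
# note of which model produced them.
text_features = torch.randn(1, 768)   # stand-in for clip_model.encode_text(...)
image_features = torch.randn(1, 768)  # stand-in for clip_model.encode_image(...)
save_file(
    {"features": text_features},
    "encode_text_result.safetensors",
    metadata={"clip_model": "ViT-L-14/openai"},
)
save_file(
    {"features": image_features},
    "encode_image_result.safetensors",
    metadata={"clip_model": "ViT-L-14/openai"},
)

# Client side: load both results and do the comparison math locally,
# without ever loading a CLIP model - viable even with slow/limited RAM.
text_features = load_file("encode_text_result.safetensors")["features"]
image_features = load_file("encode_image_result.safetensors")["features"]
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
similarity = (image_features @ text_features.T).item()
```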

This proposal has the obvious wrinkle of needing to support the upload of .safetensors files, though these files are only on the order of single-digit kilobytes each.

Related to https://github.com/Haidra-Org/horde-worker-reGen/issues/9.

rbrtcs1 commented 6 months ago

A useful feature might be to opt into including the resulting image embeddings with an image generation request.

I.e., in the /generate/status/ endpoint, each generation result would include an R2 URL pointing to that image's calculated embedding safetensors file.
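
For illustration only, a generation entry in the status response might then look something like the sketch below; the embedding field name is purely an assumption, and the URLs and values are placeholders:

```python
# Hypothetical /generate/status/ generation entry with an opted-in
# embedding URL ("embedding" is an invented field name, not current API).
generation = {
    "id": "00000000-0000-0000-0000-000000000000",
    "img": "https://example-r2-bucket/generated-image.webp",
    "seed": "123456",
    "model": "stable_diffusion",
    "embedding": "https://example-r2-bucket/image-embedding.safetensors",
}
```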

That being said, the need for this is easily avoided by just doing the alchemy request separately, and I imagine this version of the request would be more difficult to set up.

db0 commented 6 months ago

I think we might avoid using R2 here and just base64-encode the safetensors in the DB. A couple of KB of data per file shouldn't be a terrible amount, and if bandwidth starts being choked because of these, I can always switch to R2 later.
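
A back-of-the-envelope sketch of that trade-off (the file name and sizes are illustrative):

```python
import base64

# A couple-KB safetensors blob grows by roughly a third when base64-encoded,
# which is still small enough to store inline in a DB row.
with open("encode_text_result.safetensors", "rb") as f:
    raw = f.read()
b64 = base64.b64encode(raw).decode("ascii")
print(f"{len(raw)} raw bytes -> {len(b64)} base64 characters")
```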