dmarx / Multi-Modal-Comparators

Unified API to facilitate usage of pre-trained "perceptor" models, a la CLIP

Feature brainstorming #4

Open · dmarx opened this issue 2 years ago

dmarx commented 2 years ago

let's not boil the ocean. goals for MVP:

MVP is basically just a git clone --recurse-submodules with maybe a few bells and whistles.

dmarx commented 2 years ago

imagining usage...

import perceptors as pct

pct.available_models() # list all models
pct.available_models('clip') # pattern match

clip_rn50 = pct.Perceptor('clip_rn50') # load a model
clip_vit16 = pct.Perceptor('clip_vit16') # load another

# combine models for multi-clip
multi_clip = clip_rn50 + clip_vit16

# adjust model-specific weight
multi_clip.set_weight('clip_vit16', .1) # set weight by name
multi_clip.set_weight(0, .5) # set weight by index

# manage models
multi_clip += pct.Perceptor('clip_rn101') # add another model algebraically
multi_clip.bind('clip_vit32') # add another clip model by name
multi_clip.unbind('clip_vit16') # dissociate a bound model by name

text = clip_rn50.tokenize_text('foo bar')
text_emb = clip_rn50.embed_text('foo bar')

img_emb = clip_rn50.embed_image('path/to/image')  # from a file path
img_emb = clip_rn50.embed_image(img_tensor)       # from a torch.Tensor
img_emb = clip_rn50.embed_image(pil_image)        # from a PIL.Image

multi_clip.embed_text('foo bar')
multi_clip.embed_image(img)  # accepts the same input types as a single perceptor
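
Purely as a sketch of how the combining and weighting imagined above might work under the hood (all class, method, and attribute names here are hypothetical, not a committed design):

# Hypothetical sketch of the multi-perceptor combination imagined above.
import perceptors as pct  # the loader imagined above

class MultiPerceptor:
    def __init__(self, perceptors=None):
        self.perceptors = dict(perceptors or {})  # name -> perceptor
        self.weights = {name: 1.0 for name in self.perceptors}

    def __add__(self, other):
        combined = MultiPerceptor(self.perceptors)
        combined.weights.update(self.weights)
        combined.bind(other)
        return combined

    def bind(self, perceptor, weight=1.0):
        if isinstance(perceptor, str):
            perceptor = pct.Perceptor(perceptor)
        self.perceptors[perceptor.name] = perceptor  # assumes a .name attribute
        self.weights[perceptor.name] = weight

    def unbind(self, name):
        self.perceptors.pop(name)
        self.weights.pop(name)

    def set_weight(self, key, weight):
        name = key if isinstance(key, str) else list(self.perceptors)[key]
        self.weights[name] = weight

    def embed_text(self, text):
        # one (weight, embedding) pair per bound model; how/whether to
        # aggregate these is left open here
        return {name: (self.weights[name], p.embed_text(text))
                for name, p in self.perceptors.items()}
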
apolinario commented 2 years ago

One small issue people ran into when adding SLIP to many different text-to-image notebooks and codebases was that the input resolution wasn't part of the model.

So you see things like this in Disco Diffusion, for example:

# when using the SLIP Base model the dimensions need to be hard coded to avoid
# AttributeError: 'VisionTransformer' object has no attribute 'input_resolution'
try:
    input_resolution = model_stat["clip_model"].visual.input_resolution
except:
    input_resolution = 224

I feel that having a default but user-changeable input resolution per model, for when the model itself doesn't expose one, could be part of the feature list.
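
A minimal sketch of what that fallback could look like (names like FALLBACK_RESOLUTIONS and get_input_resolution are just illustrative, not an existing API):

# Hypothetical sketch: per-model default input resolution that users can override.
FALLBACK_RESOLUTIONS = {'slip_base': 224}  # assumed values, not pulled from any real registry

def get_input_resolution(model, model_name, override=None):
    """Priority: explicit user override > attribute on the model > registry default."""
    if override is not None:
        return override
    resolution = getattr(getattr(model, 'visual', None), 'input_resolution', None)
    return resolution if resolution is not None else FALLBACK_RESOLUTIONS.get(model_name, 224)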

dmarx commented 2 years ago

100%, I've already encountered this issue with other CLIP providers too. I tracked down the code snippet in the original openai release that calculates this, but I like the idea of a default attribute too.

apolinario commented 2 years ago

Another point in reference to usage: I feel there could be two ways of using it. One would be very similar to what you sketched under "imagining usage", while the other could be identical to OpenAI's CLIP. That second mode might not allow for some of the fancy combinations of perceptors (although I feel this could be bridged), but on the other hand it would allow for snappy adoption.

Someone could just replace from CLIP import clip with from mmc import clip and everything would work automatically, with a bunch more perceptors out of the box. It could be an entry point to then say "hey, now that you're using this library, why not replace your custom multi-perceptor code with this one?"
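
Roughly, the shim could look something like this (module layout and helper names are only illustrative, and assume the loader sketched earlier in this thread):

# Hypothetical "mmc/clip.py" shim mimicking the openai/CLIP surface, so that
# swapping `from CLIP import clip` for `from mmc import clip` just works.
import perceptors as pct  # the loader imagined earlier in this thread

def load(name, device='cuda', jit=False):
    """Mirror clip.load(): return (model, preprocess) for the named perceptor."""
    perceptor = pct.Perceptor(name)
    return perceptor, perceptor.preprocess  # assumes a .preprocess attribute

def tokenize(texts, context_length=77):
    """Mirror clip.tokenize(); delegates to a default tokenizer (placeholder)."""
    return pct.default_tokenizer(texts, context_length=context_length)  # hypothetical helper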

dmarx commented 2 years ago

This is a great idea. I've noticed that there seem to be two "families" of CLIP implementations: codebases based on openai/CLIP, and codebases based on huggingface's CLIP.

Rather than changing the classes we have now, maybe we could add a wrapper class or decorator for specifying if a user wants an interface that resembles a common model family. This way, we could keep using the modality-agnostic system and leverage similar wrappers for making drop-in-able tools for tasks beyond TTI.

Is that contrived? Here's how it might look:

my_mmc = ...  # loading code
my_mmc = mmc.api_wrappers.openai_clip(my_mmc)

Or actually... I guess there's no reason we couldn't go a step further and wrap the multi-mmc to make convenience classes that are pinned to specific modalities and emulate the desired APIs. I think this is closer to what you originally had in mind.

The more I think about this, the more I like it.
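
A very rough sketch of what such a wrapper might look like (class and method names are hypothetical, and the wrapped perceptor's embed_image/embed_text methods are the ones imagined earlier in this thread):

# Hypothetical sketch of mmc.api_wrappers.openai_clip: wrap an mmc perceptor so it
# quacks like an openai/CLIP model (encode_image / encode_text, visual.input_resolution).
from types import SimpleNamespace

class OpenaiClipWrapper:
    def __init__(self, perceptor, input_resolution=224):
        self.perceptor = perceptor
        # expose the attribute downstream code expects, with an assumed default
        self.visual = SimpleNamespace(input_resolution=input_resolution)

    def encode_image(self, image):
        return self.perceptor.embed_image(image)  # hypothetical mmc method

    def encode_text(self, tokens):
        return self.perceptor.embed_text(tokens)  # hypothetical mmc method

def openai_clip(perceptor, **kwargs):
    """Factory form matching the usage sketched above."""
    return OpenaiClipWrapper(perceptor, **kwargs)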
