nelsonic commented 10 months ago

@LuchoTurtle as you've noted in the README.md > What about other models? section:

The bigger the model the more resources consumed and slower the result ... 💰 ⏳ This is your opportunity to do some actual Software Engineering and write-up the findings!

Todo

[x] Create a new file model-comparison.md
[x] Give a brief intro to why someone would want to use a smaller/bigger model (e.g. for a low-stakes demo)
[ ] Deploy all three models:
1. ResNet-50
2. BLIP (base)
3. BLIP large.
[x] Compare the response time and classification string for the same input images.
- [x] Pick 7 sample images. e.g. a Pet, Vehicle, Food, Person, Landscape, Wild Animal, Random (Your Choice).
[x] Create a Summary Table with the columns:
- Model Name
- Size in GB
- RAM required to run it
- Approximate monthly cost on Fly.io (other platforms may be cheaper/more expensive; we're just using Fly.io for illustration purposes...)
- Machine Startup-time from "paused" (cold boot) (just to show the initial page)
- Approximate response time for an image. (put the average in the summary table)
- Sample classification string
[ ] Create a Detail Table with the columns:
- Model Name
- Image thumbnail
- Image Description returned by the classifier/model
- Response time in ms

Each row in the Detail table should be an entry for a given model. Cluster the results together e.g. the Cat/Kitten pick for each model should be together to facilitate comparison.

[!NOTE] Looks like https://huggingface.co/Salesforce/blip-image-captioning-base/tree/main was last updated 11 months ago ...

Can we try https://github.com/salesforce/LAVIS ? 💡

nelsonic commented 10 months ago

@LuchoTurtle you asked on Standup if we should compare "just" these 3 models. I think a "small", "medium" and "large" is a good starting point. But if we get feedback from people on HN (once you post the link 😉) that they want more models compared, then more can easily be added.

LuchoTurtle commented 10 months ago

While it's true that there isn't a de facto leaderboard for image captioning (part of computer vision) tasks like MTEB, there's a reason for it.

From what I've seen, the most regarded benchmark comparison there that puts different models side to side is https://paperswithcode.com/sota/image-classification-on-imagenet

It doesn't, however, have multimodal models (models that can receive multiple types of input), which BLIP is. I can try to get a small benchmark going but I'm afraid I don't know how I can make it "data sciency" and compare accuracy between the models you've suggested.

There are already tools that compare different one-shot models, like https://huggingface.co/spaces/nielsr/comparing-captioning-models.

What I'm thinking is :

[ ] either perhaps comparing the absolute value between embeddings of a dataset ImageNet image and what the model like BLIP yields to check for accuracy.
[ ] checking https://huggingface.co/docs/transformers/v4.35.0/en/tasks/image_captioning#evaluate.

I'll see to it 👌

nelsonic commented 10 months ago

The only thing we want is a real-world comparison. i.e. We wanted to use an existing model to classify images. We compared these 3 models along 3 dimensions: Quality, Speed & Cost. This is way more interesting to a decision maker than the synthetic benchmark/leaderboard. The Massive Text Embedding Benchmark (MTEB) Leaderboard is interesting for Embeddings ...

But your average person has no clue what all the columns in the tables mean. Is a bigger number better or worse? in some cases the "best" model has a worse score than others. How is the ranking calculated?

Anyway, we just want to compare the models that are available to us for the purposes of classifying images. The table will be useful to us and interesting to several thousand other people on HN. 👍

ndrean commented 10 months ago

Just again my few euro, but I already tried microsoft and facebook and found the results bad, when compared to Saleforce/Blip.

microsoft Screenshot 2023-11-15 at 12 32 42

salesforce/blip-base Screenshot 2023-11-15 at 12 36 26

nelsonic commented 10 months ago

Much more descriptive:

But app still takes a very long time to load ... ⏳

LuchoTurtle commented 10 months ago

@nelsonic do you mean to load or to get a description? If it takes time to load, it's probably because the machine was "asleep" and you had to boot it again/"wake it" (because we've set machine instances to sleep after a period of inactivity to save costs). This is very much normal. I've inclusively just opened the link and it loaded instantly.

If there's a problem with the time to load the app from a machine that's asleep, that's another issue entirely. Even then, by caching the models, it takes seconds tops, instead of minutes that would have to wasted by re-downloading the models on every app startup.

nelsonic commented 10 months ago

Yeah, agree that Fly.io machine wake time is a separate issue that isn't really under our control. You've done a good job of caching the model. 👍 We just need to trigger the "wake from sleep" when someone views the README.md as noted in: #11

Meanwhile the descriptions are much better!

ndrean commented 10 months ago

I suppose you know you can set min_machines_running = 1 in the "fly.toml", depends if you want this.

nelsonic commented 10 months ago

Yeah, when we “productionise” this feature, we will set it to be “always on” (min=1) but for now we just want to focus on cold startup time. 👌

dwyl / image-classifier

Feat: Comparing Pre-trained Image Classification Models #12

Todo