JosefAlbers / Phi-3-Vision-MLX

Phi-3.5 for Mac: Locally-run Vision and Language Models for Apple Silicon
https://medium.com/@albersj66
MIT License

Running benchmark at around 50% speed + image tokenization questions #10

Closed michaellee1 closed 1 month ago

michaellee1 commented 1 month ago

Hi - thanks so much for making this repo!

I just ran the benchmark on my 32GB M1 MacBook Pro and I'm getting tps numbers roughly 60% of what was reported. Any idea what might be going on?

Secondly, it seems like no matter the size of the image, the number of input tokens used for the image is the same. Is there a way to change the number of input tokens used for images?

JosefAlbers commented 1 month ago

First of all, thank you for your interest in the project!

1. Regarding the lower tps (tokens per second) numbers:

Are you seeing lower tps across all tasks (text generation, image captioning, and batched generation)? Could you run the benchmark() function and share the results? This will help us compare the performance across different model configurations.
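
Something like the following should do it (a minimal sketch; the import path is an assumption based on the package name, so adjust it to your setup):

```python
# Minimal sketch -- the import path is an assumption of a typical install;
# adjust it to however you have the package set up.
from phi_3_vision_mlx import benchmark

benchmark()  # times text generation, image captioning, and batched generation
```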

2. Regarding input tokens for images:

You're correct that the current implementation uses a fixed number of input tokens for images, regardless of their size. This is by design, based on the model's architecture: the Phi-3 Vision model uses a fixed patch size and a maximum number of patches, which results in a consistent number of tokens for the image input. This approach allows for efficient processing and consistent behavior across various image sizes. Modifying it would require significant changes to the model architecture and would likely require retraining. If you have a specific use case that requires variable token sizes for images, please elaborate, and we can discuss potential approaches.
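
To make that concrete, here's a rough sketch of the arithmetic (the numbers are illustrative assumptions, not the exact values in Phi-3 Vision's preprocessor):

```python
# Rough sketch of why the image token count is constant.
# target_size and patch_size are illustrative assumptions, not the exact
# values used by Phi-3 Vision's preprocessor.
def num_image_tokens(target_size=336, patch_size=14):
    # Every image is resized/padded to the same target resolution, so the
    # patch grid, and therefore the number of image tokens, never changes.
    patches_per_side = target_size // patch_size
    return patches_per_side * patches_per_side

print(num_image_tokens())  # 576, whatever the original image dimensions were
```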

michaellee1 commented 1 month ago

Sure! I just read your blog series too - great writing 😄.

Could you run the benchmark() function and share the results?

Yes. See below:

| Task               | Vanilla Model | Quantized Model | Quantized Cache | LoRA Adapter |
|--------------------|---------------|-----------------|-----------------|--------------|
| Text Generation    | 14.24 tps     | 37.29 tps       | 10.33 tps       | 12.78 tps    |
| Image Captioning   | 12.17 tps     | 29.22 tps       | 2.66 tps        | 12.01 tps    |
| Batched Generation | 144.16 tps    | 108.03 tps      | 61.46 tps       | 133.16 tps   |

My system is a 2021 MacBook Pro with an M1 and 32GB of memory.

If you have a specific use case that requires variable token sizes for images, please elaborate, and we can discuss potential approaches.

I'm playing around with some user-interaction automation (i.e., handling UI elements on the screen), where some of the questions are easy, for example: "Which of these two symbol buttons should I press to achieve X?" This requires only a tiny image and is very simple to answer.

As the user will be waiting, ideally something like this would be answerable very quickly. I don't think a smaller model is necessarily the answer, since only a small number of input and output tokens is needed, so it seems like I could still use a model like Phi-3.5.

Ideally, there would be some way to spend more compute on larger images that genuinely need the resolution and scope, and proportionally less (at least in input tokens) on smaller images. Do you see any way to address that? The current implementation takes around 17 seconds to process each image, which is unfortunately too slow.
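
For concreteness, this is roughly the preprocessing I can already do on my side before calling the model (just a PIL sketch; `crop_box` is a hypothetical bounding box I already know from the UI layout), though as you noted it wouldn't change the number of image tokens today:

```python
# Sketch of the preprocessing I have in mind (PIL only; crop_box is a
# hypothetical bounding box I already know from the UI layout).
from PIL import Image

def prepare_ui_snippet(path, crop_box, max_side=224):
    img = Image.open(path).crop(crop_box)  # keep just the two candidate buttons
    img.thumbnail((max_side, max_side))    # shrink; the question needs little detail
    return img
```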

Completely understand if this is out of the scope of the project!

JosefAlbers commented 1 month ago
1. Regarding the lower tps numbers: My benchmark was run on a 2022 Mac Studio with an M1 Max and 64GB of RAM, which likely explains why you're seeing lower numbers on your system.

2. Regarding input tokens for images: Variable image processing is an interesting idea. I've been mulling it over for the past several days, but haven't yet found a good solution given the model's architecture.

Thanks again for raising these points and for your interest in the project!