Closed michaellee1 closed 1 month ago
First of all, thank you for your interest in the project!
Are you seeing lower tps across all tasks (text generation, image captioning, and batched generation)? Could you run the benchmark()
function and share the results? This will help us compare the performance across different model configurations.
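For reference, something along these lines should do it (I'm assuming the package is importable as `phi_3_vision_mlx`; adjust the import to match your install):

```python
# Runs the built-in benchmark across text generation, image captioning,
# and batched generation, printing tokens-per-second for each configuration.
from phi_3_vision_mlx import benchmark

benchmark()
```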
You're correct that the current implementation uses a fixed number of input tokens for images, regardless of their size. This is by design, based on the model's architecture. The Phi-3 Vision model uses a fixed patch size and a maximum number of patches, which results in a consistent number of tokens for the image input. This approach allows for efficient processing and consistent behavior across various image sizes. Modifying this aspect would require significant changes to the model architecture and would likely require retraining. If you have a specific use case that requires variable token sizes for images, please elaborate, and we can discuss potential approaches.
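To make the arithmetic concrete (illustrative numbers, not taken from the code): because the image is resized to a fixed resolution before patching, the token count does not depend on the original image dimensions.

```python
def image_token_count(input_resolution=336, patch_size=14):
    """Vision tokens produced for one image crop (illustrative values).

    The image is resized to `input_resolution` x `input_resolution`
    before being split into patches, so the result never depends on
    the original image size.
    """
    patches_per_side = input_resolution // patch_size
    return patches_per_side ** 2

print(image_token_count())  # 576 tokens, whether the source image was 64x64 or 4K
```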
Sure! I just read your blog series too - great writing 😄 .
Could you run the benchmark() function and share the results?
Yes. See below:
| Task | Vanilla Model | Quantized Model | Quantized Cache | LoRA Adapter |
|-----------------------|---------------|-----------------|-----------------|--------------|
| Text Generation | 14.24 tps | 37.29 tps | 10.33 tps | 12.78 tps |
| Image Captioning | 12.17 tps | 29.22 tps | 2.66 tps | 12.01 tps |
| Batched Generation | 144.16 tps | 108.03 tps | 61.46 tps | 133.16 tps |
My system is a 2021 MacBook Pro (M1) with 32 GB of memory.
If you have a specific use case that requires variable token sizes for images, please elaborate, and we can discuss potential approaches.
I'm playing around with some user-interaction automation (i.e., handling UI elements on the screen), where some of the questions are easy. For example: "Which of these two symbol buttons should I press to achieve X?" This requires only a tiny image and is very simple to answer.
As the user will be waiting, ideally something like this would be answerable very quickly. I don't think a smaller model is necessarily the answer, since only a small number of input and output tokens are needed, so it seems like I should be able to use a model like Phi-3.5.
Ideally, there would be some way to spend more compute on larger images that need the resolution and scope, and proportionally less (at least on input tokens) on smaller images. Do you see any way to address that? The current implementation takes around 17 s per image, which is unfortunately too slow.
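To illustrate what I mean (purely hypothetical helper, not something the repo exposes), the image token budget could scale with image area up to the current fixed maximum:

```python
def proposed_image_token_budget(width, height, patch_size=14,
                                min_tokens=64, max_tokens=576):
    """Hypothetical: pay for image tokens proportionally to image area,
    capped at the current fixed budget."""
    raw = max(1, width // patch_size) * max(1, height // patch_size)
    return min(max(raw, min_tokens), max_tokens)

print(proposed_image_token_budget(64, 64))      # tiny UI crop   -> 64
print(proposed_image_token_budget(1920, 1080))  # full screenshot -> 576 (capped)
```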
Completely understand if this is out of the scope of the project!
Thanks again for raising these points and for your interest in the project!
Hi - thanks so much for making this repo!
I just ran the benchmark on my 32GB M1 Macbook Pro and I'm getting tps numbers roughly 60% of what was reported. Any idea on what might be going on?
Secondly, it seems like no matter the size of the image, the number of input tokens used for the image is the same. Is there a way to change the number of input tokens used for images?