Hyper-Pretrained Transformers (HPT) is a multimodal LLM framework from HyperGAI for training vision-language models that understand both textual and visual inputs. HPT achieves highly competitive results against state-of-the-art models on a variety of multimodal LLM benchmarks. This repository contains the open-source inference code to reproduce HPT's evaluation results on those benchmarks.
We release HPT 1.5 Edge, our latest open-source model tailored to edge devices. Despite its compact size (under 5B parameters), HPT 1.5 Edge demonstrates impressive capabilities while remaining extremely efficient. It is publicly available on Hugging Face and GitHub under the Apache 2.0 license.
pip install -r requirements.txt
pip install -e .
You can download the model weights from Hugging Face into your [Local Path] and set global_model_path
to [Local Path] in the model config file:
git lfs install
git clone https://huggingface.co/HyperGAI/HPT1_5-Edge [Local Path]
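As a hypothetical illustration (the exact keys and layout depend on the config files shipped in this repo, and the path below is a made-up example), pointing global_model_path at the cloned weights might look like:

```yaml
# Illustrative config fragment only -- check the repo's actual config file
# for the real field names and surrounding structure.
global_model_path: /data/models/HPT1_5-Edge   # the [Local Path] used in the git clone above
```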
You can also set other strategies in the config file that are different from our default settings.
After setting up the config file, launch the model demo for a quick trial:
python demo/demo.py --image_path [Image] --text [Text] --model [Config]
Example:
python demo/demo.py --image_path demo/einstein.jpg --text 'What is unusual about this image?' --model hpt-edge-1-5
Launch the model for evaluation:
torchrun --nproc-per-node=8 run.py --data [Dataset] --model [Config]
Example for HPT 1.5 Edge:
torchrun --nproc-per-node=8 run.py --data MMMU_DEV_VAL --model hpt-edge-1-5
HPT 1.5 Edge
Pretrained LLM: Phi-3-mini-4k-instruct
Pretrained Visual Encoder: siglip-so400m-patch14-384
HPT 1.5 Air
Pretrained LLM: Llama3-8B-Instruct
Pretrained Visual Encoder: siglip-so400m-patch14-384
HPT 1.0 Air
Pretrained LLM: Yi-6B-Chat
Pretrained Visual Encoder: clip-vit-large-patch14-336
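The pairings above follow the common vision-language recipe: a pretrained visual encoder produces patch features, a projector maps them into the LLM's embedding space, and the LLM attends over the joint visual-plus-text token sequence. The sketch below is illustrative only (not the official HPT code); the projector design and the hidden sizes (1152 for siglip-so400m-patch14-384, 3072 for Phi-3-mini-4k-instruct) are assumptions based on the published components.

```python
import torch
import torch.nn as nn

# Assumed hidden sizes of the published components (not taken from HPT code):
# siglip-so400m-patch14-384 -> 1152, Phi-3-mini-4k-instruct -> 3072.
VISION_DIM, LLM_DIM = 1152, 3072

class VisualProjector(nn.Module):
    """Illustrative MLP projector mapping visual patch features
    into the LLM's token-embedding space."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        return self.proj(patch_feats)

# Dummy stand-ins for the pretrained encoder/LLM outputs.
batch, num_patches, num_text_tokens = 1, 729, 16  # 729 ~ (384 // 14) ** 2 patches
patch_feats = torch.randn(batch, num_patches, VISION_DIM)   # from the visual encoder
text_embeds = torch.randn(batch, num_text_tokens, LLM_DIM)  # from the LLM embedding table

projector = VisualProjector(VISION_DIM, LLM_DIM)
visual_tokens = projector(patch_feats)

# The LLM then attends over the concatenated visual + text sequence.
inputs_embeds = torch.cat([visual_tokens, text_embeds], dim=1)
print(inputs_embeds.shape)
```

The design choice to keep both pretrained backbones frozen or lightly tuned while training the projector is what makes swapping components (e.g., Phi-3-mini vs. Llama3-8B) practical across the HPT variants listed above.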
Note that HPT Air is a quick open release of our models to facilitate open, responsible AI research and community development. It does not include any moderation mechanism and provides no guarantees on its results. We hope to engage with the community to make the model respect guardrails well enough for practical adoption in real-world applications that require moderated outputs.
This project is released under the Apache 2.0 license. Parts of this project contain code and models from other sources, which are subject to their respective licenses; you must comply with those licenses if you use them for commercial purposes.
The evaluation code for running this demo extends the VLMEvalKit project. We also thank OpenAI for open-sourcing their visual encoder models, and 01.AI, Meta, and Microsoft for open-sourcing their large language models.