[Hackathon 7th] Fundable project 5. PaddlePaddle reproduction of a cutting-edge document multimodal large model

PaddlePaddle / PaddleMIX

Paddle Multimodal Integration and eXploration, supporting mainstream multi-modal tasks, including end-to-end large-scale multi-modal pretrain models and diffusion model toolbox. Equipped with high performance and flexibility.

Apache License 2.0

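GOT-OCR2.0 is a multimodal large model dedicated to general OCR tasks, released by StepFun and the University of Chinese Academy of Sciences. It has 0.6B parameters and uses a vision encoder + input embedding layer + decoder pipeline. We need to keep up with and enrich the cross-modal text-image models in PaddleMIX, improving them in terms of model, training, and inference.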
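
To make the three-stage layout concrete, here is a minimal, shape-level sketch in plain Python. It is not the GOT-OCR2.0 or PaddleMIX implementation: the class `ToyPipeline`, the helper functions, all dimensions, and the choice to place image tokens before prompt tokens are hypothetical, and the decoder itself is left out. It only shows how a vision encoder output could pass through an input embedding layer to form the sequence a decoder would consume.

```python
import random

# Hypothetical toy sizes; the real model's dimensions are not given in this issue.
PATCH_DIM, VISION_DIM, DECODER_DIM = 4, 8, 6


def make_matrix(rows, cols, rng):
    """Build a rows x cols matrix of random weights as a list of lists."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def matmul(x, w):
    """Multiply an (n, d_in) list of rows by a (d_in, d_out) matrix."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


class ToyPipeline:
    """Stand-in for vision encoder + input embedding layer (decoder omitted)."""

    def __init__(self, seed=0):
        rng = random.Random(seed)
        self.vision_w = make_matrix(PATCH_DIM, VISION_DIM, rng)   # stand-in vision encoder
        self.embed_w = make_matrix(VISION_DIM, DECODER_DIM, rng)  # stand-in input embedding layer
        self.token_table = make_matrix(10, DECODER_DIM, rng)      # stand-in text embedding table

    def build_decoder_input(self, patches, prompt_ids):
        # 1) The vision encoder turns image patches into visual features.
        image_features = matmul(patches, self.vision_w)
        # 2) The input embedding layer projects them to the decoder width.
        image_tokens = matmul(image_features, self.embed_w)
        # 3) Prompt tokens are embedded and appended (ordering is an assumption).
        text_tokens = [self.token_table[i] for i in prompt_ids]
        return image_tokens + text_tokens


pipeline = ToyPipeline()
patches = [[0.1 * (i + j) for j in range(PATCH_DIM)] for i in range(3)]
seq = pipeline.build_decoder_input(patches, prompt_ids=[1, 2])
print(len(seq), len(seq[0]))  # 5 6
```

In this sketch, three image patches and two prompt tokens yield a five-element sequence at the decoder width, which is what a causal language model decoder would then process.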

Task	Details	Status
Reproduce the GOT-OCR2.0 base model, mainly the base components it depends on	BlipImageEvalProcessor	done
	ImageEncoderViT	done
	GOTQwenModel	done
	GOTQwenForCausalLM	done
Build the GOT-OCR2.0 inference pipeline	got_ocr2_0_infer	done
Provide the corresponding Paddle model weights	model.safetensors	done
Support and align post-training for GOT-OCR2.0	TBD	---

[Hackathon 7th] Fundable project 5. PaddlePaddle reproduction of a cutting-edge document multimodal large model #833