Open bruceisme opened 2 months ago
Hi, the statement in Appendix A means that we do not run an additional PaddleOCR detector during evaluation. For TextVQA, we keep the OCR tokens the same as in LLaVA. We would expect a worse result without the original OCR tokens.
In Appendix A, under Image-text Data Collection, the paper says: "It is important to note that the OCR detector is utilized solely for generating enriched data and is not employed during testing". But the TextVQA script uses llava_textvqa_val_v051_ocr.jsonl, which contains OCR tokens:

![5a6ce66bec9d6006880fe0724c32204](https://github.com/dvlab-research/MGM/assets/49301955/2d38ee7f-c4c2-46e2-a285-69c85a3610f6)

So have you ever tested a version without OCR on TextVQA, and was it worse than with llava_textvqa_val_v051_ocr.jsonl? Can we take this to mean that the model gets better results with OCR input?
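For anyone who wants to run the no-OCR comparison themselves, here is a minimal sketch of building a question file with the OCR tokens removed. It assumes each line of llava_textvqa_val_v051_ocr.jsonl is a JSON object whose "text" field holds the prompt, with the OCR tokens on their own line starting with "Reference OCR token:". That layout is an assumption and should be checked against the actual file. The function names are hypothetical and not part of the MGM or LLaVA code.

```python
import json

# Assumed marker for the OCR line inside each prompt; verify against the real file.
OCR_PREFIX = "Reference OCR token:"


def strip_ocr_tokens(prompt: str) -> str:
    """Drop any prompt line that starts with the assumed OCR marker."""
    kept = [
        line for line in prompt.split("\n")
        if not line.strip().startswith(OCR_PREFIX)
    ]
    return "\n".join(kept)


def make_no_ocr_records(jsonl_lines):
    """Parse JSONL lines and return records whose "text" has no OCR line."""
    records = []
    for raw in jsonl_lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines
        record = json.loads(raw)
        record["text"] = strip_ocr_tokens(record["text"])
        records.append(record)
    return records
```

Writing each returned record back out with `json.dumps`, one per line, would give a question file that can be passed to the same evaluation script in place of the OCR version, so the two accuracies can be compared directly.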