Hi,
I read your report, and I think the pipeline is very similar to Vary. I have a question:
Does the SAM branch use the Vary initialization for dense OCR? In my experiments, the vision latents output by the original SAM are noisy for a text-latent-based LLM.