deepseek-ai / DeepSeek-VL

DeepSeek-VL: Towards Real-World Vision-Language Understanding
https://huggingface.co/spaces/deepseek-ai/DeepSeek-VL-7B
MIT License

Can you address the work you have referenced? #19

Closed lucasjinreal closed 5 months ago

lucasjinreal commented 5 months ago

For instance:

  1. The SAM pretrained model isn't listed or cited in the README;
  2. the overall architecture looks very similar to Vary; please cite it, and could you describe any modifications you made and compare against the original work?

Thanks.

RERV commented 5 months ago

Hi, thanks for your interest.

  1. DeepSeek-VL uses the SAM model pretrained by Meta AI as-is, as we mention and cite in our paper.
  2. We have cited Vary, and have also discussed it closely with Haoran Wei, the first author of Vary.
lucasjinreal commented 5 months ago

It would be better to include these citations in the README.

The insights behind the modifications would be beneficial to the community, as I noticed there are some differences between Vary and DeepSeek-VL.

Since the whole codebase is not really open-sourced, and DeepSeek has actually always been at the frontier of Chinese open source, you could discuss these points more publicly for others to learn from. This information might be crucial for users of DeepSeek, beyond just releasing a model.

LingyvKong commented 5 months ago

Agree with @lucasjinreal , looking forward to seeing the comparison results with Vary.

RERV commented 5 months ago

Thanks for your interest. It is important to note that our choice of vision encoder for DeepSeek-VL was not based on Vary, and the motivations behind the vision encoders of Vary and DeepSeek-VL are quite different.

In summary, the motivations and the specific initialization parameters of the vision encoder differ between Vary and DeepSeek-VL, as do their training strategies. The two are similar in using a hybrid encoder with SAM as the second encoder, as we cited in our report. However, the hybrid-encoder concept itself is common (it is not Vary's core contribution), and the SAM architecture is not unique to Vary's work. Therefore, we did not elaborate extensively on the differences between the two in the paper or the GitHub repository.
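As a rough illustration of the hybrid-encoder idea discussed above (this is a minimal sketch, not DeepSeek-VL's or Vary's actual code; the module names, dimensions, and the simple concatenate-then-project fusion are all assumptions), two vision backbones can be run on the same image and their patch features fused into one token sequence:

```python
import torch
import torch.nn as nn

class HybridVisionEncoder(nn.Module):
    """Illustrative hybrid encoder: fuse patch features from two vision
    backbones. In a real system encoder_a/encoder_b would be pretrained
    models (e.g. a CLIP-style encoder and a SAM-style encoder), typically
    frozen; here they are stand-in patch-embedding convolutions."""

    def __init__(self, dim_a=64, dim_b=96, out_dim=128):
        super().__init__()
        # 16x16 patch embeddings as stand-ins for the two backbones.
        self.encoder_a = nn.Conv2d(3, dim_a, kernel_size=16, stride=16)
        self.encoder_b = nn.Conv2d(3, dim_b, kernel_size=16, stride=16)
        # Simple fusion: concatenate along channels, then project.
        self.proj = nn.Linear(dim_a + dim_b, out_dim)

    def forward(self, image):
        # (B, C, H/16, W/16) -> (B, N, C) token sequences
        fa = self.encoder_a(image).flatten(2).transpose(1, 2)
        fb = self.encoder_b(image).flatten(2).transpose(1, 2)
        # Fused visual tokens fed to the language model's projector.
        return self.proj(torch.cat([fa, fb], dim=-1))

image = torch.randn(1, 3, 224, 224)
tokens = HybridVisionEncoder()(image)
print(tuple(tokens.shape))  # (1, 196, 128): 14x14 patches, fused dim 128
```

The fusion step is where designs diverge in practice (concatenation, addition, or resampling one branch to the other's resolution), which is one reason two hybrid encoders can share the same high-level diagram yet behave differently.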

lucasjinreal commented 5 months ago

From the code I can see that the way SAM is used actually differs a bit from Vary, though from the hybrid-encoder perspective they can be treated as the same.

But here is the question: I tested DeepSeek-VL, and its OCR ability is weak. Why not adopt Vary's training strategy instead of combining the encoders directly? Is it better? (Judging from the results, clearly it is not.)

Also, your version of SAM has a high-level feature skip connection to the output. Is it better than Vary's approach? (I didn't see any ablation study on this part in the paper.)
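For readers unfamiliar with the term, the "high-level feature skip" being asked about can be sketched as follows (a hypothetical toy encoder, not the repository's actual SAM code; the layer names and dimensions are made up): an intermediate feature map is projected and added to the final encoder output instead of being discarded.

```python
import torch
import torch.nn as nn

class EncoderWithFeatureSkip(nn.Module):
    """Toy sketch of a high-level feature skip: an intermediate feature
    map 'jumps' past the last block and is added to the final output.
    Purely illustrative; not DeepSeek-VL's actual SAM modification."""

    def __init__(self, dim=32):
        super().__init__()
        self.block1 = nn.Conv2d(3, dim, kernel_size=3, padding=1)
        self.block2 = nn.Conv2d(dim, dim, kernel_size=3, padding=1)
        self.block3 = nn.Conv2d(dim, dim, kernel_size=3, padding=1)
        # 1x1 projection so the skipped feature matches the output.
        self.skip_proj = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x):
        h1 = torch.relu(self.block1(x))
        h2 = torch.relu(self.block2(h1))   # high-level intermediate feature
        out = self.block3(h2)
        return out + self.skip_proj(h2)    # skip: h2 jumps to the output

x = torch.randn(1, 3, 64, 64)
y = EncoderWithFeatureSkip()(x)
print(tuple(y.shape))  # (1, 32, 64, 64)
```

Whether such a skip helps downstream tasks like OCR is exactly the kind of question an ablation would answer, which is the point being raised here.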