deepseek-ai / DeepSeek-VL

DeepSeek-VL: Towards Real-World Vision-Language Understanding
https://huggingface.co/spaces/deepseek-ai/DeepSeek-VL-7B
MIT License

Fine-tuning Script #6

TechxGenus opened this issue 8 months ago

TechxGenus commented 8 months ago

Congratulations to DeepSeek for the wonderful work. I wonder if there is a script for fine-tuning DeepSeek-VL? Thanks!

RERV commented 8 months ago

Hi, thank you for your interest. We are currently busy iterating on DeepSeek-VL. The community has already started supporting DeepSeek-VL (#10). Have fun! https://github.com/modelscope/swift/blob/main/docs/source/Multi-Modal/deepseek-vl%E6%9C%80%E4%BD%B3%E5%AE%9E%E8%B7%B5.md

SinanAkkoyun commented 8 months ago

@RERV It seems that swift does not support fine-tuning the vision encoder (at least from my quick glance over the source code; I hope I'm wrong).

Given that you are training DeepSeek-VL internally somehow, could you provide training code snippets so that the community can work on an LLM and vision-encoder fine-tuning script?

soloice commented 8 months ago

Internally we train DeepSeek-VL with hai-llm (as mentioned in the paper), which is a closed-source training framework. We do hope to open source hai-llm someday, but that is a really big project involving our training cluster configuration/management and other internal libraries. I'm afraid we don't have the bandwidth to clean up and open source the hai-llm core code right now.

SinanAkkoyun commented 8 months ago

@soloice Hi, I see, thanks. Would it be possible to just release the backprop code for the vision encoder, with no framework around it and no clustering, just a starting point for the community to build on?

soloice commented 8 months ago

Well, I can describe briefly how to do this. Basically, you don't need to write any backprop code, because torch takes care of that. Just build the model; then setting the requires_grad attribute on the vision encoder's parameters will work:

# Make every vision-encoder parameter trainable so autograd computes its gradients.
for p in visual_encoder.parameters():
    p.requires_grad = True
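
For concreteness, here is a minimal sketch of how that could look for the released checkpoint. The loading lines follow the repo README; the vision_model / aligner / language_model attribute names are assumptions about the MultiModalityCausalLM class, so verify them against the actual model code:

from transformers import AutoModelForCausalLM
from deepseek_vl.models import MultiModalityCausalLM  # model class provided by this repo

model_path = "deepseek-ai/deepseek-vl-7b-chat"
vl_gpt: MultiModalityCausalLM = AutoModelForCausalLM.from_pretrained(
    model_path, trust_remote_code=True
)

# Unfreeze the vision encoder and aligner alongside the language model.
# The attribute names below are assumptions; adjust them to the model definition.
for module in (vl_gpt.vision_model, vl_gpt.aligner, vl_gpt.language_model):
    for p in module.parameters():
        p.requires_grad = True

# Sanity check: count the parameters that will receive gradients.
trainable = sum(p.numel() for p in vl_gpt.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable / 1e9:.2f}B")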

What you really need to care about is the distributed strategy. If you are using DDP, or 3D parallelism with TP=1, the code above is all you need. If you are using 3D parallelism with TP>1, you will also need to average the vision-encoder gradients across all TP ranks, with an NCCL call that looks like dist.all_reduce(p.grad, group=tensor_parallel_group) for each parameter of the vision encoder, to make sure all TP ranks end up with the same gradient.
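
As a rough sketch of that TP>1 case, a hypothetical helper like the one below could be called after each backward pass (the name sync_visual_encoder_grads and the tensor_parallel_group argument are placeholders, not code from hai-llm or DeepSeek-VL):

import torch.distributed as dist

def sync_visual_encoder_grads(visual_encoder, tensor_parallel_group):
    # Average the vision-encoder gradients across all tensor-parallel ranks
    # so that every TP rank applies the same update.
    tp_world_size = dist.get_world_size(group=tensor_parallel_group)
    for p in visual_encoder.parameters():
        if p.grad is None:
            continue
        dist.all_reduce(p.grad, op=dist.ReduceOp.SUM, group=tensor_parallel_group)
        p.grad.div_(tp_world_size)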

SinanAkkoyun commented 8 months ago

@soloice Thank you very much for the information!

Given the PyTorch gradients, how would you go about training? In our use case we need to add a bit of grounding by implementing a cursor as an output. What would that look like at a high and low level? Would training the whole LLM + vision encoder work like that, i.e. just providing cursor tokens in the dataset and letting the model do the rest? Or does one have to train the vision encoder separately? Thank you for helping out.

SinanAkkoyun commented 8 months ago

https://github.com/modelscope/swift/issues/543

Jintao-Huang implemented it!