Hi, thank you for your interest. We are currently busy iterating on DeepSeek-VL. The community has already started supporting DeepSeek-VL (#10). Have fun! https://github.com/modelscope/swift/blob/main/docs/source/Multi-Modal/deepseek-vl%E6%9C%80%E4%BD%B3%E5%AE%9E%E8%B7%B5.md
@RERV It seems that swift does not support fine-tuning of the vision encoder (at least from my quick glance over the source code; I hope I'm wrong).
Given that you are training DeepSeek-VL internally somehow, could you provide training code snippets so that the community can work on an LLM and vision encoder fine-tuning script?
Internally we train DeepSeek-VL with hai-llm (as mentioned in the paper), which is a closed-source training framework. We do hope to open source hai-llm someday, but that is a really big project, involving our training cluster configuration/management and other internal libraries. I'm afraid we don't have the bandwidth to clean up and open source the hai-llm core code right now.
@soloice Hi, I see, thanks. Would it be possible to release just the backprop code for the vision encoder, with no framework around it and no clustering, just a starting point for the community to build upon?
Well, I can describe how to do this briefly. Basically you don't need to write any backprop code, because torch will take care of everything. Just build the model; then setting the requires_grad attribute on the visual encoder's parameters will work:
for p in visual_encoder.parameters():
    p.requires_grad = True
What you really need to care about is the distributed strategy. If you are using DDP, or 3D parallelism with TP=1, the code above is all you need. If you are using 3D parallelism with TP>1, you will also need to average the gradient of the visual encoder across all TP ranks, with an NCCL call that looks like dist.all_reduce(p.grad, group=tensor_parallel_group) for each parameter of the visual encoder, to make sure all TP ranks end up with the same gradient.
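To make that concrete, here is a minimal sketch of the two steps described above: unfreeze the vision encoder, then (only when TP>1) average its gradients across tensor-parallel ranks after each backward pass. The names vl_model, vl_model.vision_model, and tensor_parallel_group are placeholder assumptions for your own model object and process group; this is not the internal hai-llm code.

```python
import torch.distributed as dist

# Assumption: `vl_model` is an already-built DeepSeek-VL-style model whose
# vision tower is reachable as `vl_model.vision_model`, and
# `tensor_parallel_group` is the process group spanning the TP ranks.
# Both names are placeholders for your own setup.

def unfreeze_vision_encoder(vl_model):
    # Step 1: let autograd compute gradients for the vision encoder.
    for p in vl_model.vision_model.parameters():
        p.requires_grad = True

def sync_vision_encoder_grads(vl_model, tensor_parallel_group):
    # Step 2 (only needed with TP > 1): the vision encoder is replicated on
    # every TP rank, but each rank sees different activations, so the local
    # gradients differ. Average them so every TP rank applies the same update.
    tp_world_size = dist.get_world_size(group=tensor_parallel_group)
    for p in vl_model.vision_model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM, group=tensor_parallel_group)
            p.grad.div_(tp_world_size)

# Schematic training step:
#   loss.backward()
#   sync_vision_encoder_grads(vl_model, tensor_parallel_group)  # if TP > 1
#   optimizer.step(); optimizer.zero_grad()
```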
@soloice Thank you very much for the information!
Given the PyTorch grad, how would you go about training? In our use case we need to add a bit of grounding by implementing a cursor as an output. What would that look like at a high and low level? Would training the whole LLM + vision encoder work like that, just providing cursor tokens in the dataset and letting it do the rest? Or does one have to train the vision encoder separately? Thank you for helping out.
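As a rough, framework-agnostic illustration of the "cursor tokens in the dataset" idea (not an official DeepSeek recipe): one common pattern is to discretize cursor coordinates into special tokens, add them to the tokenizer, and fine-tune the LLM (with the vision encoder unfrozen as above) end to end on sequences containing those tokens. The sketch below uses the standard Hugging Face transformers calls add_special_tokens and resize_token_embeddings; the names tokenizer, language_model, and the grid size are assumptions, and the exact attributes in DeepSeek-VL's wrapper may differ.

```python
def add_cursor_tokens(tokenizer, language_model, grid=32):
    """Mint one special token per screen-grid cell, e.g. <cursor_3_17> for
    row 3 / column 17, and grow the LM's embedding and output matrices so
    the new tokens get trainable rows.

    Assumptions: `tokenizer` is a standard Hugging Face tokenizer and
    `language_model` is the causal-LM backbone inside the multimodal wrapper.
    """
    cursor_tokens = [f"<cursor_{r}_{c}>" for r in range(grid) for c in range(grid)]
    tokenizer.add_special_tokens({"additional_special_tokens": cursor_tokens})
    language_model.resize_token_embeddings(len(tokenizer))
    return cursor_tokens

# Training samples then simply contain cursor tokens in the target text, e.g.
#   "Click the submit button. <cursor_21_7>"
# and the ordinary next-token cross-entropy loss teaches the model to emit them.
```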
https://github.com/modelscope/swift/issues/543
Jintao-Huang implemented it!
Congratulations to DeepSeek for the wonderful work. I wonder if there is a script for fine-tuning DeepSeek-VL? Thanks!