LukeForeverYoung / UReader

Training and benchmark results on V100 #10

Open · yuyq96 opened this issue 7 months ago

yuyq96 commented 7 months ago

Thank you for open-sourcing the data and code for UReader. I trained the model with scripts/train_it_v100.sh, but I was unable to reproduce the benchmark results.

Pretrained checkpoint: MAGAer13/mplug-owl-llama-7b

Training loss curve: (see the attached train_loss plot)

Benchmark results:

|  | DocVQA | InfoVQA | DeepForm | KLC | WTQ | TabFact | ChartQA |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Official | 65.4 | 42.2 | 49.5 | 32.8 | 29.4 | 67.6 | 59.3 |
| Replication on V100 | 50.6 | 32.1 | 21.8 | 28.1 | 19.5 | 64.2 | 46.0 |

I noticed that the micro batch size settings differ between the A100 and V100 scripts, which leads to a different reduced loss and might affect training (see the sketch at the end of this comment). Other differences between the script and paper include:

@LukeForeverYoung Have you tried completing the training on V100? Could you please verify the loss curve and these results? Thanks!
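
For reference, here is a minimal sketch of why the reduced loss can depend on the micro batch size. Everything in it is made up for illustration (the token losses, token counts, and micro batch sizes are not taken from the UReader scripts): when the loss is a token-level mean within each micro batch and gradient accumulation then averages over micro batches, grouping the same samples into different micro batch sizes yields a different effective objective.

```python
import torch

# Hypothetical per-sample summed token losses and token counts for one
# global batch of 8 samples (illustrative numbers only, not UReader data).
tok_loss = torch.tensor([12.0, 3.0, 40.0, 2.0, 21.0, 1.0, 9.0, 14.0])
tok_cnt  = torch.tensor([ 6.0, 3.0, 20.0, 4.0,  7.0, 2.0, 6.0,  7.0])

def reduced_loss(micro_bs):
    # Token-mean loss within each micro batch, then a plain mean over
    # micro batches -- the usual gradient-accumulation reduction.
    losses = tok_loss.split(micro_bs)
    counts = tok_cnt.split(micro_bs)
    per_micro = [l.sum() / c.sum() for l, c in zip(losses, counts)]
    return torch.stack(per_micro).mean()

print(reduced_loss(4))                  # larger micro batches  -> ~1.886
print(reduced_loss(2))                  # smaller micro batches -> ~1.908
print(tok_loss.sum() / tok_cnt.sum())   # single global token mean -> ~1.855
```

So even with identical data and the same global batch size, the logged loss curve (and the effective weighting of samples in the gradient) can drift apart between the A100 and V100 configurations.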

bellos1203 commented 2 months ago

Hey @yuyq96, I got similar results. Have you resolved the issue?

yuyq96 commented 2 months ago

> Hey @yuyq96, I got similar results. Have you resolved the issue?

Unfortunately, I wasn't able to replicate the results on a V100 using the official settings, and I don't have access to an A100 either. We have since been experimenting with quite different training settings for our own model.