Ucas-HaoranWei / GOT-OCR2.0

Official code implementation of General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model

General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model

Haoran Wei*, Chenglong Liu*, Jinyue Chen, Jia Wang, Lingyu Kong, Yanming Xu, Zheng Ge, Liang Zhao, Jianjian Sun, Yuang Peng, Chunrui Han, Xiangyu Zhang

Release

Community contributions

We encourage everyone to develop GOT applications based on this repo. Thanks to the following contributors:

vllm reference ~ contributor: @Jay

onnx and mnn support ~ contributor: @BaofengZan

llama_cpp inference ~ contributor: @1694439208

Colab of GOT ~ contributor: @Zizhe Wang

CPU version of GOT ~ contributor: @ElvisClaros

Online demo ~ contributor: @Joseph Pollack

Docker & client demo ~ contributor: @QIN2DIM

GUI of GOT ~ contributor: @XJF2332

Contents


Towards OCR-2.0 via a Unified End-to-end Model


Install

  1. Our environment is CUDA 11.8 + torch 2.0.1.

  2. Clone this repository and navigate to the GOT folder

    git clone https://github.com/Ucas-HaoranWei/GOT-OCR2.0.git
    cd 'the GOT folder'
  3. Install Package

    conda create -n got python=3.10 -y
    conda activate got
    pip install -e .
  4. Install Flash-Attention

    pip install ninja
    pip install flash-attn --no-build-isolation
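
After the install steps, it can help to confirm which versions actually landed in the `got` environment and compare them with the ones listed in step 1. The following is a minimal sketch and not part of this repo: the function name `report_environment` is hypothetical, and it only reads installed package metadata for the distributions named in the pip commands above.

```python
from importlib import metadata


def report_environment(packages=("torch", "flash-attn", "ninja")):
    """Return a dict mapping each distribution name to its installed version, or None."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in the active environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in report_environment().items():
        print(f"{name}: {version or 'not installed'}")
```

Run it inside the activated conda environment; a `None` entry means the corresponding install step still needs to be done.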

GOT Weights

Demo

  1. plain text OCR:
    python3 GOT/demo/run_ocr_2.0.py  --model-name  /GOT_weights/  --image-file  /an/image/file.png  --type ocr
  2. formatted text OCR:
    python3 GOT/demo/run_ocr_2.0.py  --model-name  /GOT_weights/  --image-file  /an/image/file.png  --type format
  3. fine-grained OCR:
    python3 GOT/demo/run_ocr_2.0.py  --model-name  /GOT_weights/  --image-file  /an/image/file.png  --type format/ocr --box [x1,y1,x2,y2]
    python3 GOT/demo/run_ocr_2.0.py  --model-name  /GOT_weights/  --image-file  /an/image/file.png  --type format/ocr --color red/green/blue
  4. multi-crop OCR:
    python3 GOT/demo/run_ocr_2.0_crop.py  --model-name  /GOT_weights/ --image-file  /an/image/file.png 
  5. multi-page OCR (the image path contains multiple .png files):
    python3 GOT/demo/run_ocr_2.0_crop.py  --model-name  /GOT_weights/ --image-file  /images/path/  --multi-page
  6. render the formatted OCR results:
    python3 GOT/demo/run_ocr_2.0.py  --model-name  /GOT_weights/  --image-file  /an/image/file.png  --type format --render

    Note: The rendering results can be found in /results/demo.html. Please open the demo.html to see the results.
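
The demo modes above differ only in the script name and a handful of flags, so they can be assembled programmatically, for example when batch-processing many images. This is a rough sketch under stated assumptions: `build_demo_command` is a hypothetical helper, not something shipped in this repo, and it uses only the scripts and flags that appear in the commands above.

```python
import shlex


def build_demo_command(weights, image, ocr_type="ocr", box=None, color=None,
                       render=False, crop=False, multi_page=False):
    """Assemble one of the demo command lines listed above as an argv list."""
    if multi_page and not crop:
        # The examples above only show --multi-page with the crop script.
        raise ValueError("--multi-page is only shown with run_ocr_2.0_crop.py")
    script = "GOT/demo/run_ocr_2.0_crop.py" if crop else "GOT/demo/run_ocr_2.0.py"
    cmd = ["python3", script, "--model-name", weights, "--image-file", image]
    if not crop:
        # The crop script is shown above without --type, so only the plain script gets it.
        cmd += ["--type", ocr_type]
    if box is not None:
        # Fine-grained OCR by region: formatted as [x1,y1,x2,y2].
        cmd += ["--box", "[" + ",".join(str(v) for v in box) + "]"]
    if color is not None:
        # Fine-grained OCR by box color: red, green, or blue.
        cmd += ["--color", color]
    if render:
        cmd.append("--render")
    if multi_page:
        cmd.append("--multi-page")
    return cmd


if __name__ == "__main__":
    cmd = build_demo_command("/GOT_weights/", "/an/image/file.png", "format", render=True)
    print(shlex.join(cmd))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell-quoting issues with the bracketed `--box` value.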

Train

  1. A training sample can be found here. Note that the '<image>' in the 'conversations'-'human'-'value' is necessary!
  2. This codebase only supports post-training (stage-2/stage-3) upon our GOT weights.
  3. If you want to train from stage-1 described in our paper, you need this repo.
deepspeed   /GOT-OCR-2.0-master/GOT/train/train_GOT.py \
 --deepspeed /GOT-OCR-2.0-master/zero_config/zero2.json    --model_name_or_path /GOT_weights/ \
 --use_im_start_end True   \
 --bf16 True   \
 --gradient_accumulation_steps 2    \
 --evaluation_strategy "no"   \
 --save_strategy "steps"  \
 --save_steps 200   \
 --save_total_limit 1   \
 --weight_decay 0.    \
 --warmup_ratio 0.001     \
 --lr_scheduler_type "cosine"    \
 --logging_steps 1    \
 --tf32 True     \
 --model_max_length 8192    \
 --gradient_checkpointing True   \
 --dataloader_num_workers 8    \
 --report_to none  \
 --per_device_train_batch_size 2    \
 --num_train_epochs 1  \
 --learning_rate 2e-5   \
 --datasets pdf-ocr+scence \
 --output_dir /your/output/path

Note:

  1. Change the corresponding data information in constant.py.
  2. Change line 37 in conversation_dataset_qwen.py to your data_name.
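
When adapting the training command to different hardware, it is useful to know how many samples contribute to each optimizer step. Under standard data-parallel training this is the per-device batch size times the gradient accumulation steps times the number of GPUs. The sketch below works that out for the values in the command above; the GPU count of 8 is an illustrative assumption, since the command does not state one.

```python
def effective_batch_size(per_device_batch, grad_accum_steps, num_gpus):
    """Samples contributing to one optimizer step under data-parallel training."""
    return per_device_batch * grad_accum_steps * num_gpus


if __name__ == "__main__":
    # --per_device_train_batch_size 2, --gradient_accumulation_steps 2,
    # and an assumed 8 GPUs (not specified in the command above).
    print(effective_batch_size(2, 2, 8))  # 32
```

If you change the number of GPUs, adjusting `--gradient_accumulation_steps` to keep this product constant is one common way to keep the learning rate setting comparable.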

Fine-tune

Quick Fine-tune with ms-swift:

git clone https://github.com/modelscope/ms-swift.git
cd ms-swift
pip install -e '.[llm]'
# default: sft LLM & projector, freeze vision encoder
CUDA_VISIBLE_DEVICES=0 swift sft \
--model_type got-ocr2 \
--model_id_or_path stepfun-ai/GOT-OCR2_0 \
--sft_type lora \
--dataset latex-ocr-print#5000

# Deepspeed ZeRO2
NPROC_PER_NODE=4 \
CUDA_VISIBLE_DEVICES=0,1,2,3 swift sft \
--model_type got-ocr2 \
--model_id_or_path stepfun-ai/GOT-OCR2_0 \
--sft_type lora \
--dataset latex-ocr-print#5000 \
--deepspeed default-zero2

With your data:

--dataset train.jsonl
--val_dataset val.jsonl (optional)

Data format:

{"query": "<image>55555", "response": "66666", "images": ["image_path"]}
{"query": "<image><image>eeeee", "response": "fffff", "history": [], "images": ["image_path1", "image_path2"]}
{"query": "EEEEE", "response": "FFFFF", "history": [["query1", "response1"], ["query2", "response2"]]}

More details can be seen in ms-swift.
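
To prepare `train.jsonl` in the data format above, one record per line can be written with the standard `json` module. The following is a minimal sketch, not ms-swift code: `check_record` and `write_jsonl` are hypothetical helpers, and the rule that the number of `<image>` tags equals the number of image paths is an assumption inferred from the three example records.

```python
import json


def check_record(record):
    """Return a list of problems found in one training record (empty if none)."""
    problems = []
    for key in ("query", "response"):
        if not isinstance(record.get(key), str):
            problems.append(f"missing or non-string '{key}'")
    images = record.get("images", [])
    query = record.get("query", "")
    # Assumption: one <image> tag per entry in "images", as in the examples above.
    if isinstance(query, str) and query.count("<image>") != len(images):
        problems.append("number of <image> tags does not match len(images)")
    for turn in record.get("history", []):
        if len(turn) != 2:
            problems.append("each history entry should be a [query, response] pair")
    return problems


def write_jsonl(records, path):
    """Write records to path, one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Running `check_record` over every record before calling `write_jsonl` catches the most common mistake, a query whose `<image>` tags do not line up with its `images` list.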

Eval

  1. We use the Fox and OneChart benchmarks, and other benchmarks can be found in the weights download link.
  2. The eval code can be found in GOT/eval.
  3. You can use evaluate_GOT.py to run the eval. If you have 8 GPUs, --num-chunks can be set to 8.
    python3 GOT/eval/evaluate_GOT.py --model-name /GOT_weights/ --gtfile_path xxxx.json --image_path  /image/path/ --out_path /data/eval_results/GOT_mathpix_test/ --num-chunks 8 --datatype OCR
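
Setting `--num-chunks` to the number of GPUs suggests that the evaluation samples are divided into that many parts, one per GPU. The sketch below shows one way such a split could work; it is an illustration only and is not claimed to be what `evaluate_GOT.py` does internally. The name `split_into_chunks` is hypothetical.

```python
def split_into_chunks(items, num_chunks):
    """Split items into num_chunks contiguous parts whose sizes differ by at most one."""
    if num_chunks < 1:
        raise ValueError("num_chunks must be at least 1")
    base, extra = divmod(len(items), num_chunks)
    chunks = []
    start = 0
    for i in range(num_chunks):
        # The first `extra` chunks take one additional item each.
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


if __name__ == "__main__":
    samples = list(range(10))
    for gpu_id, chunk in enumerate(split_into_chunks(samples, 8)):
        print(gpu_id, chunk)
```

With an even split like this, no GPU processes more than one sample beyond any other, so total wall time is close to the single-GPU time divided by the chunk count.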

Contact

If you are interested in this work or have questions about the code or the paper, please join our WeChat group.

Note: All five WeChat groups are full; please join group 6.

Don't hesitate to contact me by email, weihaoran18@mails.ucas.ac.cn, if you have any questions.

Acknowledgement

Citation


@article{wei2024general,
  title={General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model},
  author={Wei, Haoran and Liu, Chenglong and Chen, Jinyue and Wang, Jia and Kong, Lingyu and Xu, Yanming and Ge, Zheng and Zhao, Liang and Sun, Jianjian and Peng, Yuang and others},
  journal={arXiv preprint arXiv:2409.01704},
  year={2024}
}
@article{wei2023vary,
  title={Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models},
  author={Wei, Haoran and Kong, Lingyu and Chen, Jinyue and Zhao, Liang and Ge, Zheng and Yang, Jinrong and Sun, Jianjian and Han, Chunrui and Zhang, Xiangyu},
  journal={arXiv preprint arXiv:2312.06109},
  year={2023}
}