clovaai / donut

Official Implementation of OCR-free Document Understanding Transformer (Donut) and Synthetic Document Generator (SynthDoG), ECCV 2022
https://arxiv.org/abs/2111.15664
MIT License

More languages support? #217

Open zhangluustb opened 1 year ago

zhangluustb commented 1 year ago

First, thank you for open-sourcing this repo.

To support more languages, could you please provide a modified script that uses the mbart50 decoder without pruning?

I had some difficulty when modifying it myself...

All the best wishes to you!

zhangluustb commented 1 year ago

For now I am using the code below, but the resulting BART model is huge:

# self.tokenizer = XLMRobertaTokenizer.from_pretrained(
#     "hyunwoongko/asian-bart-ecjk" if not name_or_path else name_or_path
# )
self.tokenizer = AutoTokenizer.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")

self.model = MBartForCausalLM.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")

# MBartForCausalLM(
#     config=MBartConfig(
#         is_decoder=True,
#         is_encoder_decoder=False,
#         add_cross_attention=True,
#         decoder_layers=self.decoder_layer,
#         max_position_embeddings=self.max_position_embeddings,
#         vocab_size=len(self.tokenizer),
#         scale_embedding=True,
#         add_final_layer_norm=True,
#     )
# )
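
A rough way to see where the size comes from is to count decoder parameters by hand. The sketch below is only an estimate under stated assumptions: the figures (a vocabulary of 250054 tokens, hidden size 1024, feed-forward size 4096, 12 decoder layers) are what I believe the public mbart-large-50 config contains and should be checked against the checkpoint's `config.json`. The function names are hypothetical, positional embeddings and the embedding/final layer norms are ignored, and the output projection is assumed to share weights with the token embedding.

```python
# Back-of-the-envelope parameter count for an MBart-style decoder.
# All sizes passed in below are assumptions, not values read from a checkpoint.

def embedding_params(vocab_size: int, d_model: int) -> int:
    """Token embedding matrix (assumed tied with the output projection)."""
    return vocab_size * d_model


def decoder_layer_params(d_model: int, ffn_dim: int) -> int:
    """One decoder layer: self-attention, cross-attention, feed-forward, layer norms."""
    # Each attention block has q, k, v and output projections, all with biases.
    attention = 4 * (d_model * d_model + d_model)
    # Two linear layers with biases: d_model -> ffn_dim -> d_model.
    feed_forward = (d_model * ffn_dim + ffn_dim) + (ffn_dim * d_model + d_model)
    # Three layer norms per layer, each with a weight and a bias vector.
    layer_norms = 3 * 2 * d_model
    return 2 * attention + feed_forward + layer_norms


def estimate_decoder_params(vocab_size: int, d_model: int, ffn_dim: int, num_layers: int) -> dict:
    """Split the estimate into embedding and layer parameters."""
    embedding = embedding_params(vocab_size, d_model)
    layers = num_layers * decoder_layer_params(d_model, ffn_dim)
    return {"embedding": embedding, "layers": layers, "total": embedding + layers}


if __name__ == "__main__":
    # 12 layers: assumed depth of the pretrained decoder.
    # 4 layers: an illustrative smaller value for decoder_layers in the config.
    for num_layers in (12, 4):
        est = estimate_decoder_params(250054, 1024, 4096, num_layers)
        share = est["embedding"] / est["total"]
        print(f"{num_layers} layers: ~{est['total'] / 1e6:.0f}M parameters, "
              f"{share:.0%} of them in the token embedding")
```

If these assumptions hold, the unpruned vocabulary accounts for most of the size regardless of how many layers are kept. So building `MBartForCausalLM` from an `MBartConfig` with a smaller `decoder_layers` (as in the commented-out block) reduces the layer cost, but it does not shrink the embedding.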