About a GeoLayoutLM model with Chinese support

AlibabaResearch / AdvancedLiterateMachinery

A collection of original, innovative ideas and algorithms towards Advanced Literate Machinery. This project is maintained by the OCR Team in the Language Technology Lab, Tongyi Lab, Alibaba Group.

Apache License 2.0


About a GeoLayoutLM model with Chinese support #25

Closed. 24-solar-terms closed this issue 1 year ago

24-solar-terms commented 1 year ago

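Many thanks to the DAMO Academy Duguang (读光) team for this work. GeoLayoutLM is a great model, but it uses the BERT base tokenizer. Is there a pretrained model that supports Chinese, or will one be released in the future? @wdp-007 @alibaba-oss @congyao Thank you very much!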

luochuwei commented 1 year ago

@24-solar-terms A model that supports both Chinese and English is available on ModelScope: https://modelscope.cn/models/damo/multi-modal_convnext-roberta-base_vldoc-embedding/summary

24-solar-terms commented 1 year ago

OK, thank you very much!

malichen-cv commented 1 year ago

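> @24-solar-terms A model that supports both Chinese and English is available on ModelScope: https://modelscope.cn/models/damo/multi-modal_convnext-roberta-base_vldoc-embedding/summary

Thanks to the author for the reply. The open-source model for the FUNSD dataset uses the BERT base tokenizer. Which tokenizer does the Chinese-English model on ModelScope use?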

luochuwei commented 1 year ago

@malichen-cv The tokenizer is provided on ModelScope. It is the XLMRoberta tokenizer.
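To illustrate why tokenizer coverage matters for Chinese text, here is a minimal toy sketch. It is only an illustration under stated assumptions: the two vocabularies are made up, the tokenization is character-level, and none of it reproduces the real BERT base or XLMRoberta tokenizers (which use subword vocabularies). The helper names `build_vocab`, `encode`, and `unk_rate` are hypothetical.

```python
# Toy illustration only: these tiny vocabularies are invented for the example
# and are NOT the real BERT base or XLMRoberta vocabularies.
UNK = "[UNK]"


def build_vocab(chars):
    """Assign an integer id to each distinct character; id 0 is reserved for [UNK]."""
    vocab = {UNK: 0}
    for ch in chars:
        vocab.setdefault(ch, len(vocab))
    return vocab


def encode(text, vocab):
    """Map each non-space character to its id, falling back to the [UNK] id."""
    return [vocab.get(ch, vocab[UNK]) for ch in text if not ch.isspace()]


def unk_rate(text, vocab):
    """Fraction of encoded characters that fell back to [UNK]."""
    ids = encode(text, vocab)
    if not ids:
        return 0.0
    return sum(i == vocab[UNK] for i in ids) / len(ids)


# Hypothetical vocabularies: one covers Latin letters only,
# the other also covers a few Chinese characters.
english_only = build_vocab("abcdefghijklmnopqrstuvwxyz")
multilingual = build_vocab("abcdefghijklmnopqrstuvwxyz发票金额日期")

sample = "发票 金额 total"
print("english-only [UNK] rate:", unk_rate(sample, english_only))
print("multilingual [UNK] rate:", unk_rate(sample, multilingual))
```

With the Latin-only toy vocabulary every Chinese character collapses to the same `[UNK]` id, so the model cannot tell them apart, whereas the toy vocabulary with Chinese coverage keeps each character distinct.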

NextGuido commented 1 year ago

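@malichen-cv Hello, I have a new question. I have a batch of Chinese data that I need to train on. Based on the above, do I only need to change the tokenizer? Or do I also need to change the model backbone and model_ckpt?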