Ucas-HaoranWei / GOT-OCR2.0

Official code implementation of General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model
5.7k stars, 472 forks

Nougat weights are CC-BY-NC #158

Open staghado opened 1 day ago

staghado commented 1 day ago

I don't see how the weights can be Apache-2.0 when some of the data used to train the model is CC-BY-NC (the Nougat subset, for example). Thanks for your clarification.

Ucas-HaoranWei commented 1 day ago

  1. We do not use the Nougat weights or data.
  2. Is Nougat's data open-source? To my knowledge, they have not released their data, so it would be impossible to use it.
  3. The data we use for training is completely different from Nougat's, and we process the Mathpix format.
  4. Nougat only inspired us to use arXiv's LaTeX format to process data.

staghado commented 23 hours ago

Thanks for the clarification. I was confused and thought Nougat was used for the annotation. It's clearer now. Great work!

On Sun 27 Oct 2024 at 01:59, WeiHaoran wrote:

  1. We do not use the Nougat weights or data.
  2. Is Nougat's data open-source? To my knowledge, they have not open-sourced the data, so it would be impossible to use it.
  3. We drew inspiration from Nougat's approach to processing LaTeX data and created our own data for training GOT.
  4. The data we use for training is completely different from Nougat's, and the format we process is the Mathpix format.
