OpenPecha / Toolkit

🛠 Tools to create, edit and export texts and annotations
https://toolkit.openpecha.org
Apache License 2.0

add OCR information to opf #178

Closed eroux closed 2 years ago

eroux commented 2 years ago

In addition to #162 and #164, there are two points in https://github.com/OpenPecha/OCR-Helper-Scripts/issues/6 that are probably best tracked in their own issue. The idea is to add some information about the OCR to the opf, so that we know which version of the OCR of a set of scans was used to create an opf (I don't think we can tell at the moment).

When the OCR is imported from, say, s3://ocr.bdrc.io/Works/83/W2PD17457/google_books/batch_2022/, there's an info.json that contains at least:

{
  "timestamp": "1977-04-22T06:00:00Z"
}

but possibly other things, so let's say it has other properties like:


{
  "timestamp": "1977-04-22T06:00:00Z",
  "prop1": "value 1",
  "prop2": "value 2"
}

Then we should add the following to the meta.yml:

ocr_import_info:
   source: bdrc
   software: google_books
   batch: batch_2022
   ocr_info:
      timestamp: 1977-04-22T06:00:00Z
      prop1: value 1
      prop2: value 2
   parser: https://github.com/OpenPecha-dev/openpecha-toolkit/blob/231bba39dd1ba393320de82d4d08a604aabe80fc/openpecha/formatters/google_orc.py

(I've put the parser there; I think it fits well.)
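
To make the mapping concrete, here is a rough Python sketch of how the `ocr_import_info` block could be assembled from the import location and the contents of info.json. The function name `build_ocr_import_info` is hypothetical and not part of the toolkit. It also assumes that the last two path components of the import URL are always the software and the batch (as in the example above), and that `source` is supplied by the caller rather than derived from the bucket.

```python
import json
from urllib.parse import urlparse


def build_ocr_import_info(import_url, info_json_text, parser_url, source="bdrc"):
    """Assemble the proposed ocr_import_info mapping (hypothetical helper).

    Assumption: import_url ends with .../<software>/<batch>/ as in the
    example s3://ocr.bdrc.io/Works/83/W2PD17457/google_books/batch_2022/.
    """
    # Split the URL path into its non-empty components.
    parts = [p for p in urlparse(import_url).path.split("/") if p]
    if len(parts) < 2:
        raise ValueError(f"cannot find software/batch in {import_url!r}")
    software, batch = parts[-2], parts[-1]

    return {
        "source": source,        # assumption: passed in by the caller
        "software": software,
        "batch": batch,
        # Copy every property of info.json as-is, not just the timestamp.
        "ocr_info": json.loads(info_json_text),
        "parser": parser_url,
    }


if __name__ == "__main__":
    info = build_ocr_import_info(
        "s3://ocr.bdrc.io/Works/83/W2PD17457/google_books/batch_2022/",
        '{"timestamp": "1977-04-22T06:00:00Z", "prop1": "value 1"}',
        "https://github.com/OpenPecha-dev/openpecha-toolkit/blob/231bba39dd1ba393320de82d4d08a604aabe80fc/openpecha/formatters/google_orc.py",
    )
    # This dict would then be stored under the ocr_import_info key of meta.yml.
    print(json.dumps({"ocr_import_info": info}, indent=2))
```

Copying info.json wholesale into `ocr_info` means any extra properties added later end up in the meta.yml without changes to the import code.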