OCR-D / core

Collection of OCR-related python tools and wrappers from @OCR-D
https://ocr-d.de/core/
Apache License 2.0

checksum and/or file size of models in .PAGE.xml #1183

Open jbarth-ubhd opened 9 months ago

jbarth-ubhd commented 9 months ago

For reproducibility, it would be nice to have the checksum and/or file size of the models used recorded in the XML.

bertsky commented 9 months ago

You mean as in

    <mets:agent TYPE="OTHER" OTHERTYPE="SOFTWARE" ROLE="OTHER" OTHERROLE="layout/segmentation/region">
      <mets:name>ocrd-tesserocr-recognize v0.17.0 (tesseract 5.3.1-25-gcf23)</mets:name>
      <mets:note xmlns:ocrd="https://ocr-d.de" ocrd:option="input-file-grp">OCR-D-BIN</mets:note>
      <mets:note xmlns:ocrd="https://ocr-d.de" ocrd:option="output-file-grp">OCR-D-BIN-OCR-TESS-frak2021</mets:note>
      <mets:note xmlns:ocrd="https://ocr-d.de" ocrd:option="parameter">{"model": "frak2021", "dpi": 0, "padding": 0, "segmentation_level": "word", "textequiv_level": "word", "overwrite_segments": false, "overwrite_text": true, "shrink_polygons": false, "block_polygons": false, "find_tables": true, "find_staves": false, "sparse_text": false, "raw_lines": false, "char_whitelist": "", "char_blacklist": "", "char_unblacklist": "", "tesseract_parameters": {}, "xpath_parameters": {}, "xpath_model": {}, "auto_model": false, "oem": "DEFAULT"}</mets:note>
      <mets:note xmlns:ocrd="https://ocr-d.de" ocrd:cksum="1509050540 3421140"/>
      <mets:note xmlns:ocrd="https://ocr-d.de" ocrd:option="page-id"/>
    </mets:agent>

@jbarth-ubhd, or did you mean the PAGE XML?

In METS, we could also use some information on processing dates, e.g. mets:agent/mets:note/@ocrd:date (of xsd:dateTime). What do you think @kba?
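
A minimal sketch of what producing such a timestamp note might look like, using only the Python standard library. The `ocrd:date` attribute is the proposal from above, not an existing convention, and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

METS_NS = "http://www.loc.gov/METS/"
OCRD_NS = "https://ocr-d.de"


def xsd_datetime_now():
    """Current UTC time as an xsd:dateTime string, e.g. 2024-01-02T03:04:05+00:00."""
    return datetime.now(timezone.utc).replace(microsecond=0).isoformat()


def make_date_note(timestamp):
    """Build a mets:note carrying the proposed (hypothetical) ocrd:date attribute."""
    return ET.Element(f"{{{METS_NS}}}note", {f"{{{OCRD_NS}}}date": timestamp})


if __name__ == "__main__":
    ET.register_namespace("mets", METS_NS)
    ET.register_namespace("ocrd", OCRD_NS)
    print(ET.tostring(make_date_note(xsd_datetime_now()), encoding="unicode"))
```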

kba commented 9 months ago

That's a great idea!

Incidentally, we're in the process of dealing with the reality of mass OCR, i.e. what to throw away to keep the amount of data manageable while still retaining as much reproducibility information as possible. This would help.

The tricky part is how and what to hash.

A simple solution would be to assume the checksum is related to the raw data that ocrd resmgr retrieves, i.e. the models or zipped models as they are downloaded via HTTP. We could add the checksum to the resources section of the ocrd-tool.json schema (and therefore the ocrd resmgr schema).

A helpful side effect would be that we notice when models are updated at the same URL (e.g. the messy situation with eynollah currently).
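
As a rough sketch of that idea: once a resource entry carries an expected checksum, a downloaded file can be compared against it, which would also flag a model that changed behind an unchanged URL. The `sha256` key, the choice of SHA-256, and the function names are assumptions for illustration, not part of the current ocrd-tool.json schema.

```python
import hashlib
import os


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large models need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_resource(entry, path):
    """Compare a downloaded file against a resource entry.

    `entry` is a dict with the hypothetical keys 'size' and 'sha256'.
    Returns a list of human-readable problems (empty if everything matches).
    """
    problems = []
    actual_size = os.path.getsize(path)
    if "size" in entry and entry["size"] != actual_size:
        problems.append(f"size mismatch: expected {entry['size']}, got {actual_size}")
    actual_hash = sha256_of_file(path)
    if "sha256" in entry and entry["sha256"] != actual_hash:
        problems.append(f"checksum mismatch: expected {entry['sha256']}, got {actual_hash}")
    return problems
```

A non-empty result for a freshly downloaded file would indicate that the content behind the URL no longer matches what the resource list describes.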

kba commented 9 months ago

> In METS, we could also use some information on processing dates, e.g. mets:agent/mets:note/@ocrd:date (of xsd:dateTime).

Indeed, I will need to improve the page-to-alto conversion soon-ish and find a better solution for dates (I haven't forgotten about the feedback on https://github.com/kba/page-to-alto/pull/37 btw) and other metadata. If we had more granular and easier to interpret date info that would help a lot.

jbarth-ubhd commented 9 months ago

@bertsky: I don't have cksum in mets.xml (and not in OCR-D-OCR_00001.xml) (installed ocrd/all docker a few weeks ago):

    <mets:agent TYPE="OTHER" OTHERTYPE="SOFTWARE" ROLE="OTHER" OTHERROLE="layout/segmentation/region">
      <mets:name>ocrd-tesserocr-recognize v0.17.0 (tesseract 5.3.3)</mets:name>
      <mets:note xmlns:ocrd="https://ocr-d.de" ocrd:option="input-file-grp">OCR-D-005</mets:note>
      <mets:note xmlns:ocrd="https://ocr-d.de" ocrd:option="output-file-grp">OCR-D-OCR</mets:note>
      <mets:note xmlns:ocrd="https://ocr-d.de" ocrd:option="parameter">{"textequiv_level": "word", "segmentation_level": "region", "overwrite_segments": true, "model": "frak2021", "dpi": 0, "padding": 0, "overwrite_text": true, "shrink_polygons": false, "block_polygons": false, "find_tables": true, "find_staves": false, "sparse_text": false, "raw_lines": false, "char_whitelist": "", "char_blacklist": "", "char_unblacklist": "", "tesseract_parameters": {}, "xpath_parameters": {}, "xpath_model": {}, "auto_model": false, "oem": "DEFAULT"}</mets:note>
      <mets:note xmlns:ocrd="https://ocr-d.de" ocrd:option="page-id"/>
    </mets:agent>

bertsky commented 9 months ago

> A simple solution would be to assume the checksum is related to the raw data that ocrd resmgr retrieves, i.e. the models or zipped models as they are downloaded via HTTP. We could add the checksum to the resources section of the ocrd-tool.json schema (and therefore the ocrd resmgr schema).

I don't understand – wouldn't that be the repository side (ocrd-tool.json), rather than the user side (resources.yml)?

We could certainly have resmgr store that information, but what about manual (cp) or existing installations?

I would rather like the processor to look at the file exactly when it is used, i.e. during resolve_resource. Since we have that as a method of the Processor class, how about a little side effect: determining the checksum of the retrieved file and storing it in a hidden attribute of the processor instance, like say self._resources? Then our run_processor could automatically add the checksum info during its workspace.mets.add_agent call – no further code changes required!
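
A minimal sketch of that side effect, under stated assumptions: `SketchProcessor` is a stand-in and not the real `Processor` class, the search-directory lookup is simplified, and `describe_file` and `agent_notes` are hypothetical helpers. The point is only that resolving a resource can record what was actually used, so that a wrapper like `run_processor` could later turn the record into notes for the METS agent.

```python
import hashlib
import os


def describe_file(path, chunk_size=1 << 20):
    """Return size and SHA-256 of the file exactly as it exists on disk now."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return {"path": path, "size": os.path.getsize(path), "sha256": digest.hexdigest()}


class SketchProcessor:
    """Stand-in for illustration only; not the real ocrd Processor."""

    def __init__(self, search_dirs):
        self.search_dirs = search_dirs
        self._resources = {}  # hidden record of every resource resolved so far

    def resolve_resource(self, name):
        for directory in self.search_dirs:
            path = os.path.join(directory, name)
            if os.path.isfile(path):
                # Side effect: remember what was actually used.
                self._resources[name] = describe_file(path)
                return path
        raise FileNotFoundError(name)


def agent_notes(processor):
    """Format the recorded resources as strings a wrapper could add as agent notes."""
    return [
        f"{name} sha256:{info['sha256']} size:{info['size']}"
        for name, info in sorted(processor._resources.items())
    ]
```

Because the checksum is taken when the file is resolved, this would cover manually copied models and pre-existing installations as well as files fetched by resmgr.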

bertsky commented 9 months ago

> > In METS, we could also use some information on processing dates, e.g. mets:agent/mets:note/@ocrd:date (of xsd:dateTime).

> Indeed, I will need to improve the page-to-alto conversion soon-ish and find a better solution for dates (I haven't forgotten about the feedback on kba/page-to-alto#37 btw) and other metadata. If we had more granular and easier to interpret date info that would help a lot.

Isn't that a separate issue though? In ocrd_modelfactory.page_from_image, we do set PAGE's Created and LastChange – but we do not set the latter whenever we add annotation via a processor's save_xml.

The METS side is independent, though.
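
To make the PAGE side concrete, here is a hedged sketch of refreshing `LastChange` on an already parsed PAGE document using only the standard library. It is not how `save_xml` works; the helper name is hypothetical and the 2019-07-15 PAGE namespace is assumed.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Assumed namespace: the 2019-07-15 version of the PAGE content schema.
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"


def touch_last_change(root, now=None):
    """Set Metadata/LastChange on a parsed PcGts root and return the new value."""
    if now is None:
        now = datetime.now(timezone.utc).replace(microsecond=0)
    elem = root.find(f"{{{PAGE_NS}}}Metadata/{{{PAGE_NS}}}LastChange")
    if elem is None:
        raise ValueError("document has no Metadata/LastChange element")
    elem.text = now.isoformat()
    return elem.text
```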

kba commented 9 months ago

> I would rather like the processor to look at the file exactly when it is used, i.e. during resolve_resource. Since we have that as a method of the Processor class, how about a little side effect: determining the checksum of the retrieved file and storing it in a hidden attribute of the processor instance, like say self._resources? Then our run_processor could automatically add the checksum info during its workspace.mets.add_agent call – no further code changes required!

Yeah, that's the more robust and elegant solution :+1:

> > > In METS, we could also use some information on processing dates, e.g. mets:agent/mets:note/@ocrd:date (of xsd:dateTime).

> > Indeed, I will need to improve the page-to-alto conversion soon-ish and find a better solution for dates (I haven't forgotten about the feedback on kba/page-to-alto#37 btw) and other metadata. If we had more granular and easier to interpret date info that would help a lot.

> Isn't that a separate issue though? In ocrd_modelfactory.page_from_image, we do set PAGE's Created and LastChange – but we do not set the latter whenever we add annotation via a processor's save_xml.

> The METS side is independent, though.

Yeah, sorry, it's late. We had a call on that subject (getting OCR and metadata into digital libraries) today, so it came to mind.

> @bertsky: I don't have cksum in mets.xml (and not in OCR-D-OCR_00001.xml) (installed ocrd/all docker a few weeks ago):

@jbarth-ubhd This was just a proposal by @bertsky of how it could eventually look, not the current situation. We'll still need to implement the checksum, of course.