Open jbarth-ubhd opened 9 months ago
You mean as in
<mets:agent TYPE="OTHER" OTHERTYPE="SOFTWARE" ROLE="OTHER" OTHERROLE="layout/segmentation/region">
<mets:name>ocrd-tesserocr-recognize v0.17.0 (tesseract 5.3.1-25-gcf23)</mets:name>
<mets:note xmlns:ocrd="https://ocr-d.de" ocrd:option="input-file-grp">OCR-D-BIN</mets:note>
<mets:note xmlns:ocrd="https://ocr-d.de" ocrd:option="output-file-grp">OCR-D-BIN-OCR-TESS-frak2021</mets:note>
<mets:note xmlns:ocrd="https://ocr-d.de" ocrd:option="parameter">{"model": "frak2021", "dpi": 0, "padding": 0,
"segmentation_level": "word", "textequiv_level": "word", "overwrite_segments": false, "overwrite_text": true,
"shrink_polygons": false, "block_polygons": false, "find_tables": true, "find_staves": false, "sparse_text": false,
"raw_lines": false, "char_whitelist": "", "char_blacklist": "", "char_unblacklist": "", "tesseract_parameters": {},
"xpath_parameters": {}, "xpath_model": {}, "auto_model": false, "oem": "DEFAULT"}</mets:note>
<mets:note xmlns:ocrd="https://ocr-d.de" ocrd:cksum="1509050540 3421140"/>
<mets:note xmlns:ocrd="https://ocr-d.de" ocrd:option="page-id"/>
</mets:agent>
@jbarth-ubhd, or did you mean the PAGE XML?
In METS, we could also use some information on processing dates, e.g. mets:agent/mets:note/@ocrd:date (of xsd:dateTime). What do you think @kba?
That's a great idea!
Incidentally, we're in the process of dealing with the reality of mass OCR, i.e. what to throw away to keep the amount of data manageable while still retaining as much reproducibility information as possible. This would help.
The tricky part is how and what to hash.
A simple solution would be to assume the checksum is related to the raw data that ocrd resmgr retrieves, i.e. the models or zipped models as they are downloaded via HTTP. We could add the checksum to the resources section of the ocrd-tool.json schema (and therefore the ocrd resmgr schema).
A helpful side effect would be that we notice when models are updated at the same URL (e.g. the messy situation with eynollah currently).
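A minimal sketch of what such a check could look like, assuming a resource entry carried a `checksum` field next to a `size` field. Both keys, the choice of SHA-256, and the function names are assumptions for illustration, not part of the current schema:

```python
import hashlib


def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Return (hex digest, size in bytes) of the file at `path`.

    The file is read in chunks so that large model files need not fit in memory.
    """
    digest = hashlib.new(algorithm)
    size = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return digest.hexdigest(), size


def verify_resource(path, entry):
    """Compare a downloaded file against a resource entry.

    `entry` mimics one item of a resources list; the "checksum" and "size"
    keys are hypothetical and only serve this sketch.
    """
    actual_digest, actual_size = file_checksum(path)
    if "size" in entry and entry["size"] != actual_size:
        return False
    return entry.get("checksum") == actual_digest
```

A mismatch for an unchanged URL would then be exactly the "model was silently updated" case.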
In METS, we could also use some information on processing dates, e.g. mets:agent/mets:note/@ocrd:date (of xsd:dateTime).
Indeed, I will need to improve the page-to-alto conversion soon-ish and find a better solution for dates (I haven't forgotten about the feedback on https://github.com/kba/page-to-alto/pull/37 btw) and other metadata. If we had more granular and easier-to-interpret date info, that would help a lot.
@bertsky: I don't have cksum in mets.xml (and not in OCR-D-OCR_00001.xml) (installed ocrd/all docker a few weeks ago):
<mets:agent TYPE="OTHER" OTHERTYPE="SOFTWARE" ROLE="OTHER" OTHERROLE="layout/segmentation/region">
<mets:name>ocrd-tesserocr-recognize v0.17.0 (tesseract 5.3.3)</mets:name>
<mets:note xmlns:ocrd="https://ocr-d.de" ocrd:option="input-file-grp">OCR-D-005</mets:note>
<mets:note xmlns:ocrd="https://ocr-d.de" ocrd:option="output-file-grp">OCR-D-OCR</mets:note>
<mets:note xmlns:ocrd="https://ocr-d.de" ocrd:option="parameter">{"textequiv_level":
"word", "segmentation_level": "region", "overwrite_segments": true, "model": "frak2021",
"dpi": 0, "padding": 0, "overwrite_text": true, "shrink_polygons": false,
"block_polygons": false, "find_tables": true, "find_staves": false, "sparse_text": false,
"raw_lines": false, "char_whitelist": "", "char_blacklist": "", "char_unblacklist": "",
"tesseract_parameters": {}, "xpath_parameters": {}, "xpath_model": {}, "auto_model":
false, "oem": "DEFAULT"}</mets:note>
<mets:note xmlns:ocrd="https://ocr-d.de" ocrd:option="page-id"/>
</mets:agent>
A simple solution would be to assume the checksum is related to the raw data that ocrd resmgr retrieves, i.e. the models or zipped models as they are downloaded via HTTP. We could add the checksum to the resources section of the ocrd-tool.json schema (and therefore the ocrd resmgr schema).
I don't understand – wouldn't that be the repository side (ocrd-tool.json), rather than the user side (resources.yml)?
We could certainly have resmgr store that information, but what about manual (cp) or existing installations?
I would rather like the processor to look at the file exactly when it is used, i.e. during resolve_resource. Since we have that as a method of the Processor class, how about a little side effect: determining the checksum of the retrieved file and storing it in a hidden attribute of the processor instance, like say self._resources? Then our run_processor could automatically add the checksum info during its workspace.mets.add_agent call – no further code changes required!
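Roughly, the side-effect idea could be sketched like this. `SketchProcessor` and `checksum_notes` are hypothetical stand-ins, not the real OCR-D `Processor` or `add_agent` code, and the use of SHA-256 plus file size in the `ocrd:cksum` value is an assumption:

```python
import hashlib
import os
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
OCRD_NS = "https://ocr-d.de"


class SketchProcessor:
    """Stand-in for a processor class; this is NOT the real OCR-D Processor."""

    def __init__(self, resource_dir):
        self.resource_dir = resource_dir
        # Hidden bookkeeping: resolved path -> (SHA-256 hex digest, size in bytes)
        self._resources = {}

    def resolve_resource(self, name):
        """Resolve a resource name to a path and record its checksum as a side effect."""
        path = os.path.join(self.resource_dir, name)
        digest = hashlib.sha256()
        size = 0
        with open(path, "rb") as f:  # raises FileNotFoundError if the file is missing
            for chunk in iter(lambda: f.read(1 << 20), b""):
                digest.update(chunk)
                size += len(chunk)
        self._resources[path] = (digest.hexdigest(), size)
        return path


def checksum_notes(processor):
    """Build one mets:note element per resource the processor actually resolved."""
    notes = []
    for path in sorted(processor._resources):
        digest, size = processor._resources[path]
        note = ET.Element(f"{{{METS_NS}}}note")
        note.set(f"{{{OCRD_NS}}}cksum", f"{digest} {size}")
        notes.append(note)
    return notes
```

Because the checksum is taken at the moment of use, this would also cover models that were copied in manually rather than installed via resmgr.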
In METS, we could also use some information on processing dates, e.g. mets:agent/mets:note/@ocrd:date (of xsd:dateTime).
Indeed, I will need to improve the page-to-alto conversion soon-ish and find better solution for dates (I haven't forgotten about the feedback on kba/page-to-alto#37 btw) and other metadata. If we had more granular and easier to interpret date info that would help a lot.
Isn't that a separate issue though? In ocrd_modelfactory.page_from_image, we do set PAGE's Created and LastChange – but we do not set the latter whenever we add annotation via a processor's save_xml.
The METS side is independent, though.
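For the METS side, a date note along the lines of mets:agent/mets:note/@ocrd:date could be produced as in the following sketch. The `date_note` helper is hypothetical, and whether the timestamp should be UTC is an assumption:

```python
from datetime import datetime, timezone
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
OCRD_NS = "https://ocr-d.de"

# Only affects the prefixes chosen when serializing; element identity is by namespace URI.
ET.register_namespace("mets", METS_NS)
ET.register_namespace("ocrd", OCRD_NS)


def date_note(when=None):
    """Return a mets:note element with an ocrd:date attribute in xsd:dateTime form."""
    if when is None:
        when = datetime.now(timezone.utc)
    note = ET.Element(f"{{{METS_NS}}}note")
    # ISO 8601 with second precision and an explicit offset is a valid xsd:dateTime
    note.set(f"{{{OCRD_NS}}}date", when.isoformat(timespec="seconds"))
    return note
```

Calling `ET.tostring(date_note(), encoding="unicode")` shows the serialized element.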
I would rather like the processor to look at the file exactly when it is used, i.e. during resolve_resource. Since we have that as a method of the Processor class, how about a little side effect: determining the checksum of the retrieved file and storing it in a hidden attribute of the processor instance, like say self._resources? Then our run_processor could automatically add the checksum info during its workspace.mets.add_agent call – no further code changes required!
Yeah, that's the more robust and elegant solution :+1:
Isn't that a separate issue though? In ocrd_modelfactory.page_from_image, we do set PAGE's Created and LastChange – but we do not set the latter whenever we add annotation via a processor's save_xml.
The METS side is independent, though.
Yeah, sry, it's late. We had a call on that subject (getting OCR and metadata into digital library) today, so it came to mind.
@bertsky: I don't have cksum in mets.xml (and not in OCR-D-OCR_00001.xml) (installed ocrd/all docker a few weeks ago):
@jbarth-ubhd This was just a proposal by @bertsky for how it could eventually look, not the current situation. We'll still need to implement chksum, of course.
For reproducibility, it would be nice to have a checksum and/or file size of the models used recorded in the XML.