OCR-D / ocrd-demo-mets-server


idea: add page filtering #2

Open bertsky opened 10 months ago

bertsky commented 10 months ago

In the parallel case, when computing the page range expression, we could add a filter to remove empty or cover pages from the processing pipeline (possibly also just creating an empty annotation for them via ocrd-dummy).

kba commented 10 months ago

Good idea. We could probably reuse the blacklist_logical_elements mechanism used in ODEM.

@m3ssman Besides front and back cover, what kind of pages do you exclude from processing with OCR-D?

bertsky commented 10 months ago

> Good idea. We could probably reuse the blacklist_logical_elements mechanism used in ODEM.
>
> @M3ssman Besides front and back cover, what kind of pages do you exclude from processing with OCR-D?

https://github.com/ulb-sachsen-anhalt/ocrd-odem/blob/302745e7045f1aab46ffcc8f3395b9e823808143/resources/odem.ini#L82-L84

For logical, I would also recommend cover,binding,bookplate,endsheet,privileges,note,spine,paste_down,colour_checker – cf. https://github.com/OCR-D/spec/issues/192
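To illustrate the TYPE-based idea, here is a minimal sketch of how such a blacklist could be resolved to physical pages. It assumes a DFG-style METS in which logical divs are linked to page divs via `mets:smLink`; the function name `split_pages` and the constant `BLACKLIST` are hypothetical and not part of OCR-D or ODEM.

```python
import xml.etree.ElementTree as ET

METS = "{http://www.loc.gov/METS/}"
XLINK = "{http://www.w3.org/1999/xlink}"
NS = {"mets": "http://www.loc.gov/METS/"}

# Logical TYPE values suggested in this thread (assumption: matched exactly).
BLACKLIST = frozenset({
    "cover", "binding", "bookplate", "endsheet", "privileges",
    "note", "spine", "paste_down", "colour_checker",
})


def split_pages(mets_xml, blacklist=BLACKLIST):
    """Return (keep, skip): physical page IDs in document order."""
    root = ET.fromstring(mets_xml)

    # 1. Collect IDs of logical divs whose TYPE is blacklisted.
    logical = root.find("mets:structMap[@TYPE='LOGICAL']", NS)
    bad_logical = set()
    if logical is not None:
        bad_logical = {
            div.get("ID")
            for div in logical.iter(METS + "div")
            if div.get("TYPE") in blacklist
        }

    # 2. Follow structLink from those logical divs to physical divs.
    bad_physical = {
        link.get(XLINK + "to")
        for link in root.iter(METS + "smLink")
        if link.get(XLINK + "from") in bad_logical
    }

    # 3. Partition the physical pages.
    physical = root.find("mets:structMap[@TYPE='PHYSICAL']", NS)
    pages = []
    if physical is not None:
        pages = [
            div.get("ID")
            for div in physical.iter(METS + "div")
            if div.get("TYPE") == "page"
        ]
    keep = [p for p in pages if p not in bad_physical]
    skip = [p for p in pages if p in bad_physical]
    return keep, skip
```

The `skip` list is returned rather than discarded so that a caller could still create dummy output for those pages instead of dropping them silently.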

M3ssman commented 10 months ago

In addition, one should consider the attribute values of ORDERLABEL or LABEL, if present.
Of course, this information depends heavily on the data domain of the original digitization project and its legacy workflows, and one even has to deal with misspellings and the like.

AFAICS, at ULB these attributes are only used on page containers, but according to recent METS specs they may also appear within logical structs. We should take this into account, since non-DFG METS does not really care about explicit physical structs.

bertsky commented 10 months ago

@M3ssman you mean something like contains("Auftragszettel,Colorchecker,Leerseite,Rückdeckel,Deckblatt,Vorderdeckel,Illustrat",@LABEL)?

(I have no experience and no data to grub.)
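The label check suggested above could be sketched roughly as follows. This is only an illustration: it assumes a case-insensitive substring match against the keyword stems from the expression above, and the names `LABEL_KEYWORDS` and `is_filtered_label` are hypothetical.

```python
# Keyword stems taken from the expression above; "Illustrat" is a stem
# so that it also matches "Illustration" or "Illustrationen".
LABEL_KEYWORDS = (
    "Auftragszettel", "Colorchecker", "Leerseite", "Rückdeckel",
    "Deckblatt", "Vorderdeckel", "Illustrat",
)


def is_filtered_label(label, keywords=LABEL_KEYWORDS):
    """True if a LABEL or ORDERLABEL value contains one of the keywords.

    Matching is case-insensitive and substring-based, so surrounding
    brackets such as "[Leerseite]" do not matter. Typos are not handled.
    """
    if not label:
        return False
    normalized = label.casefold()
    return any(keyword.casefold() in normalized for keyword in keywords)
```

A plain substring match tolerates brackets and case differences, but not misspellings, so the keyword list would have to be configurable per institution.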

M3ssman commented 10 months ago

@bertsky Yes, like this. If you don't mind taking a closer look at ULB Drucke des 18. Jahrhunderts (VD18): there, nearly every record (click the METS-OAI button at the bottom of the detail pages) contains something like this in its pages' LABEL or ORDERLABEL.

bertsky commented 10 months ago

Got it, thanks! Most interesting. Like you said, it would depend on the particular rules of each digitisation process/institution, plus unintended deviations (typos, brackets).

So IMO your approach of making this configurable is the only adequate solution. The mechanism (config file, envvar or CLI param) should be discussed for OCR-D, though. IMO we need something to prevent unnecessary downloads and processing. But some dummy fallback output even for filtered pages is actually preferable. (So in #1, I would not filter out these pages in a separate pipeline step, but rather have the filter behave like a processing error.)

M3ssman commented 10 months ago

Anyway, I wonder how digital objects are structured at your own institutions. I can't imagine that only Halle has been doing this "Tiefenerschließung" thing since 2009. I really advise you to find out how this was handled in the past at your locations.

@bertsky According to the SLUB OAI API, there are more than 44k records in Dresden.

@kba According to the SBB OAI API, there are more than 32k in Berlin.

M3ssman commented 10 months ago

@bertsky Concerning unintended deviations: AFAICS, the annotation of content-related information like [Colorchecker], [Illustration] or [Leerseite] in the ORDERLABEL, as found in Halle's digital objects, seems to be the semantics Visual Library server's own interpretation. According to the recent METS XSD, this information should rather have been stored as TYPE as well. But this would somewhat increase confusion with our DFG-specific way of dividing/duplicating structure into logical and physical perspectives.

bertsky commented 10 months ago

> @bertsky According to the SLUB OAI API, there are more than 44k records in Dresden.

Indeed, I can easily research this myself – thanks!

In fact, I did (for 15th-18th c. prints), using metha.

It looks like, other than the obvious Strukturdatenset choices (@TYPE binding, spine, paste_down, cover, colophon, figure, endsheet, title_page etc.), we only have unrestricted labels (e.g. Provenienzeintrag ..., Einband des ..., Wappenexlibris ..., mostly of @TYPE=other) at the lower hierarchy levels (where @LABEL and @ORDERLABEL are usually identical). So in our case, it would not be feasible to try to catch this with some filter pattern.