OCR-D / core

Collection of OCR-related python tools and wrappers from @OCR-D
https://ocr-d.de/core/
Apache License 2.0
119 stars 31 forks source link

list_all_resources does not handle subdirectories well #750

Closed bertsky closed 2 years ago

bertsky commented 2 years ago

In ocrd_utils.list_all_resources (used by a processor's --list-resources and resmgr's --list-installed), entries that are themselves directories do not get traversed recursively.

That's okay if the processor takes directory resources (e.g. ocrd-calamari-recognize's h5/json checkpoint dirs), but not if its resources are files in subdirectories (e.g. ocrd-tesserocr-recognize's script/*.traineddata or configs/*).

So IMO a better behaviour would be to list all such directories recursively.

bertsky commented 2 years ago

On the other hand, it will be very difficult to discern the two cases, especially during ResourceManager.add_to_user_database (i.e. unknown resources).

Perhaps we should entirely rule out Tesseract-style subdirectory resources by definition (and prevent such installations during make install-models)?

bertsky commented 2 years ago

But even then, the current implementation of --show-resource does not handle the directory case at all. It insists on is_file=True, and cannot find any. (Files are not handled correctly, too – but that's another story, see #752.)

I think we should not dictate the file status a priori, only its existence. Then, for directories, one could list their contents.

krvoigt commented 2 years ago

Hey team! Please add your planning poker estimate with ZenHub @kba @lena-hinrichsen

kba commented 2 years ago

So IMO a better behaviour would be to list all such directories recursively.

Would we really need full recursion to support tesseract-like subdirectories? From what I have seen, there isn't any deeper nesting than 1, e.g. script/Fraktur or configs/alto?

kba commented 2 years ago

One way to reliably solve this and not output folders for processors that only accept files and vice versa would be to add a resources_are_files boolean to the ocrd-tool.json (which would be false e.g. for calamari or eynollah) and prune the candidates based on that. Or are there any processors (or possible fututure processors) that require both? It won't help to add this info to the resource entries in the resource_list.yml, since we do not know about the resources before we're searching for them (or I might overlook a solution here).

kba commented 2 years ago

But even then, the current implementation of --show-resource does not handle the directory case at all. It insists on is_file=True, and cannot find any

What should we show when calling --show-resource on a directory? Tar-gzipped folder contents?

bertsky commented 2 years ago

One way to reliably solve this and not output folders for processors that only accept files and vice versa would be to add a resources_are_files boolean to the ocrd-tool.json (which would be false e.g. for calamari or eynollah) and prune the candidates based on that. Or are there any processors (or possible fututure processors) that require both? It won't help to add this info to the resource entries in the resource_list.yml, since we do not know about the resources before we're searching for them (or I might overlook a solution here).

Yes, indeed, good idea. But then, why not just use content-type: text/directory or format: directory?

What should we show when calling --show-resource on a directory? Tar-gzipped folder contents?

Well, I don't know what exactly the use-case is for --show-resource of files (presumably redirection to files, esp. if they are binary). But for directories it should at least list the contents (and yes, generating a tar stream would even make redirection possible).

krvoigt commented 2 years ago
kba commented 2 years ago

Yes, indeed, good idea. But then, why not just use content-type: text/directory or format: directory?

format is a property of string properties so it cannot be used on the level of a tool in ocrd-tool.json. content-type is fairly broad, it might be misunderstood as the format of input or output data, rather than resources. How about resource-content-type?

What should we show when calling --show-resource on a directory? Tar-gzipped folder contents?

Well, I don't know what exactly the use-case is for --show-resource of files (presumably redirection to files, esp. if they are binary). But for directories it should at least list the contents (and yes, generating a tar stream would even make redirection possible).

I am not so sure myself anymore. I was vaguely thinking about verifying that a certain resource/model a processor uses is indeed at the exact same version/checksum one expects. Or to share binary data of a resource withouot using the file system. But these are at best potential use cases. Should we drop the feature? Or replace it with an arguably more sensible --show-resource-location which outputs the full path to the resource, like --list-resources does?

bertsky commented 2 years ago

Yes, indeed, good idea. But then, why not just use content-type: text/directory or format: directory?

format is a property of string properties so it cannot be used on the level of a tool in ocrd-tool.json. content-type is fairly broad, it might be misunderstood as the format of input or output data, rather than resources. How about resource-content-type?

Sorry, I should have been more specific: I do mean string type parameters. We currently use them with format=uri and some content-type for model files. But we could also require that model directories use format=uri with content-type=text/directory to signify that fact.

bertsky commented 2 years ago

I am not so sure myself anymore. I was vaguely thinking about verifying that a certain resource/model a processor uses is indeed at the exact same version/checksum one expects. Or to share binary data of a resource withouot using the file system. But these are at best potential use cases. Should we drop the feature?

I do think these use cases are valid (if mere potential at this point). But mind that due to #752 they don't work so far.

Or replace it with an arguably more sensible --show-resource-location which outputs the full path to the resource, like --list-resources does?

No, why? The latter already outputs the full path of the resource.

kba commented 2 years ago

Sorry, I should have been more specific: I do mean string type parameters. We currently use them with format=uri and some content-type for model files. But we could also require that model directories use format=uri with content-type=text/directory to signify that fact.

Oh, right, yes, that is possible and makes sense. I'll adapt https://github.com/OCR-D/spec/pull/189.

This does make it slightly more difficult to implement and could possibly introduce ambiguity (a processor might define one file parameter as text/directory and another as application/json) though. Maybe adding a property for_parameter to the ocrd_resources.yml schema could help here.

But since gitter is down and the resource-content-type changes are obsolete, I'll think about it some more ;-)

Or replace it with an arguably more sensible --show-resource-location which outputs the full path to the resource, like --list-resources does?

No, why? The latter already outputs the full path of the resource.

Mostly because it at least would fullfil its stated purpose, giving you the full path to a resource by name and would not require different behaviors for files vs. directories.

bertsky commented 2 years ago

This does make it slightly more difficult to implement and could possibly introduce ambiguity (a processor might define one file parameter as text/directory and another as application/json) though. Maybe adding a property for_parameter to the ocrd_resources.yml schema could help here.

Oh, I haven't thought about that yet. for_parameter sounds clean but would make managing the database (think renaming paramers) and usage of resmgr a lot more complicated.

Perhaps resource manager could just guess the type from the path name (perhaps restricted by the ocrd-tool parameter mimetypes) or by peeking into the fd?

bertsky commented 2 years ago

Or replace it with an arguably more sensible --show-resource-location which outputs the full path to the resource, like --list-resources does?

No, why? The latter already outputs the full path of the resource.

Mostly because it at least would fullfil its stated purpose, giving you the full path to a resource by name and would not require different behaviors for files vs. directories.

But --list-resources does not need to behave differently, does it?

kba commented 2 years ago

Or replace it with an arguably more sensible --show-resource-location which outputs the full path to the resource, like --list-resources does?

No, why? The latter already outputs the full path of the resource.

Mostly because it at least would fullfil its stated purpose, giving you the full path to a resource by name and would not require different behaviors for files vs. directories.

But --list-resources does not need to behave differently, does it?

No, that remains useful as-is and needs no change.

bertsky commented 2 years ago

No, that remains useful as-is and needs no change.

But then I could just match the suffixes of list-resources with the requested resource name – still no need for an extra --show-resource-location.

kba commented 2 years ago

@bertsky can be closed?

bertsky commented 2 years ago

Currently for ocrd-tesserocr-recognize the subdirectories (configs, fast, script) are listed both recursively and as files. But IIRC this will go away as soon as we overhaul its ocrd-tool.json parameter specs. Everything else seems to work fine now – thanks!

kba commented 2 years ago

Currently for ocrd-tesserocr-recognize the subdirectories (configs, fast, script) are listed both recursively and as files. But IIRC this will go away as soon as we overhaul its ocrd-tool.json parameter specs. Everything else seems to work fine now – thanks!

With content-type set to something that isn't text/directory, the directories won't show. We still show the files in configs though, not just the models. But that's something we'll have to live with I think.

bertsky commented 2 years ago

With content-type set to something that isn't text/directory, the directories won't show.

I do see them though:

ocrd-tesserocr-recognize/GT4HistOCR_100000.traineddata
ocrd-tesserocr-recognize/Fraktur_GT4HistOCR.traineddata
ocrd-tesserocr-recognize/Fraktur.traineddata
ocrd-tesserocr-recognize/chi_sim.traineddata
ocrd-tesserocr-recognize/frk.traineddata
ocrd-tesserocr-recognize/configs
ocrd-tesserocr-recognize/ONB.traineddata
ocrd-tesserocr-recognize/deu-frak.traineddata
ocrd-tesserocr-recognize/GT4HistOCR_2000000.traineddata
ocrd-tesserocr-recognize/frak2021.traineddata
ocrd-tesserocr-recognize/fast
ocrd-tesserocr-recognize/chi_tra_vert.traineddata
ocrd-tesserocr-recognize/deu.traineddata
ocrd-tesserocr-recognize/GT4HistOCR_worddawg.traineddata
ocrd-tesserocr-recognize/Greek.traineddata
ocrd-tesserocr-recognize/script
ocrd-tesserocr-recognize/GT4HistOCR_300000.traineddata
ocrd-tesserocr-recognize/Latin.traineddata
ocrd-tesserocr-recognize/grc.traineddata
ocrd-tesserocr-recognize/lat.traineddata
ocrd-tesserocr-recognize/tessconfigs
ocrd-tesserocr-recognize/pdf.ttf
ocrd-tesserocr-recognize/equ.traineddata
ocrd-tesserocr-recognize/GT4HistOCR.traineddata
ocrd-tesserocr-recognize/osd.traineddata
ocrd-tesserocr-recognize/slk_frak.traineddata
ocrd-tesserocr-recognize/chi_sim_vert.traineddata
ocrd-tesserocr-recognize/deu0.traineddata
ocrd-tesserocr-recognize/eng.traineddata
ocrd-tesserocr-recognize/foo.traineddata
ocrd-tesserocr-recognize/chi_tra.traineddata
ocrd-tesserocr-recognize/configs/wordstrbox
ocrd-tesserocr-recognize/configs/api_config
ocrd-tesserocr-recognize/configs/lstm.train
ocrd-tesserocr-recognize/configs/box.train.stderr
ocrd-tesserocr-recognize/configs/digits
ocrd-tesserocr-recognize/configs/txt
ocrd-tesserocr-recognize/configs/unlv
ocrd-tesserocr-recognize/configs/inter
ocrd-tesserocr-recognize/configs/makebox
ocrd-tesserocr-recognize/configs/linebox
ocrd-tesserocr-recognize/configs/kannada
ocrd-tesserocr-recognize/configs/lstmbox
ocrd-tesserocr-recognize/configs/bigram
ocrd-tesserocr-recognize/configs/hocr
ocrd-tesserocr-recognize/configs/alto
ocrd-tesserocr-recognize/configs/ambigs.train
ocrd-tesserocr-recognize/configs/rebox
ocrd-tesserocr-recognize/configs/tsv
ocrd-tesserocr-recognize/configs/logfile
ocrd-tesserocr-recognize/configs/box.train
ocrd-tesserocr-recognize/configs/quiet
ocrd-tesserocr-recognize/configs/lstmdebug
ocrd-tesserocr-recognize/configs/strokewidth
ocrd-tesserocr-recognize/configs/pdf
ocrd-tesserocr-recognize/configs/get.images
ocrd-tesserocr-recognize/fast/Fraktur_50000000.502_198857.traineddata
ocrd-tesserocr-recognize/fast/Fraktur_50000000.334_450937.traineddata
ocrd-tesserocr-recognize/script/Fraktur.traineddata
ocrd-tesserocr-recognize/script/Latin.traineddata
ocrd-tesserocr-recognize/tessconfigs/msdemo
ocrd-tesserocr-recognize/tessconfigs/nobatch
ocrd-tesserocr-recognize/tessconfigs/segdemo
ocrd-tesserocr-recognize/tessconfigs/matdemo
ocrd-tesserocr-recognize/tessconfigs/batch

Notice the lines:

ocrd-tesserocr-recognize/configs
ocrd-tesserocr-recognize/script
ocrd-tesserocr-recognize/fast

We still show the files in configs though, not just the models. But that's something we'll have to live with I think.

Yes. (Perhaps we even want to offer them as additional parameter tesseract_config?)

kba commented 2 years ago

Are you sure you have reinstalled or installed with pip install -e. I do not see the directories with content-type: application/octet-stream:

ocrd-tesserocr-recognize -L
/home/kba/.local/share/ocrd-resources/ocrd-tesserocr-recognize/equ.traineddata
/home/kba/.local/share/ocrd-resources/ocrd-tesserocr-recognize/ONB.traineddata                                                                                 
/home/kba/.local/share/ocrd-resources/ocrd-tesserocr-recognize/osd.traineddata
/home/kba/.local/share/ocrd-resources/ocrd-tesserocr-recognize/Latin.traineddata
/home/kba/.local/share/ocrd-resources/ocrd-tesserocr-recognize/frk.traineddata
/home/kba/.local/share/ocrd-resources/ocrd-tesserocr-recognize/fra.traineddata
/home/kba/.local/share/ocrd-resources/ocrd-tesserocr-recognize/deu.traineddata
/home/kba/.local/share/ocrd-resources/ocrd-tesserocr-recognize/nld.traineddata
/home/kba/.local/share/ocrd-resources/ocrd-tesserocr-recognize/eng.traineddata
/home/kba/.local/share/ocrd-resources/ocrd-tesserocr-recognize/Fraktur.traineddata
/home/kba/.local/share/ocrd-resources/ocrd-tesserocr-recognize/Fraktur_GT4HistOCR.traineddata
/home/kba/.local/share/ocrd-resources/ocrd-tesserocr-recognize/configs/linebox
/home/kba/.local/share/ocrd-resources/ocrd-tesserocr-recognize/configs/box.train.stderr
/home/kba/.local/share/ocrd-resources/ocrd-tesserocr-recognize/configs/get.images
/home/kba/.local/share/ocrd-resources/ocrd-tesserocr-recognize/configs/api_config
/home/kba/.local/share/ocrd-resources/ocrd-tesserocr-recognize/configs/digits 
/home/kba/.local/share/ocrd-resources/ocrd-tesserocr-recognize/configs/strokewidth
/home/kba/.local/share/ocrd-resources/ocrd-tesserocr-recognize/configs/unlv   
/home/kba/.local/share/ocrd-resources/ocrd-tesserocr-recognize/configs/box.train
/home/kba/.local/share/ocrd-resources/ocrd-tesserocr-recognize/configs/alto   
/home/kba/.local/share/ocrd-resources/ocrd-tesserocr-recognize/configs/quiet  
/home/kba/.local/share/ocrd-resources/ocrd-tesserocr-recognize/configs/tsv    
/home/kba/.local/share/ocrd-resources/ocrd-tesserocr-recognize/configs/makebox
/home/kba/.local/share/ocrd-resources/ocrd-tesserocr-recognize/configs/inter  
/home/kba/.local/share/ocrd-resources/ocrd-tesserocr-recognize/configs/kannada    
/home/kba/.local/share/ocrd-resources/ocrd-tesserocr-recognize/configs/rebox                                                                                      
/home/kba/.local/share/ocrd-resources/ocrd-tesserocr-recognize/configs/bigram 
/home/kba/.local/share/ocrd-resources/ocrd-tesserocr-recognize/configs/hocr            
/home/kba/.local/share/ocrd-resources/ocrd-tesserocr-recognize/configs/lstmbox   
/home/kba/.local/share/ocrd-resources/ocrd-tesserocr-recognize/configs/lstm.train
/home/kba/.local/share/ocrd-resources/ocrd-tesserocr-recognize/configs/bazaar
/home/kba/.local/share/ocrd-resources/ocrd-tesserocr-recognize/configs/logfile    
/home/kba/.local/share/ocrd-resources/ocrd-tesserocr-recognize/configs/ambigs.train
/home/kba/.local/share/ocrd-resources/ocrd-tesserocr-recognize/configs/lstmdebug
/home/kba/.local/share/ocrd-resources/ocrd-tesserocr-recognize/configs/Makefile.am
/home/kba/.local/share/ocrd-resources/ocrd-tesserocr-recognize/configs/wordstrbox
/home/kba/.local/share/ocrd-resources/ocrd-tesserocr-recognize/configs/pdf
/home/kba/.local/share/ocrd-resources/ocrd-tesserocr-recognize/configs/txt    

(and we really need to sort this output)

bertsky commented 2 years ago

Are you sure you have reinstalled or installed with pip install -e

Yes, I am sure. Clone of core is at 836eb05e791a196ea7f7e00b6f7438d74ebca582, I used make install, and venv reports:

ocrd, version 2.30.0

I can still see the directory entries.

(and we really need to sort this output)

speaking of which: in #792 I have added sorting via https://github.com/OCR-D/core/pull/792/commits/d108dd06392caed06286b2c3ebb7ac491581e6a7, but that only gives sorted output of ocrd resmgr, not of TOOL -L. So I should probably move the line to ocrd_utils.list_all_resources