Closed bertsky closed 2 years ago
On the other hand, it will be very difficult to discern the two cases, especially during ResourceManager.add_to_user_database
(i.e. unknown resources).
Perhaps we should entirely rule out Tesseract-style subdirectory resources by definition (and prevent such installations during make install-models
)?
But even then, the current implementation of --show-resource
does not handle the directory case at all. It insists on is_file=True
, and cannot find any. (Files are not handled correctly, too – but that's another story, see #752.)
I think we should not dictate the file status a priori, only its existence. Then, for directories, one could list their contents.
Hey team! Please add your planning poker estimate with ZenHub @kba @lena-hinrichsen
So IMO a better behaviour would be to list all such directories recursively.
Would we really need full recursion to support tesseract-like subdirectories? From what I have seen, there isn't any deeper nesting than 1, e.g. script/Fraktur
or configs/alto
?
One way to reliably solve this and not output folders for processors that only accept files and vice versa would be to add a resources_are_files
boolean to the ocrd-tool.json (which would be false
e.g. for calamari or eynollah) and prune the candidates based on that. Or are there any processors (or possible fututure processors) that require both? It won't help to add this info to the resource entries in the resource_list.yml
, since we do not know about the resources before we're searching for them (or I might overlook a solution here).
But even then, the current implementation of
--show-resource
does not handle the directory case at all. It insists onis_file=True
, and cannot find any
What should we show when calling --show-resource
on a directory? Tar-gzipped folder contents?
One way to reliably solve this and not output folders for processors that only accept files and vice versa would be to add a
resources_are_files
boolean to the ocrd-tool.json (which would befalse
e.g. for calamari or eynollah) and prune the candidates based on that. Or are there any processors (or possible fututure processors) that require both? It won't help to add this info to the resource entries in theresource_list.yml
, since we do not know about the resources before we're searching for them (or I might overlook a solution here).
Yes, indeed, good idea. But then, why not just use content-type: text/directory
or format: directory
?
What should we show when calling
--show-resource
on a directory? Tar-gzipped folder contents?
Well, I don't know what exactly the use-case is for --show-resource
of files (presumably redirection to files, esp. if they are binary). But for directories it should at least list the contents (and yes, generating a tar stream would even make redirection possible).
Yes, indeed, good idea. But then, why not just use
content-type: text/directory
orformat: directory
?
format
is a property of string properties so it cannot be used on the level of a tool
in ocrd-tool.json
. content-type
is fairly broad, it might be misunderstood as the format of input or output data, rather than resources. How about resource-content-type
?
What should we show when calling
--show-resource
on a directory? Tar-gzipped folder contents?Well, I don't know what exactly the use-case is for
--show-resource
of files (presumably redirection to files, esp. if they are binary). But for directories it should at least list the contents (and yes, generating a tar stream would even make redirection possible).
I am not so sure myself anymore. I was vaguely thinking about verifying that a certain resource/model a processor uses is indeed at the exact same version/checksum one expects. Or to share binary data of a resource withouot using the file system. But these are at best potential use cases. Should we drop the feature? Or replace it with an arguably more sensible --show-resource-location
which outputs the full path to the resource, like --list-resources
does?
Yes, indeed, good idea. But then, why not just use
content-type: text/directory
orformat: directory
?
format
is a property of string properties so it cannot be used on the level of atool
inocrd-tool.json
.content-type
is fairly broad, it might be misunderstood as the format of input or output data, rather than resources. How aboutresource-content-type
?
Sorry, I should have been more specific: I do mean string
type parameters. We currently use them with format=uri
and some content-type
for model files. But we could also require that model directories use format=uri
with content-type=text/directory
to signify that fact.
I am not so sure myself anymore. I was vaguely thinking about verifying that a certain resource/model a processor uses is indeed at the exact same version/checksum one expects. Or to share binary data of a resource withouot using the file system. But these are at best potential use cases. Should we drop the feature?
I do think these use cases are valid (if mere potential at this point). But mind that due to #752 they don't work so far.
Or replace it with an arguably more sensible
--show-resource-location
which outputs the full path to the resource, like--list-resources
does?
No, why? The latter already outputs the full path of the resource.
Sorry, I should have been more specific: I do mean
string
type parameters. We currently use them withformat=uri
and somecontent-type
for model files. But we could also require that model directories useformat=uri
withcontent-type=text/directory
to signify that fact.
Oh, right, yes, that is possible and makes sense. I'll adapt https://github.com/OCR-D/spec/pull/189.
This does make it slightly more difficult to implement and could possibly introduce ambiguity (a processor might define one file parameter as text/directory
and another as application/json
) though. Maybe adding a property for_parameter
to the ocrd_resources.yml
schema could help here.
But since gitter is down and the resource-content-type
changes are obsolete, I'll think about it some more ;-)
Or replace it with an arguably more sensible
--show-resource-location
which outputs the full path to the resource, like--list-resources
does?No, why? The latter already outputs the full path of the resource.
Mostly because it at least would fullfil its stated purpose, giving you the full path to a resource by name and would not require different behaviors for files vs. directories.
This does make it slightly more difficult to implement and could possibly introduce ambiguity (a processor might define one file parameter as
text/directory
and another asapplication/json
) though. Maybe adding a propertyfor_parameter
to theocrd_resources.yml
schema could help here.
Oh, I haven't thought about that yet. for_parameter
sounds clean but would make managing the database (think renaming paramers) and usage of resmgr a lot more complicated.
Perhaps resource manager could just guess the type from the path name (perhaps restricted by the ocrd-tool parameter mimetypes) or by peeking into the fd?
Or replace it with an arguably more sensible
--show-resource-location
which outputs the full path to the resource, like--list-resources
does?No, why? The latter already outputs the full path of the resource.
Mostly because it at least would fullfil its stated purpose, giving you the full path to a resource by name and would not require different behaviors for files vs. directories.
But --list-resources
does not need to behave differently, does it?
Or replace it with an arguably more sensible
--show-resource-location
which outputs the full path to the resource, like--list-resources
does?No, why? The latter already outputs the full path of the resource.
Mostly because it at least would fullfil its stated purpose, giving you the full path to a resource by name and would not require different behaviors for files vs. directories.
But
--list-resources
does not need to behave differently, does it?
No, that remains useful as-is and needs no change.
No, that remains useful as-is and needs no change.
But then I could just match the suffixes of list-resources
with the requested resource name – still no need for an extra --show-resource-location
.
@bertsky can be closed?
Currently for ocrd-tesserocr-recognize the subdirectories (configs
, fast
, script
) are listed both recursively and as files. But IIRC this will go away as soon as we overhaul its ocrd-tool.json parameter specs. Everything else seems to work fine now – thanks!
Currently for ocrd-tesserocr-recognize the subdirectories (
configs
,fast
,script
) are listed both recursively and as files. But IIRC this will go away as soon as we overhaul its ocrd-tool.json parameter specs. Everything else seems to work fine now – thanks!
With content-type
set to something that isn't text/directory
, the directories won't show. We still show the files in configs
though, not just the models. But that's something we'll have to live with I think.
With
content-type
set to something that isn'ttext/directory
, the directories won't show.
I do see them though:
ocrd-tesserocr-recognize/GT4HistOCR_100000.traineddata
ocrd-tesserocr-recognize/Fraktur_GT4HistOCR.traineddata
ocrd-tesserocr-recognize/Fraktur.traineddata
ocrd-tesserocr-recognize/chi_sim.traineddata
ocrd-tesserocr-recognize/frk.traineddata
ocrd-tesserocr-recognize/configs
ocrd-tesserocr-recognize/ONB.traineddata
ocrd-tesserocr-recognize/deu-frak.traineddata
ocrd-tesserocr-recognize/GT4HistOCR_2000000.traineddata
ocrd-tesserocr-recognize/frak2021.traineddata
ocrd-tesserocr-recognize/fast
ocrd-tesserocr-recognize/chi_tra_vert.traineddata
ocrd-tesserocr-recognize/deu.traineddata
ocrd-tesserocr-recognize/GT4HistOCR_worddawg.traineddata
ocrd-tesserocr-recognize/Greek.traineddata
ocrd-tesserocr-recognize/script
ocrd-tesserocr-recognize/GT4HistOCR_300000.traineddata
ocrd-tesserocr-recognize/Latin.traineddata
ocrd-tesserocr-recognize/grc.traineddata
ocrd-tesserocr-recognize/lat.traineddata
ocrd-tesserocr-recognize/tessconfigs
ocrd-tesserocr-recognize/pdf.ttf
ocrd-tesserocr-recognize/equ.traineddata
ocrd-tesserocr-recognize/GT4HistOCR.traineddata
ocrd-tesserocr-recognize/osd.traineddata
ocrd-tesserocr-recognize/slk_frak.traineddata
ocrd-tesserocr-recognize/chi_sim_vert.traineddata
ocrd-tesserocr-recognize/deu0.traineddata
ocrd-tesserocr-recognize/eng.traineddata
ocrd-tesserocr-recognize/foo.traineddata
ocrd-tesserocr-recognize/chi_tra.traineddata
ocrd-tesserocr-recognize/configs/wordstrbox
ocrd-tesserocr-recognize/configs/api_config
ocrd-tesserocr-recognize/configs/lstm.train
ocrd-tesserocr-recognize/configs/box.train.stderr
ocrd-tesserocr-recognize/configs/digits
ocrd-tesserocr-recognize/configs/txt
ocrd-tesserocr-recognize/configs/unlv
ocrd-tesserocr-recognize/configs/inter
ocrd-tesserocr-recognize/configs/makebox
ocrd-tesserocr-recognize/configs/linebox
ocrd-tesserocr-recognize/configs/kannada
ocrd-tesserocr-recognize/configs/lstmbox
ocrd-tesserocr-recognize/configs/bigram
ocrd-tesserocr-recognize/configs/hocr
ocrd-tesserocr-recognize/configs/alto
ocrd-tesserocr-recognize/configs/ambigs.train
ocrd-tesserocr-recognize/configs/rebox
ocrd-tesserocr-recognize/configs/tsv
ocrd-tesserocr-recognize/configs/logfile
ocrd-tesserocr-recognize/configs/box.train
ocrd-tesserocr-recognize/configs/quiet
ocrd-tesserocr-recognize/configs/lstmdebug
ocrd-tesserocr-recognize/configs/strokewidth
ocrd-tesserocr-recognize/configs/pdf
ocrd-tesserocr-recognize/configs/get.images
ocrd-tesserocr-recognize/fast/Fraktur_50000000.502_198857.traineddata
ocrd-tesserocr-recognize/fast/Fraktur_50000000.334_450937.traineddata
ocrd-tesserocr-recognize/script/Fraktur.traineddata
ocrd-tesserocr-recognize/script/Latin.traineddata
ocrd-tesserocr-recognize/tessconfigs/msdemo
ocrd-tesserocr-recognize/tessconfigs/nobatch
ocrd-tesserocr-recognize/tessconfigs/segdemo
ocrd-tesserocr-recognize/tessconfigs/matdemo
ocrd-tesserocr-recognize/tessconfigs/batch
Notice the lines:
ocrd-tesserocr-recognize/configs
ocrd-tesserocr-recognize/script
ocrd-tesserocr-recognize/fast
We still show the files in
configs
though, not just the models. But that's something we'll have to live with I think.
Yes. (Perhaps we even want to offer them as additional parameter tesseract_config
?)
Are you sure you have reinstalled or installed with pip install -e
. I do not see the directories with content-type: application/octet-stream
:
ocrd-tesserocr-recognize -L
/home/kba/.local/share/ocrd-resources/ocrd-tesserocr-recognize/equ.traineddata
/home/kba/.local/share/ocrd-resources/ocrd-tesserocr-recognize/ONB.traineddata
/home/kba/.local/share/ocrd-resources/ocrd-tesserocr-recognize/osd.traineddata
/home/kba/.local/share/ocrd-resources/ocrd-tesserocr-recognize/Latin.traineddata
/home/kba/.local/share/ocrd-resources/ocrd-tesserocr-recognize/frk.traineddata
/home/kba/.local/share/ocrd-resources/ocrd-tesserocr-recognize/fra.traineddata
/home/kba/.local/share/ocrd-resources/ocrd-tesserocr-recognize/deu.traineddata
/home/kba/.local/share/ocrd-resources/ocrd-tesserocr-recognize/nld.traineddata
/home/kba/.local/share/ocrd-resources/ocrd-tesserocr-recognize/eng.traineddata
/home/kba/.local/share/ocrd-resources/ocrd-tesserocr-recognize/Fraktur.traineddata
/home/kba/.local/share/ocrd-resources/ocrd-tesserocr-recognize/Fraktur_GT4HistOCR.traineddata
/home/kba/.local/share/ocrd-resources/ocrd-tesserocr-recognize/configs/linebox
/home/kba/.local/share/ocrd-resources/ocrd-tesserocr-recognize/configs/box.train.stderr
/home/kba/.local/share/ocrd-resources/ocrd-tesserocr-recognize/configs/get.images
/home/kba/.local/share/ocrd-resources/ocrd-tesserocr-recognize/configs/api_config
/home/kba/.local/share/ocrd-resources/ocrd-tesserocr-recognize/configs/digits
/home/kba/.local/share/ocrd-resources/ocrd-tesserocr-recognize/configs/strokewidth
/home/kba/.local/share/ocrd-resources/ocrd-tesserocr-recognize/configs/unlv
/home/kba/.local/share/ocrd-resources/ocrd-tesserocr-recognize/configs/box.train
/home/kba/.local/share/ocrd-resources/ocrd-tesserocr-recognize/configs/alto
/home/kba/.local/share/ocrd-resources/ocrd-tesserocr-recognize/configs/quiet
/home/kba/.local/share/ocrd-resources/ocrd-tesserocr-recognize/configs/tsv
/home/kba/.local/share/ocrd-resources/ocrd-tesserocr-recognize/configs/makebox
/home/kba/.local/share/ocrd-resources/ocrd-tesserocr-recognize/configs/inter
/home/kba/.local/share/ocrd-resources/ocrd-tesserocr-recognize/configs/kannada
/home/kba/.local/share/ocrd-resources/ocrd-tesserocr-recognize/configs/rebox
/home/kba/.local/share/ocrd-resources/ocrd-tesserocr-recognize/configs/bigram
/home/kba/.local/share/ocrd-resources/ocrd-tesserocr-recognize/configs/hocr
/home/kba/.local/share/ocrd-resources/ocrd-tesserocr-recognize/configs/lstmbox
/home/kba/.local/share/ocrd-resources/ocrd-tesserocr-recognize/configs/lstm.train
/home/kba/.local/share/ocrd-resources/ocrd-tesserocr-recognize/configs/bazaar
/home/kba/.local/share/ocrd-resources/ocrd-tesserocr-recognize/configs/logfile
/home/kba/.local/share/ocrd-resources/ocrd-tesserocr-recognize/configs/ambigs.train
/home/kba/.local/share/ocrd-resources/ocrd-tesserocr-recognize/configs/lstmdebug
/home/kba/.local/share/ocrd-resources/ocrd-tesserocr-recognize/configs/Makefile.am
/home/kba/.local/share/ocrd-resources/ocrd-tesserocr-recognize/configs/wordstrbox
/home/kba/.local/share/ocrd-resources/ocrd-tesserocr-recognize/configs/pdf
/home/kba/.local/share/ocrd-resources/ocrd-tesserocr-recognize/configs/txt
(and we really need to sort this output)
Are you sure you have reinstalled or installed with
pip install -e
Yes, I am sure. Clone of core is at 836eb05e791a196ea7f7e00b6f7438d74ebca582, I used make install
, and venv reports:
ocrd, version 2.30.0
I can still see the directory entries.
(and we really need to sort this output)
speaking of which: in #792 I have added sorting via https://github.com/OCR-D/core/pull/792/commits/d108dd06392caed06286b2c3ebb7ac491581e6a7, but that only gives sorted output of ocrd resmgr
, not of TOOL -L
. So I should probably move the line to ocrd_utils.list_all_resources
In
ocrd_utils.list_all_resources
(used by a processor's--list-resources
and resmgr's--list-installed
), entries that are themselves directories do not get traversed recursively.That's okay if the processor takes directory resources (e.g. ocrd-calamari-recognize's h5/json checkpoint dirs), but not if its resources are files in subdirectories (e.g. ocrd-tesserocr-recognize's
script/*.traineddata
orconfigs/*
).So IMO a better behaviour would be to list all such directories recursively.