OCR-D / core

Collection of OCR-related python tools and wrappers from @OCR-D
https://ocr-d.de/core/
Apache License 2.0

workspace.download_file - not downloading transitive files #1115

Open kba opened 1 year ago

kba commented 1 year ago

Noticed while fixing the broken tests in https://github.com/OCR-D/ocrd_kraken/pull/42:

Here, we use Resolver.workspace_from_url without download, which copies the mets.xml and nothing else.

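import os
import shutil

import pytest
from ocrd import Resolver

@pytest.fixture()
def workspace(tmpdir):
    if os.path.exists(tmpdir):
        shutil.rmtree(tmpdir)
    # `assets` is the test suite's assets helper; no download is requested here
    workspace = Resolver().workspace_from_url(
        assets.path_to('kant_aufklaerung_1784/data/mets.xml'),
        dst_dir=tmpdir
    )
    return workspace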

In the processors, the PAGE-XML is downloaded via:

pcgts = page_from_file(self.workspace.download_file(input_file))
image_url = pcgts.get_Page().imageFilename
# [...]
    image = self.workspace.resolve_image_as_pil(image_url)

This is apparently broken because the image file is not downloaded and tests fail.

So either I debug this properly to find out why the baseurl mechanism does not work here or we finally get rid of the long-deprecated resolve_image_as_pil altogether.
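To make the failure mode concrete, here is a minimal, standard-library-only sketch of what seems to happen. It is a toy model, not OCR-D code: `make_source_workspace`, `clone_mets_only` and `fetch` are hypothetical stand-ins (for the test assets, for `workspace_from_url` without download, and for `download_file` respectively), and the file names are invented.

```python
import shutil
import tempfile
from pathlib import Path


def make_source_workspace(root: Path) -> Path:
    """Create a toy workspace with a METS file, one PAGE file and one image."""
    (root / "OCR-D-IMG").mkdir(parents=True)
    (root / "OCR-D-GT-PAGE").mkdir()
    (root / "mets.xml").write_text("<mets/>")
    (root / "OCR-D-IMG" / "page_0001.tif").write_bytes(b"fake image bytes")
    (root / "OCR-D-GT-PAGE" / "page_0001.xml").write_text(
        '<Page imageFilename="OCR-D-IMG/page_0001.tif"/>'
    )
    return root


def clone_mets_only(src: Path, dst: Path) -> Path:
    """Hypothetical stand-in for cloning without download: copy mets.xml only."""
    dst.mkdir(parents=True, exist_ok=True)
    shutil.copy(src / "mets.xml", dst / "mets.xml")
    return dst


def fetch(src: Path, dst: Path, relpath: str) -> Path:
    """Hypothetical stand-in for an on-demand download of a single file."""
    target = dst / relpath
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy(src / relpath, target)
    return target


with tempfile.TemporaryDirectory() as tmp:
    src = make_source_workspace(Path(tmp) / "src")
    dst = clone_mets_only(src, Path(tmp) / "dst")

    # The processor fetches the PAGE file explicitly ...
    page = fetch(src, dst, "OCR-D-GT-PAGE/page_0001.xml")
    # ... but nothing fetches the image that the PAGE file references.
    image_rel = "OCR-D-IMG/page_0001.tif"

    page_present = page.exists()
    image_present_in_clone = (dst / image_rel).exists()
    image_present_in_source = (src / image_rel).exists()
```

In this model the PAGE file arrives in the clone, but the image it points to exists only in the source directory, so any lookup relative to the clone fails unless something falls back to the original location.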

bertsky commented 3 months ago

> Here, we use Resolver.workspace_from_url without download, which copies the mets.xml and nothing else.

Like I already (later) said in #1149, cloning from local workspaces is still fundamentally broken.

> In the processors, the PAGE-XML is downloaded via [...] This is apparently broken because the image file is not downloaded and tests fail.

Like I already said in #809, the download changes the relative local path that the PAGE files might expect.
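The effect of a changed local path can be illustrated with plain path arithmetic. This sketch assumes, purely for illustration, that a consumer resolves the image reference relative to the directory of the PAGE file; `resolve_reference` and all paths are hypothetical and not taken from OCR-D.

```python
import posixpath
from pathlib import PurePosixPath


def resolve_reference(page_path: str, image_ref: str) -> str:
    """Resolve a relative image reference against the PAGE file's directory."""
    base = PurePosixPath(page_path).parent
    return posixpath.normpath(str(base / image_ref))


image_ref = "../OCR-D-IMG/page_0001.tif"

# Before the download: the PAGE file sits next to its image in the source.
original = resolve_reference("/data/src/OCR-D-GT-PAGE/page_0001.xml", image_ref)

# After the download: the same reference now points into the destination.
after_download = resolve_reference("/tmp/dst/OCR-D-GT-PAGE/page_0001.xml", image_ref)
```

The reference string is unchanged, but it now resolves to a location in the destination directory, where the image may never have been copied.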

> or we finally get rid of the long-deprecated resolve_image_as_pil altogether.

I cannot see anything wrong with that function itself.