OpenPecha / Toolkit

🛠 Tools to create, edit and export texts and annotations
https://toolkit.openpecha.org
Apache License 2.0

new zipped format for google_books output #215

Closed: eroux closed this issue 1 year ago

eroux commented 1 year ago

We should update HOCRBDRCFileProvider to handle the following change in the output layout. Instead of

output/
   W1KG10193-I1KG10195/
      00000001.html
      00000002.html
      ...

we now have

output/
   W1KG10193-I1KG10195/
      html.zip

where html.zip contains

00000001.html
00000002.html
...

Well, that's the theory; in practice this is blocked by https://github.com/buda-base/ao-google-books/issues/38 .
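For reference, reading pages from the zipped layout could look roughly like the sketch below. It is not the actual HOCRBDRCFileProvider code: the function name iter_hocr_pages and the fallback to loose .html files are assumptions made for illustration, and it only uses the standard zipfile module.

```python
import zipfile
from pathlib import Path


def iter_hocr_pages(image_group_dir):
    """Yield (filename, html_text) pairs for one image group directory.

    Hypothetical helper: handles both layouts described above, either a
    single html.zip archive or loose NNNNNNNN.html files (older layout).
    """
    image_group_dir = Path(image_group_dir)
    zip_path = image_group_dir / "html.zip"
    if zip_path.is_file():
        # New layout: all pages live inside html.zip.
        with zipfile.ZipFile(zip_path) as archive:
            names = sorted(n for n in archive.namelist() if n.endswith(".html"))
            for name in names:
                yield name, archive.read(name).decode("utf-8")
    else:
        # Old layout: one .html file per page, directly in the directory.
        for path in sorted(image_group_dir.glob("*.html")):
            yield path.name, path.read_text(encoding="utf-8")
```

Sorting the names keeps pages in image order, since the file names are zero-padded.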

@ta4tsering can you :

?

ta4tsering commented 1 year ago

okay sure, I will look into it and make the changes

eroux commented 1 year ago

@ta4tsering the issue with the files has been fixed. Can you run the complete pipeline for s3://ocr.bdrc.io/Works/dc/W8LS31241/google_books/batch_2022/ and report if there's any issue?

ta4tsering commented 1 year ago

okay sure I will

ta4tsering commented 1 year ago

I517DCE99 is the output OPF for the W8LS31241 Google Books import. W8LS31241 has 13 volumes, but only 10 exist in the BDRC library. To handle that, I changed get_image_list() in the BDRCGBFileProvider class to return an empty list ([]) if image_list is None. (Link to the change in hocr.)
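The idea of that change can be sketched as below. This is a standalone illustration, not the real BDRCGBFileProvider method: the function signature and the shape of buda_data (an "image_groups" mapping with an optional "image_list" per group) are assumptions.

```python
def get_image_list(buda_data, image_group_id):
    """Return the image list for image_group_id, or [] if there is none.

    Hypothetical data shape: buda_data["image_groups"][image_group_id]
    may be missing, or may hold an "image_list" that is None.
    """
    image_group = buda_data.get("image_groups", {}).get(image_group_id) or {}
    image_list = image_group.get("image_list")
    # Normalize a missing list to an empty one so callers can iterate safely.
    return image_list if image_list is not None else []
```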

eroux commented 1 year ago

Thanks! There are a few issues regarding the handling of missing volumes. If you look at https://github.com/OpenPecha-Data/I517DCE99/blob/master/I517DCE99.opf/meta.yml#L111 you can see that there are no images for this volume on BUDA, so the code should not look at it at all. Also, since there's no corresponding base, it shouldn't appear in the bases object of meta.yml. I'll update the query so that empty volumes are not returned by the BUDA API, but the code should also handle this case.
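Something along these lines is what I have in mind for the code side. It is only a sketch: the function names and the field names (image_list, image_group_id, total_pages) are hypothetical and may not match the real meta.yml or the BUDA API response.

```python
def image_groups_with_images(image_groups):
    """Keep only the image groups that have at least one image."""
    return {
        ig_id: info
        for ig_id, info in image_groups.items()
        if info.get("image_list")  # drops None, [], and a missing key
    }


def build_bases(image_groups):
    """Build a bases-like mapping, skipping volumes without images."""
    return {
        ig_id: {"image_group_id": ig_id, "total_pages": len(info["image_list"])}
        for ig_id, info in image_groups_with_images(image_groups).items()
    }
```

Filtering once, before any per-volume processing, means an empty volume is neither read from the OCR output nor listed in bases.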

ta4tsering commented 1 year ago

okay sure

eroux commented 1 year ago

once you're done with the changes, can you reimport I517DCE99 ? (or remove it and reimport W8LS31241 in a new opf)

ta4tsering commented 1 year ago

okay I will make the changes and create a new opf for it and remove the old one

ta4tsering commented 1 year ago

[Screenshot, 2022-11-22 at 11:59:04 AM]

This work should have 10 image groups, but it only has 6, and the information for one of the image groups is missing.

ta4tsering commented 1 year ago

This is in the buda_data returned by get_buda_scan_info(work_id) in the buda api.

eroux commented 1 year ago

thanks! Just pushed a fix

ta4tsering commented 1 year ago

works fine now thanks

ta4tsering commented 1 year ago

OPF link: the OPF for work_id W8LS31241, generated from the google_books output using hocr-formatter.

eroux commented 1 year ago

thanks, it works well! https://library.bdrc.io/show/bdr:IE0OPIA6AC9CE2