Closed eroux closed 1 year ago
okay sure, I will look into it and make the changes
@ta4tsering the issue with the files has been fixed. Can you run the complete pipeline for s3://ocr.bdrc.io/Works/dc/W8LS31241/google_books/batch_2022/
and report if there's any issue?
okay sure I will
I517DCE99 is the output OPF for the W8LS31241 Google Books import. W8LS31241 has 13 volumes, but only 10 exist in the BDRC library. To handle that, I changed get_image_list() in the BDRCGBFileProvider class to return an empty list ([]) if image_list is None. link to the change in hocr
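For readers following along, the change described above can be sketched roughly as below. This is only an illustration: `FileProviderSketch` and its `images_by_group` mapping are hypothetical stand-ins, not the real BDRCGBFileProvider, whose constructor and data source are not shown in this thread.

```python
from typing import Dict, List, Optional


class FileProviderSketch:
    """Hypothetical stand-in for BDRCGBFileProvider (not the real class)."""

    def __init__(self, images_by_group: Dict[str, Optional[List[str]]]):
        # Assumption: image lists are looked up per image group id,
        # and a missing volume shows up as None.
        self.images_by_group = images_by_group

    def get_image_list(self, image_group_id: str) -> List[str]:
        image_list = self.images_by_group.get(image_group_id)
        # Return an empty list instead of None so callers can iterate safely.
        if image_list is None:
            return []
        return image_list
```

With this shape, a caller can loop over the result without a separate None check, although a missing volume still reaches the caller as an empty list.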
Thanks! There are a few issues regarding the handling of missing volumes: if you look at https://github.com/OpenPecha-Data/I517DCE99/blob/master/I517DCE99.opf/meta.yml#L111 you can see that there are no images for this volume on BUDA, so it should not be looked at by the code at all. Also, since there's no corresponding base, it shouldn't appear in the bases object of meta.yml. I'll update the query so that empty volumes are not returned by the BUDA API, but it should also be handled by the code.
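One way to handle this on the code side might look like the sketch below: drop volumes with no images before anything else runs, so they never reach the bases object. The function names and the `image_group_id` / `page_count` fields are hypothetical and chosen for illustration; the actual meta.yml schema is not reproduced here.

```python
from typing import Dict, List


def volumes_with_images(image_groups: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Keep only the image groups that actually have images."""
    return {gid: images for gid, images in image_groups.items() if images}


def build_bases(image_groups: Dict[str, List[str]]) -> Dict[str, dict]:
    """Build a bases-like mapping, skipping empty volumes entirely."""
    bases = {}
    for gid, images in volumes_with_images(image_groups).items():
        # Hypothetical per-base metadata; the real fields may differ.
        bases[gid] = {"image_group_id": gid, "page_count": len(images)}
    return bases
```

Filtering up front means an empty volume produces neither a base file nor an entry in the metadata, which matches both points raised above.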
okay sure
once you're done with the changes, can you reimport I517DCE99 ? (or remove it and reimport W8LS31241 in a new opf)
okay I will make the changes and create a new opf for it and remove the old one
This work should have 10 image groups, but the buda_data returned by get_buda_scan_info(work_id) in the BUDA API has only 6, and the information for one of the image groups is missing.
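A quick sanity check for this kind of mismatch could look like the sketch below. It assumes buda_data is a dict with an `image_groups` mapping keyed by image group id; that shape is an assumption for illustration, and `check_image_groups` is a hypothetical helper, not part of the BUDA API code.

```python
from typing import Any, Dict, List


def check_image_groups(buda_data: Dict[str, Any], expected_count: int) -> List[str]:
    """Return a list of human-readable problems found in buda_data."""
    problems = []
    # Assumption: image groups live under the "image_groups" key.
    image_groups = buda_data.get("image_groups") or {}
    if len(image_groups) != expected_count:
        problems.append(
            f"expected {expected_count} image groups, found {len(image_groups)}"
        )
    for group_id, info in image_groups.items():
        # An empty or missing info record is reported per image group.
        if not info:
            problems.append(f"image group {group_id} has no information")
    return problems
```

Running this before the import starts would surface both symptoms reported here (too few image groups, and one with missing information) in one place.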
thanks! Just pushed a fix
works fine now thanks
OPF link: the OPF for work_id W8LS31241 from google_books, generated with hocr-formatter.
thanks, it works well! https://library.bdrc.io/show/bdr:IE0OPIA6AC9CE2
We should update HOCRBDRCFileProvider with the following changes:
s3://ocr.bdrc.io/Works/66/W1KG10193/google_books/batch_2022/
instead of
we now have
where
html.zip
contains
Or well, that's the theory; in practice this is blocked by https://github.com/buda-base/ao-google-books/issues/38 .
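Once that is unblocked, reading pages out of html.zip could be sketched as below. The contents of the archive are not shown in this thread, so the one-.html-file-per-page layout is purely an assumption, and `read_html_members` is a hypothetical helper rather than an existing HOCRBDRCFileProvider method.

```python
import io
import zipfile
from typing import Dict


def read_html_members(zip_bytes: bytes) -> Dict[str, str]:
    """Return {member name: decoded text} for every .html file in the zip."""
    pages = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        # Sort names so pages come back in a stable order.
        for name in sorted(archive.namelist()):
            if name.lower().endswith(".html"):
                pages[name] = archive.read(name).decode("utf-8")
    return pages
```

Taking bytes rather than a path keeps the helper independent of where the archive comes from (S3 download, local cache, test fixture).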
@ta4tsering can you :
?