Open 10zinten opened 1 year ago
Ah ok yes. It's an artefact in BDRC's database, see https://github.com/OpenPecha/Toolkit/blob/master/openpecha/buda/api.py#L189
Do we really have to reinvent the wheel every time? It looks like every time something starts working we throw it away and start a new code base that just reproduces each and every bug that we fixed in the first code...
But `GoogleVisionBDRCFileProvider` is not using this to convert the image group to a folder name. As a result, the formatter can't find the images, so we get empty text files.
So even if we run Google OCR manually, this issue will persist: we have to use the latest Google Vision formatter, but it can't find the OCR outputs because of the image group to folder name conversion.
Do I really need to fix it myself, or can you do it? If I need to do it I'll rewrite a lot of the code, but that's fine, BDRC really needs a way to run OCR.
@eroux I will look into it ASAP. It's on me, as I was there during the changes. @10zinten isn't familiar with this part of the code.
In the case of `W14322`, the image downloader saves images under the image group with the `I` prefix, while the Google Vision formatter looks for them under the image group without the `I` prefix. For example:

`/home/django/app_data/ocr_pipeline/data/ocr_outputs/W14322/W14322-I5602/56020483.json.gz` -> saved by image downloader

`/home/django/app_data/ocr_pipeline/data/ocr_outputs/W14322/W14322-5602/56020483.json.gz` -> looked for by Google Vision Formatter
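
To make the mismatch concrete, here is a minimal sketch of a single shared helper that both the downloader and the formatter could call, so they always agree on the folder name. The function names `image_group_folder_name` and `ocr_output_path` are hypothetical, and the rule itself (drop the `I` prefix only when it is followed by exactly four digits) is an assumption inferred from the `W14322-I5602` / `W14322-5602` example above. It should be checked against the linked `openpecha/buda/api.py` before being relied on.

```python
from pathlib import Path


def image_group_folder_name(scan_id: str, image_group_id: str) -> str:
    """Return the folder name for an image group (hypothetical helper).

    Assumed legacy rule: an ID of the form "I" + exactly four digits
    loses its "I" prefix in the folder name; any other ID is kept as-is.
    """
    prefix, rest = image_group_id[:1], image_group_id[1:]
    if prefix == "I" and len(rest) == 4 and rest.isdigit():
        return f"{scan_id}-{rest}"
    return f"{scan_id}-{image_group_id}"


def ocr_output_path(base_dir: str, scan_id: str, image_group_id: str, filename: str) -> Path:
    """Build the OCR output path using the shared folder-name rule."""
    folder = image_group_folder_name(scan_id, image_group_id)
    return Path(base_dir) / scan_id / folder / filename


if __name__ == "__main__":
    base = "/home/django/app_data/ocr_pipeline/data/ocr_outputs"
    print(ocr_output_path(base, "W14322", "I5602", "56020483.json.gz"))
```

If the code that writes the OCR outputs and the code that reads them both build paths through the same function, the two paths listed above can no longer diverge, whichever naming rule turns out to be the right one.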