emesterhazy / glossika-to-anki

Convert Glossika PDFs and audio files into Anki decks
MIT License

Cantonese Level 1 - PDF to TSV extraction missing sentences #16

Closed awhlam closed 2 years ago

awhlam commented 3 years ago

I ran the PDF to TSV extraction for Cantonese.

Levels 2 and 3 extracted all 1,000 sentences fine, but Level 1 was missing 3 sentences in the output TSV (I had modified the Python script so that it would not abort).

Below is an example of a sentence that was missed:

Glossika PDF: [screenshot: Capture]

Output TSV extraction: [screenshot: Capture2]

Sentence #89 - has the Yale for sentence #90 instead
Sentence #90 - is missing

I think it might be an issue with the regular expressions? Let me know if you have any ideas or if you need more examples.

awhlam commented 3 years ago

The other two missing sentences have a similar issue to the one above:

Sentence #253 - Are you sitting on the floor?
Sentence #255 - Are you feeling all right?
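
For anyone trying to locate gaps like these in their own output, here is a minimal sketch of a check. It assumes the first column of the TSV holds the sentence number, which is a guess about the output layout, not something confirmed from the script. The function name `missing_sentence_numbers` is made up for illustration.

```python
import csv
import io


def missing_sentence_numbers(tsv_text, expected=1000):
    """Return the sentence numbers in 1..expected that have no row in the TSV text."""
    seen = set()
    # QUOTE_NONE so quotation marks inside sentences are treated as plain text.
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Assumption: column 0 is the sentence number; header or blank rows are skipped.
        if row and row[0].strip().isdecimal():
            seen.add(int(row[0]))
    return [n for n in range(1, expected + 1) if n not in seen]
```

Calling it on the contents of the Level 1 TSV with `expected=1000` would list the numbers that never appear. It would not catch a row such as #89 above, which is present but carries the wrong Yale line.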

emesterhazy commented 3 years ago

Could you run pdftotext on the PDF and share a snippet that includes the missing sentences?

The command should be something like:

pdftotext -layout -enc UTF-8 file_name.pdf output.txt
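
If it helps, the same step can be scripted. This is only a sketch: it wraps the command above with `subprocess` and pulls a few lines of context around a search string from the resulting text. The helper names (`pdftotext_command`, `snippet_around`, `extract_snippet`) are hypothetical and not part of this project, and `pdftotext` must be installed and on the PATH.

```python
import subprocess


def pdftotext_command(pdf_path, txt_path):
    """Build the argument list for the pdftotext call shown above."""
    return ["pdftotext", "-layout", "-enc", "UTF-8", pdf_path, txt_path]


def snippet_around(text, needle, context=5):
    """Return the first line containing `needle` plus `context` lines on each side."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if needle in line:
            start = max(0, i - context)
            return "\n".join(lines[start:i + context + 1])
    return None  # needle not found in the extracted text


def extract_snippet(pdf_path, txt_path, needle, context=5):
    """Run pdftotext, then return the lines surrounding `needle` in its output."""
    subprocess.run(pdftotext_command(pdf_path, txt_path), check=True)
    with open(txt_path, encoding="utf-8") as f:
        return snippet_around(f.read(), needle, context)
```

For example, `extract_snippet("file_name.pdf", "output.txt", "Are you sitting on the floor?")` would return the lines surrounding that sentence, assuming it appears on a single line in the extracted text.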

awhlam commented 3 years ago

@emesterhazy - Sure, here is the snippet of the pdftotext output that corresponds with the screenshot in my first post above.

emesterhazy commented 2 years ago

Closing due to archival of project. Best of luck!