Error extracting sentences from Japanese PDFs

emesterhazy / glossika-to-anki

Convert Glossika PDFs and audio files into Anki decks

MIT License

Error extracting sentences from Japanese PDFs #2

Closed WannabeNihonjin closed 6 years ago

WannabeNihonjin commented 6 years ago

I really have no clue what I'm doing wrong, and I've been working on this for 2+ hours. I have the mp3split folder in the Python folder, and when I run glossika_split_audio, all that happens is that a Python window pops up for half a second and then goes away.

Please help

WannabeNihonjin commented 6 years ago

I found someone who had used this before, and he got the audio files to split with no problem (I still don't know what I did wrong). But now the genanki script doesn't do anything when I have both the split files and the txt documents in the output folder. I have no clue what to do.

deepakjois commented 6 years ago

So I am the person who has been helping @WannabeNihonjin

The Glossika PDFs throw an error during the extraction step. Here is a sample output on my setup:

$ python glossika_extract_pdf.py 

Processing 3 files...

Processing  GLOSSIKA-ENJA-F1-EBK.pdf...error
Something went wrong...found 0 sentences instead of 1,000
Processing  GLOSSIKA-ENJA-F2-EBK.pdf...error
Something went wrong...found 0 sentences instead of 1,000
Processing  GLOSSIKA-ENJA-F3-EBK.pdf...error
Something went wrong...found 0 sentences instead of 1,000
PDF extract complete!

It looks like it needs a fix similar to the one you made earlier for Cantonese.

emesterhazy commented 6 years ago

Thanks @deepakjois. The script identifies the beginning of each sentence by looking for a character or set of characters that indicates the start of a phrase.

Based on the information you shared, I think the "日" character on this line should be changed to "JA".

'JA': ['EN', '日', 'ROM']  # Japanese (before)
'JA': ['EN', 'JA', 'ROM']  # Japanese (after)

Can you test this change and let me know if it works? I'll investigate whether some versions of the PDFs use 日 instead of JA before pushing a fix to the repo.
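To illustrate why a single wrong marker can produce "found 0 sentences", here is a minimal sketch of marker-based sentence detection. It is not the code from glossika_extract_pdf.py: the function name `extract_sentences`, the sample lines, and the assumption that each extracted line starts with its marker followed by a space are all hypothetical.

```python
# Hypothetical sketch: none of these names come from the repository.
# Assumption: each extracted line begins with a marker such as "EN",
# "JA" or "ROM", followed by a space and then the phrase itself.

def extract_sentences(lines, markers):
    """Group lines into sentences; keep only sentences with every marker."""
    sentences = []
    current = None
    for line in lines:
        label, _, text = line.partition(' ')
        if label not in markers:
            continue  # unknown marker: the line is silently skipped
        if label == markers[0]:
            # The first marker signals the start of a new sentence.
            current = {}
            sentences.append(current)
        if current is not None:
            current[label] = text.strip()
    # A sentence counts only if all of its markers were found.
    return [s for s in sentences if all(m in s for m in markers)]


sample = [
    'EN The weather is nice today.',
    'JA 今日は天気がいいです。',
    'ROM Kyou wa tenki ga ii desu.',
]

before = extract_sentences(sample, ['EN', '日', 'ROM'])   # marker not in text
after = extract_sentences(sample, ['EN', 'JA', 'ROM'])    # marker matches

print(len(before), len(after))
```

In this sketch the '日' marker never matches a line, so no sentence is ever complete, while the 'JA' marker matches and the sentence is kept. The real script may detect and validate sentences differently.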

deepakjois commented 6 years ago

I made that change in the script here, and it seems to work:

$ python glossika_extract_pdf.py 

Processing 3 files...

Processing  GLOSSIKA-ENJA-F1-EBK.pdf...complete
Processing  GLOSSIKA-ENJA-F2-EBK.pdf...complete
Processing  GLOSSIKA-ENJA-F3-EBK.pdf...complete
PDF extract complete!

@WannabeNihonjin please check your email for the Anki deck.

WannabeNihonjin commented 6 years ago

Thank you both for all the help. The deck works perfectly.

emesterhazy commented 6 years ago

This is fixed now in the commit above.