emesterhazy / glossika-to-anki

Convert Glossika PDFs and audio files into Anki decks
MIT License
32 stars, 8 forks

New to python and having problems understanding the process #17

Closed Devilgirl666 closed 2 years ago

Devilgirl666 commented 2 years ago

Hi, I'm having trouble understanding how to use Python and this code to create an Anki deck. Mainly, I don't know how to name my Glossika PDF and audio files or where to put them. Could you make the readme a little more descriptive, please?

emesterhazy commented 2 years ago

It has been a while since I wrote this code and I agree, the readme could be more descriptive. I wrote the instructions thinking that users would have the original Glossika files. If your files have the original names Glossika shipped them with, everything should just work.

For the PDFs, if you run python3 glossika_extract_pdf.py, the script will create a directory called glossika_source/pdf where you can copy your PDF files. The files should be named like GLOSSIKA-ENZS-F1-EBK.pdf. The "ENZS" in this name indicates English-Mandarin, so your PDFs may have a different name depending on your language combination.

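Keep in mind that the scripts currently support only Simplified and Traditional Chinese, Cantonese, and Japanese. If you have a different language, you'll need to update the PDF script here: https://github.com/emesterhazy/glossika-to-anki/blob/1ccc81d89fde3474ad0a2f5b84ccf65d21b44fa7/glossika-to-anki/glossika_extract_pdf.py#L16-L22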

The audio files should be named something like ENZS-F1-GMS-C-0001.mp3. Again, the "ENZS" may vary depending on your language. Make sure you are using the GMS-C audio files. You'll need to copy the audio files into the glossika_source/audio directory that the audio script creates the first time you run it.
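As a rough pre-flight check, the naming rules above could be sketched in Python like this. The two regular expressions are assumptions inferred only from the example names in this comment (GLOSSIKA-ENZS-F1-EBK.pdf and ENZS-F1-GMS-C-0001.mp3); they are not the patterns the scripts themselves use, and check_names is a hypothetical helper.

```python
import re
from pathlib import Path

# Assumed patterns, inferred only from the two example names in this thread.
PDF_NAME = re.compile(r"^GLOSSIKA-EN[A-Z]+-F\d-EBK\.pdf$")
AUDIO_NAME = re.compile(r"^EN[A-Z]+-F\d-GMS-C-\d{4}\.mp3$")


def check_names(source_dir="glossika_source"):
    """Return files under pdf/ and audio/ whose names do not match the patterns."""
    root = Path(source_dir)
    problems = []
    for subdir, pattern in (("pdf", PDF_NAME), ("audio", AUDIO_NAME)):
        folder = root / subdir
        if not folder.is_dir():
            continue  # the directory has not been created yet
        for path in sorted(folder.iterdir()):
            if path.is_file() and not pattern.match(path.name):
                problems.append(path.relative_to(root).as_posix())
    return problems


if __name__ == "__main__":
    for name in check_names():
        print("Unexpected file name:", name)
```

An empty result only means the names look like the examples above; it does not guarantee that the scripts will accept them.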

Let me know how it goes :)

Devilgirl666 commented 2 years ago

Thanks so much for getting back to me. Sorry to bother you, by the way, but your code is the only useful thing I could find for converting Glossika into Anki. Before I start, which Xpdf program should I download? Just the XpdfReader? Also, for mp3splt, just Mp3split (2.6.2)? And how do I download genanki? As for adding a language, what do I add for French and Russian? For instance, 'FR' : ['EN', ...? ], # French. What goes in the middle? Thanks again!

emesterhazy commented 2 years ago

For the audio, mp3splt 2.6.2 will work. Newer versions will probably work as well, but I haven't tested them. For the PDFs you need pdftotext. The readme has instructions for installing it on Windows, macOS, and Linux; hopefully they are still accurate. Which OS are you using?
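If it helps, here is a small Python sketch for checking whether the two command-line tools can be found. It assumes the executables are called pdftotext (the Xpdf command-line tool) and mp3splt, and that they are looked up by name on your PATH; I have not checked either assumption against the scripts. find_tools is a hypothetical helper.

```python
import shutil

# Executable names assumed from the tools discussed in this thread.
REQUIRED_TOOLS = ("pdftotext", "mp3splt")


def find_tools(names=REQUIRED_TOOLS):
    """Map each tool name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools().items():
        print(f"{name}: {path or 'NOT FOUND on PATH'}")
```

If a tool shows up as not found, adding the folder that contains its executable to the PATH environment variable is the usual fix, whichever folder that is.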

As far as adding the language goes, the dictionary key is the language identifier in the PDF name (e.g. "ZS" for Simplified Chinese). The list that you see, ['EN', '简', 'PIN'] for example, is a list of regexes matching the labels that precede each sentence in the PDF.

    languages = {
        'ZS': ['EN', '简', 'PIN'],  # Simplified Chinese
        'ZH': ['EN', '繁', 'PIN'],  # Traditional Chinese
        'ZT': ['EN', '繁', 'PIN'],  # Traditional Chinese (Taiwan)
        'YUE': ['EN', '粵', 'YALE'],  # Cantonese | Change YALE to JYUT for Jyutping
        'JA': ['EN', '日|JA', 'ROM']   # Japanese
    }

[Screenshot: 2021-10-07_20-03_1]

Here's an example for Simplified Chinese. The script will extract the phrases after "EN", "简", and "PIN" separately so that they can be used in the Anki cards.
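To make the idea concrete, here is a simplified Python sketch of label-based extraction. It is not the actual logic in glossika_extract_pdf.py: language_key and extract_fields are hypothetical helpers, the filename pattern is assumed from the example name GLOSSIKA-ENZS-F1-EBK.pdf, and the sample lines are invented rather than taken from a real Glossika PDF.

```python
import re

languages = {
    'ZS': ['EN', '简', 'PIN'],  # Simplified Chinese (entry copied from this thread)
}


def language_key(pdf_name):
    """Pull the target-language code out of a name like GLOSSIKA-ENZS-F1-EBK.pdf."""
    match = re.match(r"GLOSSIKA-EN([A-Z]+)-", pdf_name)
    return match.group(1) if match else None


def extract_fields(lines, labels):
    """Collect the text that follows each label regex at the start of a line."""
    fields = {label: [] for label in labels}
    for line in lines:
        for label in labels:
            # Each label is treated as a regex, so '日|JA' style entries also work.
            match = re.match(rf"(?:{label})\s+(.*)", line)
            if match:
                fields[label].append(match.group(1).strip())
                break
    return fields


# Invented sample lines, NOT taken from a real Glossika PDF.
sample = [
    "EN The weather is nice today.",
    "简 今天天气很好。",
    "PIN jīntiān tiānqì hěn hǎo.",
]

if __name__ == "__main__":
    key = language_key("GLOSSIKA-ENZS-F1-EBK.pdf")
    print(key, extract_fields(sample, languages[key]))
```

For a new language, the list would hold whatever labels actually appear before the sentences in that PDF, which is why the right values cannot be guessed without seeing a page.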

So, how you update the languages dictionary will depend on the symbols / words used to identify the various sentences. If you can post a small screenshot of one of the sentences I might be able to give you a suggestion.

Devilgirl666 commented 2 years ago

Sorry, I'm using Windows 10 with Python 3.10 (64-bit), and I installed Mp3split 2.6.2. Do I add it and Pdftotext to AppData/Local/Programs/Python or to AppData/Local/Programs/Python/Python310? For Pdftotext, the link goes to XpdfReader.com, so do I download just the XpdfReader command line tools? Here's the example sentence. I think the Russian one might be in the older format, so it won't work.

[Attached photos: PXL_20211008_003548572, PXL_20211008_004358133]

emesterhazy commented 2 years ago

Closing due to archival of project. Best of luck!