jllking closed this issue 4 years ago
Oh, I see. I'm pretty new to programming, so I'm not sure if I'm gonna be able to pull it off. Wish me luck! :)
Take a look at how the current implementation works. Start by OCR'ing the PDFs and converting them to text files, which the existing code already does. After that, look for patterns that mark the beginning of a sentence in each language. The existing source code shows how that's done for the v2 PDFs (you'll need regex). The process should be pretty similar for the v1 PDFs. Feel free to comment here if you get stuck.
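To make the regex step a bit more concrete, here's a rough sketch in Python. It is not the project's actual code: the `EN:` / `DE:` markers, the `SENTENCE_START` pattern, and the `split_by_language` helper are all made up for illustration, so you'd need to swap in whatever patterns actually show up in the OCR'd v1 text.

```python
import re
from collections import defaultdict

# ASSUMPTION (hypothetical format): every sentence in the OCR'd text starts on
# a new line with a language tag such as "EN:" or "DE:". The real v1 PDFs will
# almost certainly need a different pattern.
SENTENCE_START = re.compile(r"^(EN|DE):\s*", re.MULTILINE)


def split_by_language(ocr_text):
    """Group sentences by the language tag that precedes them."""
    sentences = defaultdict(list)
    matches = list(SENTENCE_START.finditer(ocr_text))
    # Pair each marker with the next one; the text in between is one sentence.
    for current, following in zip(matches, matches[1:] + [None]):
        end = following.start() if following else len(ocr_text)
        body = ocr_text[current.end():end]
        # OCR output often wraps mid-sentence, so collapse all whitespace.
        sentences[current.group(1)].append(" ".join(body.split()))
    return dict(sentences)


sample = (
    "EN: The cat sleeps\n"
    "on the sofa.\n"
    "DE: Die Katze liegt\n"
    "auf dem Sofa.\n"
    "EN: It rains.\n"
    "DE: Es regnet.\n"
)

result = split_by_language(sample)
print(result["EN"])
print(result["DE"])
```

The idea is just to find every "start of sentence" marker first and then slice the text between consecutive markers, which copes with sentences that the OCR step wrapped over several lines.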
Closing this issue for now. Feel free to reopen it if you still need help.
It won't work out of the box, but the existing code should be a good template for the changes needed to support the v1 PDFs. If you'd like to add that support, please feel free to open a pull request :)