0x2447196 / raypeatarchive


Newsletters converted from PDF to text #9

Closed ghost closed 7 months ago

ghost commented 7 months ago

Converted the PDFs to text using the PyPDF2 Python library (I can upload the code for that too). Most documents read okay, but some transcriptions are completely botched. This mostly seems to come from the formatting of the newsletter, e.g. a three-column layout with ads. I can manually remove the faulty ones in the interest of a clean repo, but before I do that I wanted you to have a look at them.
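For reference, a conversion of this kind could look roughly like the sketch below. It is not the script used for this PR. It assumes PyPDF2 2.0 or later (`PdfReader`, `reader.pages`, `page.extract_text()`), and the helper names `join_pages`, `pdf_to_text` and `convert_folder` are made up for illustration.

```python
from pathlib import Path


def join_pages(page_texts):
    """Join per-page text, dropping pages where nothing was extracted."""
    cleaned = [t.strip() for t in page_texts if t and t.strip()]
    return "\n\n".join(cleaned) + "\n" if cleaned else ""


def pdf_to_text(pdf_path):
    """Extract the text of one PDF, page by page."""
    # Imported here so join_pages can be used without PyPDF2 installed.
    from PyPDF2 import PdfReader  # assumption: PyPDF2 >= 2.0

    reader = PdfReader(str(pdf_path))
    return join_pages(page.extract_text() for page in reader.pages)


def convert_folder(src_dir, dst_dir):
    """Write one .txt file per .pdf found in src_dir (hypothetical layout)."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for pdf in sorted(Path(src_dir).glob("*.pdf")):
        out_file = dst / (pdf.stem + ".txt")
        out_file.write_text(pdf_to_text(pdf), encoding="utf-8")
```

Plain `extract_text()` returns whatever text each page yields in the order PyPDF2 finds it, so a sketch like this does nothing to untangle multi-column pages or ads.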

Potential improvements:

ghost commented 7 months ago

On closer inspection, even the ones in which the text seemed okay get botched later on. Multiple words are joined together without spaces, and highlights and notes within the actual newsletter mess it up. I'm closing this PR and will raise a separate PR for each converted newsletter. Sorry for the trouble!
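One rough way to spot the joined-words problem before opening each per-newsletter PR is to measure how many tokens are implausibly long. The sketch below is only a heuristic under assumed settings: the 20-character cutoff, the 2% threshold, and the function names are all guesses for illustration, not anything tested against these newsletters.

```python
import re


def long_token_ratio(text, max_len=20):
    """Fraction of alphabetic tokens longer than max_len characters."""
    tokens = re.findall(r"[A-Za-z]+", text)
    if not tokens:
        return 0.0
    return sum(len(t) > max_len for t in tokens) / len(tokens)


def looks_botched(text, max_len=20, threshold=0.02):
    """Flag text where too many tokens look like several words run together."""
    # Both max_len and threshold are assumed values and would need tuning.
    return long_token_ratio(text, max_len) > threshold
```

A file flagged this way would still need a manual look, since long chemical or medical terms can trip the cutoff and short joined words slip under it.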