Turn pdfs into txt

We can use PyPDF2 (https://pypdf2.readthedocs.io/en/3.0.0/index.html) to easily extract text from a PDF, but the output will be noisy: it will contain captions, page numbers, and other text that is not part of the main body.
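A minimal sketch of the extraction step, assuming PyPDF2 3.0.0 is installed. `PdfReader`, `reader.pages` and `page.extract_text()` are PyPDF2 calls; the helper names `extract_pages`, `write_txt` and `pdf_to_txt` are made up for illustration and are not part of this repo.

```python
from pathlib import Path


def extract_pages(pdf_path):
    """Return one string per page of the PDF (raw, uncleaned text)."""
    # Imported here so the other helpers work even if PyPDF2 is missing.
    from PyPDF2 import PdfReader

    reader = PdfReader(str(pdf_path))
    # Fall back to "" for pages where no text could be extracted.
    return [page.extract_text() or "" for page in reader.pages]


def write_txt(pages, txt_path):
    """Join the page texts with blank lines and write them to a .txt file."""
    text = "\n\n".join(pages)
    Path(txt_path).write_text(text, encoding="utf-8")
    return text


def pdf_to_txt(pdf_path, txt_path=None):
    """Convert one PDF to a .txt file (hypothetical helper).

    If txt_path is not given, the output is written next to the PDF
    with the same name and a .txt extension.
    """
    pdf_path = Path(pdf_path)
    txt_path = Path(txt_path) if txt_path else pdf_path.with_suffix(".txt")
    write_txt(extract_pages(pdf_path), txt_path)
    return txt_path
```

Keeping the text as a list of pages until the last step makes it possible to clean each page separately before joining them.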

Page numbers could be removed quite easily (depending on the format of the PDF), but it will be difficult to find a standard solution that removes captions from any PDF.
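One possible heuristic for the page numbers, as a sketch only: look at the first and last non-empty line of each page and drop it if it is nothing but a number (optionally written like "Page 3 of 10", "3 / 10" or "- 3 -"). The pattern and the function name `strip_page_numbers` are assumptions; real PDFs may need a different pattern, and this does nothing about captions.

```python
import re

# Matches lines such as "12", "- 12 -", "Page 12", "12 / 40", "Page 12 of 40".
PAGE_NUMBER_LINE = re.compile(
    r"^\s*(?:-\s*)?(?:page\s+)?\d{1,4}(?:\s*(?:/|of)\s*\d{1,4})?(?:\s*-)?\s*$",
    re.IGNORECASE,
)


def strip_page_numbers(page_text):
    """Remove a page-number line from the top or bottom of one page's text."""
    lines = page_text.splitlines()
    non_empty = [i for i, line in enumerate(lines) if line.strip()]
    if not non_empty:
        return page_text
    # Only the first and last non-empty lines are candidates, so that
    # numbers standing alone in the middle of the page are left untouched.
    candidates = {non_empty[0], non_empty[-1]}
    kept = [
        line
        for i, line in enumerate(lines)
        if not (i in candidates and PAGE_NUMBER_LINE.match(line))
    ]
    return "\n".join(kept)
```

This would be applied to each page's text before the pages are joined into one file.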

David suggested just throwing the whole text into the model, without caring too much about cleaning it.

Daria-Oni / EcoHack-Babassu-bots

Turn pdfs into txt #12