Daria-Oni / EcoHack-Babassu-bots


Turn pdfs into txt #12

Closed Daria-Oni closed 1 month ago

Daria-Oni commented 1 month ago

How can we make sure that we keep only the important info? (For example, we don't want to copy image captions or other 'dirty' data.)

lucautunno commented 1 month ago

We can use PyPDF2 (https://pypdf2.readthedocs.io/en/3.0.0/index.html) to easily extract text from a PDF, but the output will be noisy, as it will contain captions, page numbers, etc.
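A minimal sketch of that extraction step, assuming PyPDF2 3.x is installed. The function names `extract_pages` and `pdf_to_txt` and the `extract` parameter are hypothetical, not something from this repo; the `extract` hook is only there so the writing step can be tried without a real PDF.

```python
from pathlib import Path


def extract_pages(pdf_path):
    """Return the raw text of each page, uncleaned (captions and page numbers included)."""
    # Imported here so the rest of the module works even without PyPDF2 installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(str(pdf_path))
    # extract_text() may give an empty result for image-only pages.
    return [page.extract_text() or "" for page in reader.pages]


def pdf_to_txt(pdf_path, txt_path, extract=extract_pages):
    """Write the text of every page to txt_path and return the number of pages."""
    pages = extract(pdf_path)
    # Separate pages with a blank line so page boundaries stay visible.
    Path(txt_path).write_text("\n\n".join(pages), encoding="utf-8")
    return len(pages)
```

Usage would look something like `pdf_to_txt("paper.pdf", "paper.txt")`, where both file names are placeholders.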

Page numbers could be removed quite easily (depending on the format of the PDF), but it will be difficult to find a standard solution for removing captions from any PDF.
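One possible way to drop page numbers, as a rough sketch: it assumes the number sits alone on its own line (e.g. `12`, `Page 3 of 10`, `3 / 10`), which will not hold for every PDF. The pattern and the helper name `strip_page_numbers` are illustrative assumptions, and any legitimate line that contains only a number would be dropped too.

```python
import re

# Matches a line that is only a page marker: "12", "Page 3", "Page 3 of 10", "3 / 10".
PAGE_NUMBER_LINE = re.compile(
    r"^\s*(?:page\s+)?\d{1,4}(?:\s*(?:/|of)\s*\d{1,4})?\s*$",
    re.IGNORECASE,
)


def strip_page_numbers(page_text):
    """Remove lines that look like stand-alone page numbers from extracted text."""
    kept = [
        line
        for line in page_text.splitlines()
        if not PAGE_NUMBER_LINE.match(line)
    ]
    return "\n".join(kept)
```

Numbers inside a sentence are left alone, since the pattern has to match the whole line.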

David suggested just throwing the whole text into the model, without worrying too much about cleaning it.