Closed Daria-Oni closed 1 month ago
We can use PyPDF2 (https://pypdf2.readthedocs.io/en/3.0.0/index.html) to easily extract text from a PDF, but the output will be noisy, since it will contain captions, page numbers, etc.
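A minimal sketch of the extraction step, assuming PyPDF2 3.x is installed. The function name `extract_pages` and the path argument are made up for illustration. Keeping the text per page (instead of one big string) should make later cleanup easier.

```python
def extract_pages(pdf_path):
    """Return a list with the extracted text of each page (hypothetical helper)."""
    # Imported lazily so this module still loads where PyPDF2 is not installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    # extract_text() may yield nothing for image-only pages, so fall back to "".
    return [page.extract_text() or "" for page in reader.pages]
```

Usage would be something like `pages = extract_pages("paper.pdf")`, where `paper.pdf` is a placeholder file name.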
Page numbers could be removed quite easily (depending on the format of the PDF), but it will be difficult to find a standard solution for removing captions from any PDF.
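A rough sketch of the page-number part, assuming the number sits alone on the first or last non-blank line of each page, in a form like `12`, `Page 3` or `3 of 10`. The pattern and the name `strip_page_numbers` are assumptions, and other layouts (for example numbers inside a running header) would need different rules.

```python
import re

# Assumed formats: "12", "Page 3", "3/10", "Page 3 of 10" alone on a line.
PAGE_NUMBER_LINE = re.compile(
    r"^\s*(?:page\s+)?\d{1,4}(?:\s*(?:/|of)\s*\d{1,4})?\s*$",
    re.IGNORECASE,
)


def strip_page_numbers(page_text):
    """Drop a page-number line from the top or bottom of one page's text."""
    lines = page_text.splitlines()
    non_blank = [i for i, line in enumerate(lines) if line.strip()]
    if not non_blank:
        return page_text
    # Only the first and last non-blank lines are candidates, so a bare
    # number in the middle of the page (e.g. a table cell) is left alone.
    candidates = {non_blank[0], non_blank[-1]}
    kept = [
        line
        for i, line in enumerate(lines)
        if not (i in candidates and PAGE_NUMBER_LINE.match(line))
    ]
    return "\n".join(kept)
```

This works on one page at a time, so it expects the text to still be split per page.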
David suggested just throwing the whole text into the model, without caring too much about cleaning it.
How do we make sure that we keep only the important info? (For example, that we don't copy image captions or other 'dirty' data.)
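One possible heuristic for captions, not a general solution: drop lines that start with a label such as `Figure 1:`, `Fig. 2.` or `Table 3.`. The pattern and the name `drop_caption_lines` are assumptions. It only catches the first line of a caption, and PDFs that label captions differently will slip through.

```python
import re

# Assumed caption style: "Figure 1:", "Fig. 2.", "Table 3." at the start of a line.
CAPTION_START = re.compile(
    r"^\s*(?:fig(?:ure)?\.?|table)\s*\d+\s*[:.]",
    re.IGNORECASE,
)


def drop_caption_lines(text):
    """Remove lines that look like the start of a figure or table caption."""
    return "\n".join(
        line for line in text.splitlines() if not CAPTION_START.match(line)
    )
```

In-text references such as "See Figure 1 for details." are kept, because the label has to be at the start of the line and be followed by a colon or period.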