Open SkanderHellal opened 1 year ago
I think this might be a good idea. Here's something I put together for OCRing random PDF files for Mick last week:
https://github.com/bitplane/ocr-pdf
It's a bit raw and hasn't been used in anger, but it's for a similar idea. Feel free to use the code :)
This might be a duplicate of https://github.com/LAION-AI/Open-Assistant/issues/1927
Also note that tools like pdfplumber or textract can be used for this task.
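For reference, a rough sketch of what the pdfplumber route could look like. It assumes pdfplumber is installed and that the PDF has an embedded text layer (pdfplumber reads that layer and does not OCR scanned pages). The helper names `join_pages` and `extract_pdf_text` are made up for illustration.

```python
from typing import Iterable, Optional


def join_pages(page_texts: Iterable[Optional[str]]) -> str:
    """Join per-page text, skipping pages that yielded no text."""
    return "\n\n".join(t.strip() for t in page_texts if t and t.strip())


def extract_pdf_text(path: str) -> str:
    """Return the text layer of a PDF as one string (hypothetical helper)."""
    # Imported here so the rest of the module loads without the dependency.
    import pdfplumber  # third-party: pip install pdfplumber

    with pdfplumber.open(path) as pdf:
        # extract_text() can return None or "" for pages without a text layer.
        return join_pages(page.extract_text() for page in pdf.pages)
```

Scanned papers would come back empty here, which is where an OCR step would still be needed.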
I would like to contribute to the project by extracting data from arXiv.
I would like to extract titles and abstracts, or other metadata that might be helpful.
I don't think extracting the full text of each paper is straightforward: we cannot control the text length, and we would have to extract the text with OCR or other techniques. I would therefore like to start with title and abstract extraction, and consider extracting figures and tables in a further step.
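A minimal sketch of the title/abstract step, assuming the metadata comes from the public arXiv API, which returns an Atom feed. The endpoint URL, the function names `parse_feed` and `fetch`, and the record layout are my assumptions for illustration, not something agreed on in this project.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom XML namespace
API_URL = "http://export.arxiv.org/api/query"  # assumed endpoint


def _clean(text):
    """Collapse the line breaks and runs of spaces found in feed text."""
    return " ".join((text or "").split())


def parse_feed(xml_data):
    """Turn an Atom feed (str or bytes) into a list of metadata dicts."""
    root = ET.fromstring(xml_data)
    records = []
    for entry in root.iter(ATOM + "entry"):
        records.append({
            "id": _clean(entry.findtext(ATOM + "id")),
            "title": _clean(entry.findtext(ATOM + "title")),
            "abstract": _clean(entry.findtext(ATOM + "summary")),
            "authors": [
                _clean(author.findtext(ATOM + "name"))
                for author in entry.findall(ATOM + "author")
            ],
        })
    return records


def fetch(search_query, max_results=10):
    """Query the API and parse the response (needs network access)."""
    params = urllib.parse.urlencode({
        "search_query": search_query,
        "start": 0,
        "max_results": max_results,
    })
    with urllib.request.urlopen(f"{API_URL}?{params}", timeout=30) as resp:
        return parse_feed(resp.read())
```

Something like `fetch("cat:cs.CL", max_results=5)` would then give a list of dicts with `id`, `title`, `abstract` and `authors`. Keeping parsing separate from fetching means the parser can be tested on saved feeds without hitting the API.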
What do you think of this approach?