LAION-AI / Open-Assistant

OpenAssistant is a chat-based assistant that understands tasks, can interact with third-party systems, and retrieve information dynamically to do so.
https://open-assistant.io
Apache License 2.0
37.1k stars 3.24k forks

Arxiv: Research papers #2076

Open SkanderHellal opened 1 year ago

SkanderHellal commented 1 year ago

I would like to contribute to the project by extracting data from Arxiv.

I would like to extract titles and abstracts, or other metadata that might be helpful.

I think extracting the full text of each research paper is not straightforward: we cannot control the text length, and we would need OCR or other techniques to extract the text. Therefore, I would like to start by extracting titles and abstracts, and then consider whether we can also extract figures and tables in a further step.
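
A minimal sketch of what the title-and-abstract step could look like, assuming the public arXiv API endpoint (`http://export.arxiv.org/api/query`), which returns an Atom feed. The function names `parse_arxiv_feed` and `fetch_arxiv` and the sample feed are made up for illustration, and the sample entry is placeholder text, not a real paper. This is only one possible way to do it, not a settled design for the project.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Atom namespace used by the arXiv API feed.
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}
API_URL = "http://export.arxiv.org/api/query"


def _clean(text):
    """Collapse line breaks and repeated spaces; treat a missing field as empty."""
    return " ".join((text or "").split())


def parse_arxiv_feed(xml_data):
    """Turn an Atom feed (str or bytes) into a list of id/title/abstract dicts."""
    root = ET.fromstring(xml_data)
    records = []
    for entry in root.findall("atom:entry", ATOM_NS):
        records.append({
            "id": _clean(entry.findtext("atom:id", namespaces=ATOM_NS)),
            "title": _clean(entry.findtext("atom:title", namespaces=ATOM_NS)),
            # The abstract is carried in the Atom <summary> element.
            "abstract": _clean(entry.findtext("atom:summary", namespaces=ATOM_NS)),
        })
    return records


def fetch_arxiv(search_query, max_results=10):
    """Query the API over the network and parse the response (hypothetical helper)."""
    params = urllib.parse.urlencode(
        {"search_query": search_query, "start": 0, "max_results": max_results}
    )
    with urllib.request.urlopen(f"{API_URL}?{params}", timeout=30) as response:
        return parse_arxiv_feed(response.read())


# Offline example with a placeholder feed, so no network access is needed.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/0000.00000v1</id>
    <title>A Made-Up
      Example Title</title>
    <summary>  This abstract is placeholder text,
      not a real paper. </summary>
  </entry>
</feed>"""

records = parse_arxiv_feed(SAMPLE_FEED)
print(records[0]["title"])
```

Since only metadata is read, no OCR is involved and the length of each record stays small, which matches the concern about full-text extraction.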

What do you think of this approach?

bitplane commented 1 year ago

I think this might be a good idea. Here's something I put together for OCRing random PDF files for Mick last week:

https://github.com/bitplane/ocr-pdf

It's a bit raw and hasn't been used in anger, but it's for a similar idea. Feel free to use the code :)

CloseChoice commented 1 year ago

This might be a possible duplicate: https://github.com/LAION-AI/Open-Assistant/issues/1927

Also note that tools like pdfplumber or textract can be used for this task.

Miserlou commented 1 year ago

Related: https://www.kaggle.com/datasets/Cornell-University/arxiv
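
If that dataset is used instead of scraping, reading titles and abstracts could be as simple as the sketch below. It assumes the metadata file is stored as one JSON object per line with `id`, `title`, and `abstract` keys; that layout is an assumption here and should be checked against the downloaded file. The helper name `iter_title_abstract` and the sample lines are hypothetical placeholders.

```python
import json


def iter_title_abstract(lines):
    """Yield id/title/abstract dicts from an iterable of JSON lines.

    Assumes each non-empty line is a JSON object with "id", "title"
    and "abstract" keys (an assumption about the file layout).
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        yield {
            "id": record.get("id"),
            # Collapse hard line breaks that may appear inside the text fields.
            "title": " ".join((record.get("title") or "").split()),
            "abstract": " ".join((record.get("abstract") or "").split()),
        }


# Placeholder lines standing in for the real file; not actual dataset rows.
sample_lines = [
    '{"id": "0000.00000", "title": "A Made-Up\\n  Title", "abstract": "  Placeholder\\nabstract text.\\n"}',
    "",
    '{"id": "0000.00001", "title": "Another Placeholder", "abstract": "More placeholder text."}',
]

rows = list(iter_title_abstract(sample_lines))
print(len(rows), rows[0]["title"])
```

For the real file, the same generator can be fed an open file handle, for example `with open(path, encoding="utf-8") as f: rows = iter_title_abstract(f)`, so the whole snapshot never has to be loaded into memory at once.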