SkywardAI / paper_gallery

Papers gallery for using LLMs ability over dataset
MIT License
1 stars 0 forks source link

Data extraction from PDFs/imgs to mark-down #2

Closed Aisuko closed 2 months ago

Aisuko commented 2 months ago

Reference article

Please check https://medium.com/gopenai/simple-ways-to-parse-pdfs-for-better-rag-systems-82ec68c9d8cd

Data extraction with LLM(if useful)

Dataset

https://huggingface.co/datasets/aisuko/table_sports

Delivery result

Kaggle notebook for extracting your data of PDFs to mark-down.

@moonxjz @cbh778899

cc: @Micost If you have time, help them implement the article on Kaggle and share with them

cbh778899 commented 2 months ago

Please check here for proceed raw data. Most multimodel GenAI can work with images directly, so no extraction step needed.

Aisuko commented 2 months ago

cc @moonxjz

Aisuko commented 2 months ago

We finish the data extraction to convert all the PDFs to images. The next step would be the data extraction from images.