Closed Aisuko closed 1 month ago
Please check here for proceed raw data. Most multimodel GenAI can work with images directly, so no extraction step needed.
cc @moonxjz
We finish the data extraction to convert all the PDFs to images. The next step would be the data extraction from images.
Reference article
Please check https://medium.com/gopenai/simple-ways-to-parse-pdfs-for-better-rag-systems-82ec68c9d8cd
Data extraction with LLM(if useful)
https://medium.com/gopenai/day-15-function-calling-and-data-extraction-with-llms-20f70570c44c
https://medium.com/@krtarunsingh/ai-and-llm-for-document-extraction-simplifying-complex-formats-with-ease-b3261b5be58e
Dataset
https://huggingface.co/datasets/aisuko/table_sports
Delivery result
Kaggle notebook for extracting your data of PDFs to mark-down.
@moonxjz @cbh778899
cc: @Micost If you have time, help them implement the article on Kaggle and share with them