DS4SD / docling

Get your documents ready for gen AI
https://ds4sd.github.io/docling
MIT License
9.5k stars, 452 forks

Support Excel files #258

Open ImadSaddik opened 1 week ago

ImadSaddik commented 1 week ago

Hello,

First of all, thank you for open-sourcing this fantastic project. It already offers a lot in its current state. I have a feature request: would it be possible to add support for Excel files in the near future?

I believe this would make the library even more complete. While there are some areas that could use improvement, I’m confident things will keep getting better over time. I’d love to hear your thoughts on this, and perhaps you're already considering Excel file support.

Thanks again,
SADDIK Imad

ViCtOr-dev13 commented 1 week ago

Hello @ImadSaddik, did you find a way to extract information from an Excel file? Is it possible to convert it to HTML or PDF so that it can be processed?

ImadSaddik commented 1 week ago

Hi @ViCtOr-dev13, so far docling does not support Excel files. If you want, you can use LangChain to load and parse the Excel docs, but I don't have a lot of experience with that.

psychicDivine commented 6 days ago

@ViCtOr-dev13, there are multiple options available. I'm not sure about your specific use case, but you could consider using LangChain's document loaders or LlamaIndex's readers like DocxReader (https://docs.llamaindex.ai/en/stable/api_reference/readers/file/#llama_index.readers.file.DocxReader).

PeterStaar-IBM commented 3 days ago

We need to leverage the openpyxl library.
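
For readers following along, here is a minimal sketch of what reading a workbook with openpyxl and turning each sheet into a Markdown table might look like. It is not the docling backend and not the code from any PR in this thread; the function names `rows_to_markdown` and `excel_to_markdown` are hypothetical, and treating the first row of each sheet as the header is an assumption that will not hold for every spreadsheet.

```python
from typing import Dict, Iterable, List, Sequence


def rows_to_markdown(rows: Iterable[Sequence[object]]) -> str:
    """Render rows as a Markdown table, treating the first row as the header (assumption)."""
    table: List[List[str]] = [
        # Empty cells become "", and "|" is escaped so it cannot break the table.
        ["" if cell is None else str(cell).replace("|", "\\|") for cell in row]
        for row in rows
    ]
    if not table:
        return ""
    width = max(len(row) for row in table)
    # Pad short rows so every line has the same number of columns.
    table = [row + [""] * (width - len(row)) for row in table]
    header, body = table[0], table[1:]
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join(["---"] * width) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


def excel_to_markdown(path: str) -> Dict[str, str]:
    """Map each sheet name to a Markdown table (hypothetical helper, needs `pip install openpyxl`)."""
    # Imported lazily because openpyxl is a third-party dependency.
    from openpyxl import load_workbook

    # data_only=True returns cached formula results instead of formula strings.
    workbook = load_workbook(path, read_only=True, data_only=True)
    try:
        return {
            sheet.title: rows_to_markdown(sheet.iter_rows(values_only=True))
            for sheet in workbook.worksheets
        }
    finally:
        workbook.close()
```

Calling `excel_to_markdown("report.xlsx")` (the file name is a placeholder) would return one Markdown string per sheet. Merged cells, multiple tables per sheet, and charts are not handled here, which is part of why covering all cases is hard.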

ImadSaddik commented 3 days ago

Indeed, it will be challenging to cover all cases, but if we can have something that improves over time, that is going to be good 😊

PeterStaar-IBM commented 3 days ago

@ImadSaddik Feel free to start with the implementation. I could also start with a simple backend and then we collaborate.

ImadSaddik commented 3 days ago

Sounds good, let's do it 👍🏻

PeterStaar-IBM commented 1 day ago

@ImadSaddik I started something in this PR: https://github.com/DS4SD/docling/pull/334

ImadSaddik commented 1 day ago

Thank you @PeterStaar-IBM for letting me know. I have been busy with work lately, but I will look into it once I get the time.

PeterStaar-IBM commented 1 day ago

@ImadSaddik Just waiting for a review of PR #334 now; it should be in sometime next week!

FYI: @dolfim-ibm @cau-git

ImadSaddik commented 1 day ago

@PeterStaar-IBM, I will test what you did and provide feedback.