NVIDIA-AI-IOT / jetson-copilot

A reference application for a local AI assistant with LLM and RAG
Apache License 2.0

RAG file reading - Only PDF could be read #6

Closed lenoardshannon closed 4 months ago

lenoardshannon commented 4 months ago

Hello tokk-nv

When I set out to create my own RAG index, I uploaded several file types, including .pdf, .docx, .xlsx, and .pptx.

[Screenshot from 2024-07-01 15-45-12]

It tells me I have to install openpyxl, python-pptx, and Pillow, but I already have these pip dependencies installed. Could you help me solve this issue?

tokk-nv commented 4 months ago

Hello lenoardshannon, thank you for reaching out.

Jetson Copilot relies on Llama Index's Simple Directory Reader to read documents of multiple file types, but it seems that it requires some additional Python packages to support each file type.

We would need to install those additional Python packages in the Docker container image, but for now, will you try the following method?

  1. Start the container without launching Jetson Copilot.

    cd jetson-copilot
    ./launch_dev.sh

  2. Once in the container, install the Python packages with pip, then start the Jetson Copilot.

    pip3 install openpyxl python-pptx Pillow
    streamlit run app.py

  3. Enable RAG and click on "➕ Build a new index" to re-try creating your index based on your .docx and .pptx files.

Supporting the Excel file format may require some additional things (link), so you may want to first try only the Word and PowerPoint files.
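For anyone who wants to test the Word and PowerPoint files outside the Streamlit UI first, a minimal sketch follows. It assumes a recent LlamaIndex release where `SimpleDirectoryReader` is importable from `llama_index.core` (older releases expose it from `llama_index` instead). The names `find_files`, `load_documents`, and the `./Documents` path are hypothetical and not part of Jetson Copilot.

```python
from pathlib import Path

# Only the file types suggested for a first attempt in this thread.
WORD_AND_POWERPOINT = (".docx", ".pptx")


def find_files(input_dir, exts=WORD_AND_POWERPOINT):
    """List files under input_dir whose extension is in exts (case-insensitive)."""
    wanted = {ext.lower() for ext in exts}
    return sorted(
        path
        for path in Path(input_dir).rglob("*")
        if path.is_file() and path.suffix.lower() in wanted
    )


def load_documents(input_dir, exts=WORD_AND_POWERPOINT):
    """Load only the selected file types with LlamaIndex's SimpleDirectoryReader."""
    # Imported lazily so find_files() still works if LlamaIndex is not installed.
    # Assumption: llama-index >= 0.10; older versions use `from llama_index import ...`.
    from llama_index.core import SimpleDirectoryReader

    files = find_files(input_dir, exts)
    if not files:
        return []
    reader = SimpleDirectoryReader(input_files=[str(path) for path in files])
    return reader.load_data()


if __name__ == "__main__":
    # "./Documents" is a placeholder; point it at your own folder.
    for path in find_files("./Documents"):
        print("would load:", path)
```

If `load_documents` raises an import error that names a package, that package is most likely the one still missing inside the container.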

I will try to create a new Docker container image and plan to update it, but in the meantime, it would be great if you could try the method above and let me know how it goes.

Regards, Chitoku

lenoardshannon commented 4 months ago

Hello tokk-nv

Thanks for answering my issue.

I followed the steps you mentioned and installed some additional dependencies that the terminal prompted me for.

  1. python-pptx and Pillow:

    pip install torch transformers python-pptx Pillow

  2. docx2txt and openpyxl:

    pip install docx2txt openpyxl

After this, the RAG index could read .pdf, .docx, and .pptx files, but the Excel issue still needs your update.
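The packages named in this thread can be checked from inside the container with a short script. This is only a sketch: the extension-to-package mapping is an assumption pieced together from the pip commands in this thread, not an official LlamaIndex list, and `EXTRA_PACKAGES`, `is_importable`, and `missing_packages` are hypothetical names.

```python
from importlib.util import find_spec

# Assumed mapping, based only on the pip commands in this thread:
# file extension -> list of (pip package name, Python import name).
EXTRA_PACKAGES = {
    ".docx": [("docx2txt", "docx2txt")],
    ".pptx": [
        ("python-pptx", "pptx"),
        ("Pillow", "PIL"),
        ("torch", "torch"),
        ("transformers", "transformers"),
    ],
    ".xlsx": [("openpyxl", "openpyxl")],
}


def is_importable(module_name):
    """Return True if the top-level module can be found in this environment."""
    return find_spec(module_name) is not None


def missing_packages(extensions, is_installed=is_importable):
    """Return pip package names that appear to be missing for the given extensions."""
    missing = []
    for ext in extensions:
        for pip_name, import_name in EXTRA_PACKAGES.get(ext.lower(), []):
            if not is_installed(import_name) and pip_name not in missing:
                missing.append(pip_name)
    return missing


if __name__ == "__main__":
    needed = missing_packages([".pdf", ".docx", ".xlsx", ".pptx"])
    if needed:
        print("pip3 install " + " ".join(needed))
    else:
        print("All packages in the assumed mapping appear to be installed.")
```

Running it on the host and again inside the container can show whether the packages were installed in the wrong environment, which would explain an "already installed" package still being reported as missing.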

Anyway, thanks for updating this project. I hope you are doing well.

Best regards, Leonard

tokk-nv commented 4 months ago

Hi Leonard,

We have updated this repo with the fix (https://github.com/NVIDIA-AI-IOT/jetson-copilot/pull/8) and the container image (link).

Please pull the latest version of this repo and try building indexes with .docx, .xlsx, and .pptx files.

lenoardshannon commented 4 months ago

Hi tokk-nv

I just found that Copilot can now parse .docx, .xlsx, and .pptx files.

Thanks for updating the Docker image!

Leonard

tokk-nv commented 4 months ago

Thank you, Leonard, for confirming!