Document support for filetypes other than pdf, such as .csv, .xlsx, .docx, .ppt

zhongshuai-cao commented 1 year ago

This issue is for a: (mark with an `x`)

- [ ] bug report -> please search issues before submitting
- [x] feature request
- [ ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)

Updated Oct 4 2023

This issue has two main components:

- Understanding how to parse each filetype differently.
- Determining how to display these files in the AnalysePanel's document viewer.

Currently, I'm grappling with displaying Excel files. The existing iframe supports only PDFs, and I'm exploring methods to render an Excel file from the storage account directly in the browser. This challenge will eventually encompass other Microsoft file formats like docx and ppt.

While I acknowledge there's much to be done in terms of developing the parser, any recommendations or insights on React components that can render Excel and other Microsoft file formats would be greatly valued.

Mention any other details that might be useful

This is a really cool demo; I like it a lot and appreciate all the work! In the demo, only PDF files are supported. Is the intent to convert all other formats to PDF as a preprocessing step, or is there a plan to support other document formats? I think MS Office file types such as docx, Excel, and ppt are also important to businesses. If these files are supported, could you update the sample data/documentation to reflect that?

Thank you!

PTTrazavi commented 1 year ago

I modified the code in prepdocs.py so that it can upload txt files.

Add this new argument:

parser.add_argument("--txtparser", action="store_true", help="parse txt files, note that all files in this transaction must be txt files.")

Modify the code in get_document_text:

def get_document_text(filename):
    offset = 0
    page_map = []
    if args.txtparser:
        with open(filename, "r") as f:
            text = f.read()
            page_map.append((0, offset, text))
            offset += len(text)
    elif args.localpdfparser:
        reader = PdfReader(filename)
        pages = reader.pages
        for page_num, p in enumerate(pages):
            ......

It would be great if someone could help with MS Word and PPT files.
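The change described above can be sketched as a small self-contained module. This is only an illustration: `build_parser` and `get_text_page_map` are hypothetical names, and the `(page_num, offset, text)` tuple layout is taken from the snippet in this comment rather than from the current prepdocs.py. Reading with UTF-8 is an assumption about the input files.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the --txtparser switch (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Sketch of the --txtparser flag")
    parser.add_argument("files", nargs="+", help="files to parse")
    parser.add_argument(
        "--txtparser",
        action="store_true",
        help="parse txt files; all files in this run must be txt files",
    )
    return parser


def get_text_page_map(filename: str) -> list[tuple[int, int, str]]:
    """Treat the whole text file as a single page starting at offset 0."""
    # Assumption: the file is UTF-8 encoded.
    text = Path(filename).read_text(encoding="utf-8")
    return [(0, 0, text)]
```

Because `--txtparser` is a `store_true` switch, it takes no value: passing `--txtparser` sets `args.txtparser` to `True`, and omitting it leaves it `False`.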

MustafaMS365PK commented 1 year ago

I re-deployed the code today after deleting all the out-of-the-box PDF files and adding my own tab-delimited file with a .txt extension, and the code failed with the following error. Did I miss something?

azure.core.exceptions.HttpResponseError: (InvalidRequest) Invalid request. Code: InvalidRequest Message: Invalid request. Inner error: { "code": "InvalidContent", "message": "The file is corrupted or format is unsupported. Refer to documentation for the list of supported formats." }

ERROR: failed running post hooks: 'postprovision' hook failed with exit code: '1', Path: './scripts/prepdocs.sh'. : exit status 1

PTTrazavi commented 1 year ago

After you add the --txtparser argument in prepdocs.py, add --txtparser to the command in prepdocs.ps1 and execute it.

Start-Process -FilePath $venvPythonPath -ArgumentList "./scripts/prepdocs.py $cwd/data/* --storageaccount $env:AZURE_STORAGE_ACCOUNT --container $env:AZURE_STORAGE_CONTAINER --searchservice $env:AZURE_SEARCH_SERVICE --index $env:AZURE_SEARCH_INDEX --formrecognizerservice $env:AZURE_FORMRECOGNIZER_SERVICE --tenantid $env:AZURE_TENANT_ID --txtparser -v" -Wait -NoNewWindow

Remove all PDF files from the data folder and put only txt files in it; then it should work.
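Since this approach requires every file in the data folder to be a txt file, a quick check before running the script may save a failed run. The helper below is a hypothetical sketch, not part of the repository; `find_non_txt_files` is an illustrative name.

```python
from pathlib import Path


def find_non_txt_files(data_dir: str) -> list[str]:
    """Return the sorted names of files in data_dir that do not end in .txt."""
    return sorted(
        p.name
        for p in Path(data_dir).iterdir()
        if p.is_file() and p.suffix.lower() != ".txt"
    )
```

If the returned list is non-empty, those files would need to be moved out of the folder before running with --txtparser.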

tickx-cegeka commented 1 year ago

Office files are supported by Form Recognizer only in REST API version "2022/06/30-preview". https://learn.microsoft.com/en-us/azure/applied-ai-services/form-recognizer/concept-layout?view=form-recog-3.0.0#input-requirements

I can't find where to change to this version. Can someone help? Thanks!

PTTrazavi commented 1 year ago

@tickx-cegeka I think the answer to your question is here: line 12 of "infra/core/search/search-services.bicep".

tickx-cegeka commented 1 year ago

> @tickx-cegeka I think the answer to your question is here. line 12 of "infra/core/search/search-services.bicep"

@PTTrazavi, editing that line makes the deployment fail. I think that is because the line belongs to the search service, whereas the version I sent was for Form Recognizer. I also tried editing the version in requirements.txt, but I got an error again. I think that preview version is just not supported anymore. I believe the best way forward is to convert ppt/docx/... to PDF first in some way?

DSgUY commented 1 year ago

What is the best practice for uploading data from URL scraping? Keep the HTML tags or remove them? How can I populate tags in the indexer? Regards.

quidba7 commented 1 year ago

Hi, how should we modify prepdocs.py and prepdocs.ps1 or prepdocs.sh in order to add a CosmosDB source?

pamelafox commented 1 year ago

@quidba7 I don't know of any PRs that add CosmosDB support to this repository, but there are other OpenAI samples out there that access CosmosDB. Here's an example from my colleague @sethjuarez who uses CosmosDB with PromptFlow in a demo: https://github.com/sethjuarez/contoso-trek

You could also check if the llamaindex package can be used to ingest CosmosDB as a source.

cwill83247 commented 1 year ago

@PTTrazavi thanks for adding this post. I am following your suggestions, but I keep getting various errors, for example "positional parameter cannot be found that accepts argument --txtparser". I am clearly missing something. Could you share your code for prepdocs.py and prepdocs.ps1, or the relevant parts? It would be very much appreciated.

zhongshuai-cao commented 1 year ago

I just went through the react-doc-viewer repo and found this issue reporting that Excel files loaded from a blob cannot be rendered. I would really appreciate any guidance on rendering them. For now I have to work on an Excel-to-CSV conversion in order to render the source...

https://github.com/cyntler/react-doc-viewer/issues/61#issuecomment-1333792820

github-actions[bot] commented 10 months ago

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this issue will be closed.

oracle-code commented 9 months ago

What code do we need to add or change to make the app handle all types of documents in the same folder: pdf, doc, docx, ppt, txt, csv?

pamelafox commented 9 months ago

We will soon be merging two relevant PRs:

- https://github.com/Azure-Samples/azure-search-openai-demo/pull/1195 - Support for JSONParser
- https://github.com/Azure-Samples/azure-search-openai-demo/pull/1159 - Support for Integrated Vectorization

Once those are merged, it will be easier for you to define local parsers (like the JSONParser), and it will also be possible to use Azure AI Search integrated vectorization, which does document cracking in the cloud and supports more than just PDFs. We will also soon enable Document Intelligence to parse a lot more formats: https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/concept-layout?view=doc-intel-4.0.0#input-requirements

Stay tuned!

oracle-code commented 9 months ago

This is what I don't understand: the documentation mentions that Document Intelligence already handles reading docx, for example. Last night I tried following the documentation to parse a docx file, but two problems came up. First, it only accepts a URL, but my file is either in blob storage or saved locally on my machine. Second, it reports that this file format is not yet handled...

pamelafox commented 9 months ago

@hammadilyes I'm experiencing the same issue. We may need to update to the new Document Intelligence SDK; I'm chatting with their team about it today.

pamelafox commented 9 months ago

Confirmed: https://github.com/Azure/azure-sdk-for-python/issues/33278 I'll make a branch using the new SDK today.

pamelafox commented 9 months ago

See this PR for the changes needed to parse docx, pptx, and xlsx with Azure Document Intelligence:

https://github.com/Azure-Samples/azure-search-openai-demo/pull/1224

zhongshuai-cao commented 9 months ago

Fantastic to hear! I'm curious if the front end is capable of displaying the original document in the same way it does for PDFs. It appears that office files stored in private blob storage are unable to utilize Microsoft's online viewer for content display.

pamelafox commented 9 months ago

Good question. At least in Edge, it will not render in the iframe and will instead download:

That's the same experience that happens if I try to view the xlsx in Edge from my desktop. So I don't think it's about Blob storage, it's about how Edge views an xlsx file. If you stored a link in the index to the cloud-based file, like in OneDrive, then that could open in online Excel. Alternatively you could make previews of the files as PDFs or images, if it is important to view them in the browser.

zc4242 commented 9 months ago

> Good question. At least in Edge, it will not render in the iframe and will instead download:
>
> That's the same experience that happens if I try to view the xlsx in Edge from my desktop. So I don't think it's about Blob storage, it's about how Edge views an xlsx file. If you stored a link in the index to the cloud-based file, like in oneDrive, then that could open in online Excel. Alternatively you could make previews of the files as PDFs or images, if it was important to see in the browser.

I see the challenges with the iframe limitation and have personally attempted to address this by experimenting with React Doc Viewer. Unfortunately, there's a known issue when trying to use Microsoft's online viewer through it, and it seems there aren't many effective alternatives for rendering Office documents correctly with non-Microsoft services. My last attempt involved splitting Excel worksheets into separate CSV files, each named by sheet and index, and feeding them to React Doc Viewer. For DOCX files, I converted them into text.
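The worksheet-splitting step can be sketched roughly as follows. This is not the code used in the experiment: `write_sheets_as_csv` and the `<stem>_<index>_<sheet>.csv` naming are assumptions made for illustration, and the sheets are passed in as plain Python rows, so loading the workbook itself (for example with a library such as openpyxl) is left out.

```python
import csv
import re
from pathlib import Path


def write_sheets_as_csv(
    sheets: dict[str, list[list[object]]], stem: str, out_dir: str
) -> list[Path]:
    """Write one CSV per worksheet, named <stem>_<index>_<sheet>.csv."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for index, (name, rows) in enumerate(sheets.items()):
        # Replace characters that are awkward in file names with underscores.
        safe_name = re.sub(r"[^A-Za-z0-9_-]+", "_", name)
        path = out / f"{stem}_{index}_{safe_name}.csv"
        with path.open("w", newline="", encoding="utf-8") as f:
            csv.writer(f).writerows(rows)
        written.append(path)
    return written
```

Keeping the sheet index in the file name preserves the worksheet order, which is otherwise lost once the sheets become separate files.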

Ultimately, while this approach has its merits, I agree that we might need to reconsider the necessity of rendering if the user's main intent is to simply answer their query or if they are content with downloading the files for local verification. This decision acknowledges the variance in user experience – for example, PDFs can be viewed directly in the browser, whereas other file formats prompt a download without immediate display.

pamelafox commented 9 months ago

Thanks for sharing your experiments! What do other developers in this thread want, UI-wise? We could replace the iframe with a simple download link in the case of files that end with known-unrenderable extensions, for a simple approach.
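The decision itself is small. Here is a sketch of the logic in Python for consistency with the rest of the thread, although the real change would live in the TypeScript frontend. The extension list and the `citation_view_mode` name are hypothetical.

```python
from pathlib import PurePosixPath

# Assumption: extensions that browsers download instead of rendering in an iframe.
UNRENDERABLE_EXTENSIONS = {".docx", ".pptx", ".xlsx"}


def citation_view_mode(path: str) -> str:
    """Return "download" for known-unrenderable files, otherwise "iframe"."""
    # Drop any URL fragment such as "#page=3" before checking the extension.
    without_fragment = path.split("#", 1)[0]
    suffix = PurePosixPath(without_fragment).suffix.lower()
    return "download" if suffix in UNRENDERABLE_EXTENSIONS else "iframe"
```

Anything not on the list falls through to the iframe, so PDFs keep their current behaviour.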

oracle-code commented 9 months ago

Sorry for asking again, but I am confused. Can we now use more than just PDFs? If yes, which files do we need to change in the code, and how? Much appreciated @pamelafox and the rest of the team. The work you guys are doing is GREAT!

pamelafox commented 9 months ago

@hammadilyes You can try out the new code by either manually copying in the changes from the pull request or by using the GitHub CLI to check it out locally:

gh pr checkout 1224

Then you need to run azd provision to create a Document Intelligence instance in the right region, and then ./scripts/prepdocs.sh. If there are any un-processed files in data that the new Doc Intel knows how to process, it should find them.

oracle-code commented 9 months ago

We use canadaeast for our region, but for Document Intelligence we would use canadacentral. Would that work?

pamelafox commented 9 months ago

The new documentintelligence SDK is only available in three regions, eastus, westus2, and westeurope, so you'd need to deploy to one of those regions. It's fine to have your resource group in one region and a resource inside in a different region, however.

oracle-code commented 9 months ago

OK, is Microsoft planning to make it available in Canada? Some companies and governments have concerns about having it outside of Canada.

pamelafox commented 9 months ago

I can't guarantee anything, but I generally expect to see it available in many more regions once it's past the preview stage. More on https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/concept-layout?view=doc-intel-4.0.0

oracle-code commented 9 months ago

Awesome! Keep it up, guys, you are doing a great job :)

pamelafox commented 9 months ago

It's now merged! See release notes for today:

https://github.com/Azure-Samples/azure-search-openai-demo/releases/tag/2024-02-15b

FranciscoMMendes commented 8 months ago

Hi, I have updated the project to enable Document Intelligence to read other types of files. However, when I try to upload a CSV file, I get an error stating:

azure.core.exceptions.HttpResponseError: (InvalidRequest) Invalid request. Code: InvalidRequest Message: Invalid request. Inner error: { "code": "InvalidContent", "message": "The file is corrupted or format is unsupported. Refer to documentation for the list of supported formats." }

I have already added .csv to the file_processors list in prepdocs.py, as shown below.

Do I need to make any further changes to the code to make this work?

pamelafox commented 8 months ago

Document Intelligence does not support CSV, as far as I know. However, you could use our simple local TextParser() instead:

".csv": FileProcessor(TextParser(), sentence_text_splitter),
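To illustrate how an extension-to-processor mapping like this dispatches files, here is a minimal stand-alone sketch. The `TextParser` and `FileProcessor` classes below are simplified stand-ins that only borrow the names from the line above, and `line_splitter` and `process_file` are hypothetical. The repository's real classes have their own interfaces.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Optional


class TextParser:
    """Stand-in local parser: returns the file contents as text."""

    def parse(self, path: str) -> str:
        # Assumption: the file is UTF-8 encoded.
        return Path(path).read_text(encoding="utf-8")


@dataclass
class FileProcessor:
    """Pairs a parser with a splitter, mirroring the shape shown above."""

    parser: TextParser
    splitter: Callable[[str], list[str]]


def line_splitter(text: str) -> list[str]:
    """Hypothetical splitter: one chunk per non-empty line."""
    return [line for line in text.splitlines() if line.strip()]


file_processors = {
    ".txt": FileProcessor(TextParser(), line_splitter),
    ".csv": FileProcessor(TextParser(), line_splitter),
}


def process_file(path: str) -> Optional[list[str]]:
    """Look up the processor by extension; return None if unsupported."""
    processor = file_processors.get(Path(path).suffix.lower())
    if processor is None:
        return None
    return processor.splitter(processor.parser.parse(path))
```

With this shape, supporting another plain-text format is a one-line addition to the dictionary, which is the kind of change the `.csv` entry above makes.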
