Closed zhongshuai-cao closed 9 months ago
I modified the code in prepdocs.py so that it can upload txt files.
Add this new argument:
parser.add_argument("--txtparser", action="store_true", help="parse txt files, note that all files in this transaction must be txt files.")
Then modify the code in get_document_text:
def get_document_text(filename):
    offset = 0
    page_map = []
    if args.txtparser:
        with open(filename, "r") as f:
            text = f.read()
        page_map.append((0, offset, text))
        offset += len(text)
    elif args.localpdfparser:
        reader = PdfReader(filename)
        pages = reader.pages
        for page_num, p in enumerate(pages):
            ......
It would be great if someone could help with MS Word and PowerPoint files.
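For reference, the text branch of the snippet above can be pulled out into a standalone helper. This is only a minimal sketch, not the repository's code: the function name `read_text_as_page_map` and the explicit `encoding` parameter are assumptions, while the `(page_num, offset, text)` tuple shape follows the snippet above.

```python
from pathlib import Path


def read_text_as_page_map(filename, encoding="utf-8"):
    """Return a one-entry page map for a plain-text file.

    Hypothetical helper: the whole file is treated as page 0 at offset 0,
    mirroring the (page_num, offset, text) tuples used in the snippet above.
    """
    text = Path(filename).read_text(encoding=encoding)
    return [(0, 0, text)]
```

Passing the encoding explicitly avoids depending on the platform default, which can differ between Windows and Linux.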
I re-deployed the code today after deleting all the out-of-the-box PDF files and adding my own tab-delimited file with a .txt extension, and the code failed with the following error. Did I miss something?
azure.core.exceptions.HttpResponseError: (InvalidRequest) Invalid request. Code: InvalidRequest Message: Invalid request. Inner error: { "code": "InvalidContent", "message": "The file is corrupted or format is unsupported. Refer to documentation for the list of supported formats." }
ERROR: failed running post hooks: 'postprovision' hook failed with exit code: '1', Path: './scripts/prepdocs.sh'. : exit status 1
After you add the --txtparser argument in prepdocs.py, add --txtparser to the command in prepdocs.ps1 and execute it:
Start-Process -FilePath $venvPythonPath -ArgumentList "./scripts/prepdocs.py $cwd/data/* --storageaccount $env:AZURE_STORAGE_ACCOUNT --container $env:AZURE_STORAGE_CONTAINER --searchservice $env:AZURE_SEARCH_SERVICE --index $env:AZURE_SEARCH_INDEX --formrecognizerservice $env:AZURE_FORMRECOGNIZER_SERVICE --tenantid $env:AZURE_TENANT_ID --txtparser -v" -Wait -NoNewWindow
Remove all PDF files from the data folder and put only txt files there; it should then work.
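Since the --txtparser flag described above applies to every file in the run, it may help to check the data folder before executing the script. The following is a small sketch of such a check; the helper name `non_txt_files` is hypothetical and is not part of prepdocs.py.

```python
from pathlib import Path


def non_txt_files(data_dir):
    """List files directly under data_dir whose extension is not .txt.

    Hypothetical pre-flight check; the comparison is case-insensitive.
    """
    return sorted(
        p.name
        for p in Path(data_dir).iterdir()
        if p.is_file() and p.suffix.lower() != ".txt"
    )
```

If the returned list is not empty, those files would need to be moved out of the folder before running with --txtparser.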
Office files are only supported by Form Recognizer in REST API version "2022/06/30-preview". https://learn.microsoft.com/en-us/azure/applied-ai-services/form-recognizer/concept-layout?view=form-recog-3.0.0#input-requirements
I can't find where to change to this version. Can someone help? Thanks!
@tickx-cegeka I think the answer to your question is here. line 12 of "infra/core/search/search-services.bicep"
@PTTrazavi, editing that line makes the deployment fail. I think that is because the line belongs to the search service, whereas the version I sent was for Form Recognizer. I also tried editing the version in requirements.txt, but I again get an error. I think that preview version is just not supported anymore. I believe the best way forward is to convert ppt/docx/... to pdf first in some way?
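One possible way to do that conversion is LibreOffice in headless mode. This is a sketch under assumptions, not something confirmed in this thread: it assumes LibreOffice is installed with `soffice` on the PATH, the helper names are hypothetical, and conversion fidelity for complex slides or documents may vary.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(source, out_dir, soffice="soffice"):
    """Build a LibreOffice headless command that converts `source` to PDF."""
    return [
        soffice,
        "--headless",
        "--convert-to",
        "pdf",
        "--outdir",
        str(out_dir),
        str(source),
    ]


def convert_to_pdf(source, out_dir):
    """Run the conversion and return the path where the PDF is expected.

    Assumes LibreOffice is installed and `soffice` is on PATH.
    """
    if shutil.which("soffice") is None:
        raise RuntimeError("soffice not found on PATH; install LibreOffice first")
    subprocess.run(build_convert_command(source, out_dir), check=True)
    return Path(out_dir) / (Path(source).stem + ".pdf")
```

The resulting PDFs could then be placed in the data folder and processed by the existing PDF path.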
What is the best practice for uploading data from URL scraping? Should I keep the HTML tags or remove them? How can I populate tags in the indexer? Regards.
Hi,
How should we modify prepdocs.py and prepdocs.ps1 or prepdocs.sh in order to add a CosmosDB source?
@quidba7 I don't know of any PRs that add CosmosDB support to this repository, but there are other OpenAI samples out there that access CosmosDB. Here's an example from my colleague @sethjuarez who uses CosmosDB with PromptFlow in a demo: https://github.com/sethjuarez/contoso-trek
You could also check if the llamaindex package can be used to ingest CosmosDB as a source.
@PTTrazavi thanks for adding this post. I am following your suggestions, but I keep getting various errors, e.g. "positional parameter cannot be found that accepts argument --txtparser". I am clearly missing something. Could you share your code for prepdocs.py and prepdocs.ps1, or the relevant parts? It would be very much appreciated.
This issue has two main components:
Currently, I'm grappling with displaying Excel files. The existing iframe supports only PDFs, and I'm exploring methods to render an Excel file from the storage account directly in the browser. This challenge will eventually encompass other Microsoft file formats like docx and ppt.
While I acknowledge there's much to be done in terms of developing the parser, any recommendations or insights on React components that can render Excel and other Microsoft file formats would be greatly valued.
I just went through the react-doc-viewer repo and found this issue saying that an Excel blob cannot be rendered. I would really appreciate any guidance on rendering it. For now I have to work on an Excel-to-CSV conversion in order to render the source...
https://github.com/cyntler/react-doc-viewer/issues/61#issuecomment-1333792820
What code do we need to add or change to make the app handle all types of documents in the same folder: pdf, doc, docx, ppt, txt, csv?
We will soon be merging two relevant PRs:
https://github.com/Azure-Samples/azure-search-openai-demo/pull/1195 - Support for JSONParser
https://github.com/Azure-Samples/azure-search-openai-demo/pull/1159 - Support for Integrated Vectorization
Once those are merged, it will be easier for you to define local parsers (like the JSONParser) but also possible for you to use the Azure AI search integrated vectorization which does document cracking in the cloud and supports more than just PDFs. We will also soon enable Document Intelligence to parse a lot more formats: https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/concept-layout?view=doc-intel-4.0.0#input-requirements
Stay tuned!
This is what I don't understand: the documentation mentions that Document Intelligence already handles reading docx, for example. Last night I tried following the documentation to parse a docx file, but two problems came up. First, it only accepts a URL, but my file is either in blob storage or saved locally on my machine. Second, it says that this file format is not yet handled...
@hammadilyes I'm experiencing the same issue, we may need to update to new doc intelligence SDK, I'm chatting with their team about it today.
Confirmed: https://github.com/Azure/azure-sdk-for-python/issues/33278 I'll make a branch using the new SDK today.
See this PR for the changes needed to parse docx, pptx, and xlsx with Azure Document Intelligence:
https://github.com/Azure-Samples/azure-search-openai-demo/pull/1224
Fantastic to hear! I'm curious if the front end is capable of displaying the original document in the same way it does for PDFs. It appears that office files stored in private blob storage are unable to utilize Microsoft's online viewer for content display.
Good question. At least in Edge, it will not render in the iframe and will instead download.
That's the same experience that happens if I try to view the xlsx in Edge from my desktop. So I don't think it's about Blob storage, it's about how Edge views an xlsx file. If you stored a link in the index to the cloud-based file, like in OneDrive, then that could open in online Excel. Alternatively you could make previews of the files as PDFs or images, if it was important to see in the browser.
I see the challenges with the iframe limitation and have personally attempted to address this by experimenting with React Doc Viewer. Unfortunately, there's a known issue when trying to use Microsoft's online viewer through it, and it seems there aren't many effective alternatives for rendering Office documents correctly with non-Microsoft services. My last attempt involved splitting Excel worksheets into separate CSV files, each named by sheet and index, and feeding them to React Doc Viewer. For DOCX files, I converted them into text.
Ultimately, while this approach has its merits, I agree that we might need to reconsider the necessity of rendering if the user's main intent is to simply answer their query or if they are content with downloading the files for local verification. This decision acknowledges the variance in user experience – for example, PDFs can be viewed directly in the browser, whereas other file formats prompt a download without immediate display.
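The worksheet-splitting step described above could look roughly like the following. This is a sketch and not the code that was actually used: all function names and the `00_SheetName.csv` naming scheme are assumptions, and reading the workbook relies on the third-party openpyxl package, which is imported lazily so the CSV-writing part works without it.

```python
import csv
import re
from pathlib import Path


def sheet_csv_name(index, sheet_name):
    """Build a file name such as '00_Sales.csv' from a sheet index and name."""
    safe = re.sub(r"[^A-Za-z0-9_-]+", "_", sheet_name).strip("_") or "sheet"
    return f"{index:02d}_{safe}.csv"


def write_sheets_as_csv(sheets, out_dir):
    """Write each (sheet_name, rows) pair to its own CSV file; return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, (name, rows) in enumerate(sheets):
        path = out_dir / sheet_csv_name(index, name)
        with open(path, "w", newline="", encoding="utf-8") as f:
            csv.writer(f).writerows(rows)
        paths.append(path)
    return paths


def iter_workbook_sheets(xlsx_path):
    """Yield (sheet_name, rows) from an .xlsx file; assumes openpyxl is installed."""
    from openpyxl import load_workbook  # third-party, imported lazily

    workbook = load_workbook(xlsx_path, read_only=True, data_only=True)
    for worksheet in workbook.worksheets:
        yield worksheet.title, worksheet.iter_rows(values_only=True)
```

With openpyxl available, `write_sheets_as_csv(iter_workbook_sheets("book.xlsx"), "out")` would produce one CSV per worksheet, which a viewer that understands CSV could then display.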
Thanks for sharing your experiments! What do other developers in this thread want, UI-wise? We could replace the iframe with a simple download link in the case of files that end with known-unrenderable extensions, for a simple approach.
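The decision logic for that simple approach is small. It is sketched here in Python for consistency with the rest of the thread, although the real change would live in the frontend; the extension list and the function name `citation_display_mode` are assumptions for illustration.

```python
from pathlib import PurePosixPath

# Assumed list; adjust to what the target browsers actually render inline.
UNRENDERABLE_EXTENSIONS = {".docx", ".pptx", ".xlsx"}


def citation_display_mode(filename):
    """Return 'download' for known-unrenderable extensions, else 'iframe'."""
    # Drop any fragment such as '#page=3' before looking at the extension.
    name = filename.split("#", 1)[0]
    suffix = PurePosixPath(name).suffix.lower()
    return "download" if suffix in UNRENDERABLE_EXTENSIONS else "iframe"
```

Anything not on the list keeps the current iframe behavior, so PDFs are unaffected.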
Sorry for asking again, but I am confused. Can we now use more than just PDFs? If yes, which files do we need to change, and what changes do we need to make to the code, please? Much appreciated @pamelafox and the rest of the team. The work you guys are doing is GREAT!
@hammadilyes You can try out the new code by either manually copying in the changes from the pull request or by using the GitHub CLI to check it out locally:
gh pr checkout 1224
Then you need to run azd provision to create a Document Intelligence instance in the right region, and then ./scripts/prepdocs.sh. If there are any unprocessed files in data that the new Doc Intel knows how to process, it should find them.
We use canadaeast for our region, but for Document Intelligence we would use canadacentral. Would that work?
The new documentintelligence SDK is only available in three regions, eastus, westus2, and westeurope, so you'd need to deploy to one of those regions. It's fine to have your resource group in one region and a resource inside in a different region, however.
OK, is Microsoft planning to make it available in Canada? Some companies and governments have concerns about having it outside of Canada.
I can't guarantee anything, but I generally expect to see it available in many more regions once it's past the preview stage. More on https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/concept-layout?view=doc-intel-4.0.0
Awesome! Keep it up guys you are doing a great job :)
It's now merged! See release notes for today:
https://github.com/Azure-Samples/azure-search-openai-demo/releases/tag/2024-02-15b
Hi, I have updated the project to enable Document Intelligence to read other types of files. However, when I try to upload a csv file I get an error stating:
azure.core.exceptions.HttpResponseError: (InvalidRequest) Invalid request. Code: InvalidRequest Message: Invalid request. Inner error: { "code": "InvalidContent", "message": "The file is corrupted or format is unsupported. Refer to documentation for the list of supported formats." }
I have already added .csv to the file_processors list in prepdocs.py, as shown below.
Do I need to make any further changes to the code to make this work?
Document Intelligence does not support CSV, as far as I know. However, you could use our simple local TextParser() instead:
".csv": FileProcessor(TextParser(), sentence_text_splitter),
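To illustrate how such an extension-to-processor mapping works, here is a self-contained sketch. The classes below are simplified stand-ins that only borrow the names `FileProcessor` and `TextParser` from the line above; they are not the repository's implementations, and `line_splitter` and `process` are hypothetical.

```python
import os
from dataclasses import dataclass
from typing import Callable, List


class TextParser:
    """Stand-in parser: returns the file content as a single string."""

    def parse(self, path: str) -> str:
        with open(path, "r", encoding="utf-8") as f:
            return f.read()


def line_splitter(text: str) -> List[str]:
    """Stand-in splitter: one chunk per non-empty line."""
    return [line for line in text.splitlines() if line.strip()]


@dataclass
class FileProcessor:
    parser: TextParser
    splitter: Callable[[str], List[str]]


# Each extension maps to the parser and splitter that should handle it.
file_processors = {
    ".txt": FileProcessor(TextParser(), line_splitter),
    ".csv": FileProcessor(TextParser(), line_splitter),
}


def process(path: str) -> List[str]:
    """Look up the processor by file extension and return the chunks."""
    ext = os.path.splitext(path)[1].lower()
    processor = file_processors.get(ext)
    if processor is None:
        raise ValueError(f"no processor registered for {ext!r}")
    return processor.splitter(processor.parser.parse(path))
```

The point of the design is that a CSV file never reaches Document Intelligence: the lookup routes it to a local parser that reads it as plain text.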
Updated Oct 4 2023
This issue has two main components:
Understanding how to parse each filetype differently.
Mention any other details that might be useful
This is a really cool demo; I like it a lot and appreciate all the work! In the demo, only PDF files are supported. Is the intention to convert all other formats into PDF as a preprocessing step, or is there any plan to include other document formats? I think MS Office file types such as docx, Excel, ppt, etc. are also important to businesses. If these files are supported, could you update the sample data/documentation to reflect that?
Thank you!