Azure-Samples / azure-search-openai-demo

A sample app for the Retrieval-Augmented Generation pattern running in Azure, using Azure AI Search for retrieval and Azure OpenAI large language models to power ChatGPT-style and Q&A experiences.
https://azure.microsoft.com/products/search
MIT License

QUESTION ABOUT FORMAT FILE #731

phfontes closed 10 months ago

phfontes commented 1 year ago

Please provide us with the following information:

This issue is for a: (mark with an x)

- [ ] bug report -> please search issues before submitting
- [ ] feature request
- [x] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)

Minimal steps to reproduce

Upload files to the data folder and run azd up.

Any log messages given by the failure

Expected/desired behavior

OS and Version?

Windows 10

azd version?

1.2.0

Versions

Mention any other details that might be useful

Hello, I would like to ask a question. I'm doing some testing with HTML and I see better performance in some situations. From the tests and feedback you've had so far, what's the best file format?



phfontes commented 1 year ago

Hello, any news?

pamelafox commented 1 year ago

Sorry, I'm not sure what you mean. Are you talking about the file format for data ingestion, or some other file format? We support PDF, so I'm not sure what you mean by HTML.

phfontes commented 1 year ago

Hello, thank you for your feedback. Sorry, what I mean is: what is the best file format for getting more accurate answers from the platform? Currently I'm using HTML and I see better results. However, I still have doubts about which format gives the most accurate answers.

pamelafox commented 1 year ago

Are you saying that you are ingesting HTML files? I'm not sure how that's possible, since prepdocs.py is set up only for ingesting PDF files. Did you first convert them to PDF? Please let us know how you've been working with HTML files.

phfontes commented 1 year ago

Hello, that's right. I'm creating a structure with HTML and sections, and when there are topics I use ul and li. I believed it could accept other formats. I must have been confused, because in the data_utils file there is a part about file_format_dict, and html is listed there. The structure I'm using is:

    <html>
    <head></head>
    <body>
      <section>
        <h1>Title of topic</h1>
        <ul>
          <li>Microsoft</li>
          <li>Azure</li>
        </ul>
      </section>
    </body>
    </html>
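
For readers who want to see what text a structure like this carries, here is a minimal sketch using only Python's standard `html.parser`. It is not how this repo ingests documents; the class name `SectionTextExtractor` and the helper `extract_items` are hypothetical names made up for illustration, and the sample markup is the structure described above.

```python
from html.parser import HTMLParser


class SectionTextExtractor(HTMLParser):
    """Collect the text of <h1> headings and <li> items in document order.

    Hypothetical helper for illustration only; not part of any repo.
    """

    def __init__(self):
        super().__init__()
        self.items = []        # list of (tag, text) tuples
        self._current = None   # tag whose text is currently being collected
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "li"):
            # A new heading or item implicitly ends the previous one,
            # which also copes with missing closing tags.
            self._flush()
            self._current = tag

    def handle_endtag(self, tag):
        if tag == self._current:
            self._flush()

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def _flush(self):
        # Normalize whitespace and store the collected text, if any.
        text = " ".join("".join(self._buffer).split())
        if self._current is not None and text:
            self.items.append((self._current, text))
        self._current = None
        self._buffer = []

    def close(self):
        super().close()
        self._flush()


def extract_items(html):
    parser = SectionTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.items


SAMPLE = """
<html><head></head><body>
<section>
  <h1>Title of topic</h1>
  <ul>
    <li>Microsoft</li>
    <li>Azure</li>
  </ul>
</section>
</body></html>
"""

items = extract_items(SAMPLE)
for tag, text in items:
    print(tag, text)
```

The extractor deliberately tolerates unclosed `<h1>` and `<li>` tags, since hand-written HTML often omits them.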

KoKyiSoe commented 1 year ago

Hello, I would also like to know whether HTML files are supported. Any update, please?

pamelafox commented 1 year ago

@phfontes I think you're referring to a different codebase, perhaps https://github.com/microsoft/sample-app-aoai-chatGPT ?

This repo doesn't directly support HTML, so you must convert to PDF first. I have a script that does that here: https://github.com/pamelafox/html-to-pdf-converter/blob/main/main.py
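
Before running azd up, it may help to check which files in the data folder would need converting. The following is a rough sketch, assuming (from this thread, not from reading prepdocs.py) that only `.pdf` files are ingested. The function name `split_by_support` and the constant `SUPPORTED_SUFFIXES` are hypothetical, and the example file names are invented.

```python
from pathlib import Path

# Assumption: taken from the discussion above, not read from prepdocs.py.
SUPPORTED_SUFFIXES = {".pdf"}


def split_by_support(paths):
    """Split file paths into (ready, needs_conversion) lists of file names."""
    ready, needs_conversion = [], []
    for path in map(Path, paths):
        # Compare suffixes case-insensitively so "guide.PDF" counts as a PDF.
        if path.suffix.lower() in SUPPORTED_SUFFIXES:
            ready.append(path.name)
        else:
            needs_conversion.append(path.name)
    return sorted(ready), sorted(needs_conversion)


# Invented example file names; for a real folder you could pass
# Path("data").iterdir() instead of a hard-coded list.
ready, needs_conversion = split_by_support(
    ["data/guide.PDF", "data/topics.html", "data/notes.htm"]
)
print("Ready to ingest:", ready)
print("Convert to PDF first:", needs_conversion)
```

Anything in the second list would go through an HTML-to-PDF step, such as the converter script linked above, before being placed in the data folder.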