Project creation: update sources in a different way

auroramariatumminello commented 1 year ago

Hi there,

I'm trying to create a project involving QnA using textual files. Despite reading the documentation and exploring the code, I find it very difficult to create sources in a suitable format for your code. In particular, when it comes to files and not URLs, it would be useful to pass the corpus of documents or a local path instead of a URL.

For instance, imagine a situation where I want to test your Q&A on my files on blob storage. I have to create a SAS token for the file and pass its URL and still, it will provide problems with both .txt and .pdf ((ValidationFailure) File name must be a valid file name with a supported extension.).

Would it be possible to understand how a file should be correctly passed as a source? Thank you

pvaneck commented 1 year ago

Thanks for the feedback. I've tagged someone who should be able to help with this.

ghost commented 1 year ago

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @bingisbestest, @nerajput1607.

Author:	auroramariatumminello
Assignees:	kristapratico
Labels:	`feature-request`, `question`, `Service Attention`, `customer-reported`, `Cognitive - QnA Maker`, `Cognitive - Language`
Milestone:	-

kristapratico commented 1 year ago

Hey @auroramariatumminello, thanks for opening the issue.

> Despite reading the documentation and exploring the code, I find it very difficult to create sources in a suitable format for your code. In particular, when it comes to files and not URLs, it would be useful to pass the corpus of documents or a local path instead of a URL.

@rokulka - is it possible to add a source in this way? Can we take this as a feature request?

> I have to create a SAS token for the file and pass its URL and still, it will provide problems with both .txt and .pdf ((ValidationFailure) File name must be a valid file name with a supported extension.).

@auroramariatumminello can you share a snippet of the code you've written to do this? I'm not able to reproduce the error with a simple .txt file.

auroramariatumminello commented 1 year ago

Here's the code to reproduce the Validation error:

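from azure.ai.language.questionanswering.authoring import AuthoringClient
from azure.core.credentials import AzureKeyCredential

# Function to create and deploy a project
def create_project(endpoint, key, name, description, sources):
    # Create an Authoring client
    client = AuthoringClient(endpoint, AzureKeyCredential(key))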
    with client:

        # Create project
        project_name = name
        project = client.create_project(
            project_name=project_name,
            options={
                "description": description,
                "language": "en",
                "multilingualResource": True,
                "settings": {
                    "defaultAnswer": None
                }
            })

        print("Project created.")

        # Update sources (REQUIRED TO DEPLOY PROJECT)
        update_sources_poller = client.begin_update_sources(
            project_name=project_name,
            sources=sources
        )

        # List of all sources
        # sources = update_sources_poller.result()

        # Deploy project
        deployment_poller = client.begin_deploy_project(
            project_name=project_name,
            deployment_name="production" # deployment name MUST always be "production"
        )
        deployment = deployment_poller.result()
        print(f"Deployment successfully created under {deployment['deploymentName']}.")

sources = [{
    "op": "add",
    "value": {
        "displayName": "Example document",
        "sourceUri": "https://download.microsoft.com/download/2/9/B/29B20383-302C-4517-A006-B0186F04BE28/surface-pro-4-user-guide-EN.pdf",
        "sourceKind": "file"
    }
}]

create_project(endpoint, key, "File-example", "Example with a file as source", sources)

HttpResponseError: (InvalidArgument) Invalid input. See details.
Code: InvalidArgument
Message: Invalid input. See details.
Exception Details:  (ValidationFailure) 'Source' must not be empty.
    Code: ValidationFailure
    Message: 'Source' must not be empty.
    Target: Update Source List[0].Value.Source

I've also tried with a .txt file: you just have to replace the sourceUri with the link to the text file. It produces the same error.

kristapratico commented 1 year ago

@auroramariatumminello thanks for sharing your code, I was able to reproduce the error. For the begin_update_sources call to be successful, you must pass the source which will be the name of the file. For example, this code works for me:

sources = [{
    "op": "add",
    "value": {
        "displayName": "Example document",
        "sourceUri": "https://download.microsoft.com/download/2/9/B/29B20383-302C-4517-A006-B0186F04BE28/surface-pro-4-user-guide-EN.pdf",
        "sourceKind": "file",
        "source": "surface-guide.pdf"
    }
}]

Let me know if that resolves the issue you're seeing.

@rokulka - source is described as optional in swagger spec. Is this not the case?

ghost commented 1 year ago

Hi, we're sending this friendly reminder because we haven't heard back from you in a while. We need more information about this issue to help address it. Please be sure to give us your input within the next 7 days. If we don't hear back from you within 14 days of this comment the issue will be automatically closed. Thank you!

sehajduggal commented 1 year ago

@kristapratico , source is not optional, it is a required parameter.

kristapratico commented 1 year ago

@sehajduggal I've opened a PR to correct source to be required in the swagger here: https://github.com/Azure/azure-rest-api-specs/pull/22065 Please take a look.

auroramariatumminello commented 1 year ago

@kristapratico your example works for me, but I need to add as a source the content of a .txt file stored in blob storage. According to your example, the file has to be reachable through a URL. To my knowledge, I would need to request a token for each file in the blob container and give that URL as the source for the project. Is it possible to feed the model directly with a long text instead? The only reason I'm deploying a project is to question long texts, which is not possible with your client.get_answers_from_text().

kristapratico commented 1 year ago

@auroramariatumminello Yes, the file can be reached through the blob SAS url. You can use the source parameter to name the source, so something like this:

sources = [{
            "op": "add",
            "value": {
                "displayName": "Example document",
                "sourceUri": "https://<storage-account-name>.blob.core.windows.net/<my-container>/<myfile.txt>?sp=r&st=2023-01-09T19:30:15Z&se=2023-01-10T03:30:15Z&spr=https&sv=2021-06-08&sr=b&sig=<sas-token>",
                "sourceKind": "file",
                "source": "myfile.txt"
            }
}]

> Is it possible to feed the model directly with a long text instead? The only reason I'm deploying a project is to question long texts, which is not possible with your client.get_answers_from_text().

The prebuilt API aka get_answers_from_text() would be the way to use text directly. It does have document length limitations like you mentioned. One workaround you could try is to break the text into smaller chunks and send them as multiple documents:

from azure.ai.language.questionanswering.models import AnswersFromTextOptions

# `client` here is a QuestionAnsweringClient, not the AuthoringClient
question = "How long does it take to charge surface?"
options = AnswersFromTextOptions(
    question=question,
    text_documents=[
        "document 1 - max 5120 chars long",
        "document 2 - max 5120 chars long",
        "document 3 - max 5120 chars long",
        "document 4 - max 5120 chars long",
        "document 5 - max 5120 chars long",
    ],
)
output = client.get_answers_from_text(options)
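The chunking step of this workaround can be sketched with the standard library alone. The 5120-character limit is taken from the example above; the helper name `split_into_documents` and the choice to break on whitespace are assumptions made for illustration and are not part of the SDK.

```python
def split_into_documents(text, max_chars=5120):
    """Split text into chunks of at most max_chars, preferring whitespace breaks."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Look for the last space or newline inside the window so words stay whole.
            cut = max(text.rfind(" ", start, end), text.rfind("\n", start, end))
            if cut > start:
                end = cut + 1
        chunks.append(text[start:end])
        start = end
    return chunks
```

The resulting list can be passed as `text_documents`. Any service limit on the number of documents per request still applies, so a very long text may need several requests.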

auroramariatumminello commented 1 year ago

@kristapratico thank you for your suggestions, I'll try with the blob one. May I ask you for some suggestions on how to request the SAS token for each file in the blob automatically?

When it comes to AnswersFromTextOptions, it is interesting, but unfortunately the documents I need to process contain more than 80k chars and the object supports a maximum of 5 items.

auroramariatumminello commented 1 year ago

@kristapratico if I try to provide a PDF as input with the following code:

filename = "dev/file.pdf"
sources = [{
    "op": "add",
    "value": {
        "displayName": "Example document",
        "sourceUri": "https://{storage}.blob.core.windows.net/{container}/{filename}?sp=r&st=2023-01-10T12:47:09Z&se=2023-01-10T20:47:09Z&spr=https&sv=2021-06-08&sr=b&sig={sas}".format(storage=storage, container=container, filename=filename, sas=sas),
        "sourceKind": "file",
        "source": filename
    }
}]

I obtain:

HttpResponseError: (InvalidArgument) Please verify azure search service is up.
Code: InvalidArgument
Message: Please verify azure search service is up.

While with .txt files, it seems that the file format is not supported:

HttpResponseError: (InvalidArgument) Invalid input. See details.
Code: InvalidArgument
Message: Invalid input. See details.
Exception Details:  (ValidationFailure) File name must be a valid file name with supported extension.
    Code: ValidationFailure
    Message: File name must be a valid file name with supported extension.
    Target: Add.Files[0]

Additionally, I wanted to ask you if there is official documentation about it, because I've only found REST API update-source but it is not exhaustive about the format of sources. Furthermore, I don't understand the redundancy of specifying a URL with sourceUri and a filename through source.

kristapratico commented 1 year ago

> @kristapratico thank you for your suggestions, I'll try with the blob one. May I ask you for some suggestions on how to request the SAS token for each file in the blob automatically?

By automatically do you mean programmatically? You can use the azure-storage-blob client library to generate a SAS token for the blob, here are the docs: https://azuresdkdocs.blob.core.windows.net/$web/python/azure-storage-blob/12.14.1/azure.storage.blob.html#azure.storage.blob.generate_blob_sas
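As a rough sketch of what that could look like: `generate_blob_sas` and `BlobSasPermissions` come from azure-storage-blob, while the helper names, the 8-hour expiry, and the use of an account key are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import quote


def blob_url_with_sas(account_name, container_name, blob_name, sas_token):
    """Build the full blob URL with the SAS token appended as the query string."""
    path = quote(f"{container_name}/{blob_name}")
    return f"https://{account_name}.blob.core.windows.net/{path}?{sas_token}"


def make_read_sas_url(account_name, account_key, container_name, blob_name, hours=8):
    """Generate a read-only SAS token for one blob and return its URL."""
    # Imported here so the URL helper above works without the package installed.
    from azure.storage.blob import BlobSasPermissions, generate_blob_sas

    sas_token = generate_blob_sas(
        account_name=account_name,
        container_name=container_name,
        blob_name=blob_name,
        account_key=account_key,
        permission=BlobSasPermissions(read=True),
        expiry=datetime.now(timezone.utc) + timedelta(hours=hours),
    )
    return blob_url_with_sas(account_name, container_name, blob_name, sas_token)
```

To cover every file in a container, the same function could be called in a loop over the blob names, for example those returned by `ContainerClient.list_blobs()`.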

> @kristapratico if I try to provide a PDF as input with the following code:

Interesting. I was able to reproduce these errors when using a filename that contained the folder path like "dev/file.pdf" in your example. If I remove the path in the name and just use "file.pdf" then both of your examples work successfully and upload the source. Can you give that a try and let me know if it works for you too?

That being said, the error message is not very good. @sehajduggal - can we improve the error message the service returns if the filename is not what the service expects?

> Additionally, I wanted to ask you if there is official documentation about it, because I've only found REST API update-source but it is not exhaustive about the format of sources. Furthermore, I don't understand the redundancy of specifying a URL with sourceUri and a filename through source.

Super fair on both accounts. This client library is generated directly from the swagger spec so I will open issues regarding improving documentation for source and sourceUri so that we can get this updated in the client library. In the meantime, @sehajduggal are you able to provide an exhaustive list for source formats and explain the difference between source/sourceUri?
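Until the error message improves, the folder-path problem described above ("dev/file.pdf" failing where "file.pdf" works) can be avoided by deriving the source name from the URL. This is a minimal sketch; the helper name `source_entry_from_url` is hypothetical, and the dictionary layout simply mirrors the examples in this thread.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def source_entry_from_url(source_uri, display_name):
    """Build an "add" operation whose `source` is the bare file name from the URL."""
    path = unquote(urlsplit(source_uri).path)
    file_name = posixpath.basename(path)  # "/container/dev/file.pdf" -> "file.pdf"
    if not file_name:
        raise ValueError(f"No file name found in {source_uri!r}")
    return {
        "op": "add",
        "value": {
            "displayName": display_name,
            "sourceUri": source_uri,
            "sourceKind": "file",
            "source": file_name,
        },
    }
```

The SAS query string is left untouched in `sourceUri`; only the path component is used to compute `source`.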

auroramariatumminello commented 1 year ago

> By automatically do you mean programmatically? You can use the azure-storage-blob client library to generate a SAS token for the blob, here are the docs: https://azuresdkdocs.blob.core.windows.net/$web/python/azure-storage-blob/12.14.1/azure.storage.blob.html#azure.storage.blob.generate_blob_sas

Yes, that's what I meant, thank you.

> Interesting. I was able to reproduce these errors when using a filename that contained the folder path like "dev/file.pdf" in your example. If I remove the path in the name and just use "file.pdf" then both of your examples work successfully and upload the source. Can you give that a try and let me know if it works for you too?

I thought it was the path too, so I removed it and retried and I obtained two different errors:

HttpResponseError: (InvalidArgument) Please verify azure search service is up.
Code: InvalidArgument
Message: Please verify azure search service is up.

or an error associated with the document itself, even though it was then loaded and visible in Language Studio:

HttpResponseError: Operation returned an invalid status 'OK'
Content: {
  "createdDateTime": "2023-01-11T14:39:24+00:00",
  "expirationDateTime": "2023-01-11T20:39:24+00:00",
  "jobId": "bf2a2794-baa5-49f2-b830-d75db443c50d",
  "lastUpdatedDateTime": "2023-01-11T14:39:26+00:00",
  "status": "failed",
  "errors": [
    {
      "code": "BadArgument",
      "message": "Cannot deploy an empty Project."
    }
  ]
}

This last error seems to be generated from deployment = deployment_poller.result():

deployment_poller = client.begin_deploy_project(
    project_name=project_name,
    deployment_name="production"  # deployment name MUST always be "production"
)
deployment = deployment_poller.result()
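One possible cause, which is an assumption and is not confirmed anywhere in this thread, is that the deployment starts before the source update has finished, since update_sources_poller.result() is commented out in the code posted earlier. A sketch that waits for the update first (the function name is hypothetical; `client` is expected to be an AuthoringClient):

```python
def update_sources_then_deploy(client, project_name, sources):
    """Wait for the source update to finish before starting the deployment."""
    update_poller = client.begin_update_sources(
        project_name=project_name,
        sources=sources,
    )
    # Block until the long-running source update has completed.
    update_poller.result()

    deploy_poller = client.begin_deploy_project(
        project_name=project_name,
        deployment_name="production",
    )
    return deploy_poller.result()
```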

Moreover, it seems like the .txt files are not supported, is it correct? I have no problems in providing the original PDFs, but having already the textual content of a file, wouldn't be more efficient for the QnA to accept mere text too?

auroramariatumminello commented 1 year ago

Update: I've tried with a pdf document and after removing deployment = deployment_poller.result() it seems to deploy the project correctly. I can see it and test it through Language Studio, but if I call client.get_answers() repeating the very same question tested with the UI, I get 'No good match found in KB'. Initially, I thought it was because of the confidence threshold, but after decreasing it to 0 I still get the same output.

New Update: In order to get the answer from a deployed project, without customizing the QnA model, you need to set the deployment_name argument as "test" and NOT "production", as mandatorily indicated when deploying a project

kristapratico commented 1 year ago

@akhator01 - can you help answer a few questions about the QA authoring service?

> Moreover, it seems like the .txt files are not supported, is it correct? I have no problems in providing the original PDFs, but having already the textual content of a file, wouldn't be more efficient for the QnA to accept mere text too?

Can you provide a list of the valid source formats for QA?

> In order to get the answer from a deployed project, without customizing the QnA model, you need to set the deployment_name argument as "test" and NOT "production", as mandatorily indicated when deploying a project

When should "test" or "production" be specified?

auroramariatumminello commented 1 year ago

> Can you provide a list of the valid source formats for QA?

@kristapratico @akhator01 According to Knowledge Bases Limits, the supported formats should be ".tsv", ".pdf", ".txt", ".docx", ".xlsx". Still, I'm not able to use .txt files as a source, even though the only thing that changes from the PDF source (which works) is the extension.

> When should "test" or "production" be specified?

The documentation specifies "production", the same name used when deploying the project, but I've found "test" used in the QnA Python SDK README. The distinction between the two is still not clear.

decadance-dance commented 1 year ago

Hey,

I've faced a similar issue. I want to update sources using local files (.docx in my case). I know it works well with URLs and files in blob storage, but I don't see how to do it with the begin_update_sources method.


file_path = "./test.docx"
sources = [{
    "op": "add",
    "value": {
        "displayName": "name1",
        "sourceUri": file_path,
        "sourceKind": "file",
        "source": "test.docx"
    }
}]

update_sources_poller = client.begin_update_sources(
            project_name=project_name,
            sources=sources
        )

I got this error:

azure.core.exceptions.HttpResponseError: (InvalidArgument) Please verify azure search service is up.
Code: InvalidArgument
Message: Please verify azure search service is up.

kristapratico commented 1 year ago

@auroramariatumminello @decadance-dance I'm working to get answers to your questions about updating sources. Thanks for being patient.

Based on how the REST API is currently designed, I do not believe it is possible to pass a local path to a file to update a source. This will need to be a blob SAS URL to the file. I can pass this on as a feature request to the responsible team.

decadance-dance commented 1 year ago

> I can pass this on as a feature request to the responsible team.

@kristapratico I would be very grateful, because this feature was implemented in the outdated QnA Maker and I would like this feature to be added to the new API.

auroramariatumminello commented 1 year ago

@kristapratico I was wondering if it is possible to get the start and end offset of the answer, just to know where the answer has been found with numerical indexes (a sort of span). I tried to play with the KnowledgeBaseAnswer object, but I don't think there is such attribute. Am I correct?

kristapratico commented 1 year ago

Hey @auroramariatumminello my understanding is that with the KnowledgeBaseAnswer, an answer span is only returned by the service for a "short answer", but not the entire answer text. So if the question is "how long does it take to charge a Surface?" and the answer is "It takes two to four hours to charge the Surface Pro 4 battery fully from an empty state. It can take longer if you’re using your Surface for power-intensive activities like gaming or video streaming while you’re charging it", the short answer might be "two to four hours" and that is what would be represented in the AnswerSpan on the KnowledgeBaseAnswer.short_answer.
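If a short answer is returned, numeric start and end indexes can be derived from it. This is a sketch under the assumption that the span object exposes `offset` and `length` attributes relative to the full answer text, and that short answers were requested in the query; the helper name is hypothetical.

```python
def short_answer_span(answer):
    """Return (start, end) character offsets of the short answer, or None."""
    span = getattr(answer, "short_answer", None)
    if span is None or span.offset is None or span.length is None:
        return None
    return span.offset, span.offset + span.length
```

With the returned pair, `answer.answer[start:end]` would give back the short answer text if the offsets are indeed relative to the answer string.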

Mukundahsd commented 1 year ago

@kristapratico @auroramariatumminello Summarized questions and answers for this thread are below.

1. Can a local file be used to create a source instead of a blob SAS URL by the API? If not, can we consider this as a feature-request?
2. What is the difference between source and sourceUri? https://github.com/Azure/azure-rest-api-specs/blob/main/specification/cognitiveservices/data-plane/Language/stable/2021-10-01/questionanswering-authoring.json#L1648 Why do both need to be passed?
3. What are the supported formats for a source? (+can we add this to the docs?)
4. When deploying a project and providing a deployment name, when should “test” or “production” be used?

Answers below:

1. A local file cannot be used, as we don't accept file content in the request body. The request body must contain either JSON input or the fileUri parameter. We have an example for this in our documentation.
2. The Source parameter is the unique identifier for the input source, and SourceUri is the URI of the file or URL. We need both as they serve different purposes. The SourceDisplayName parameter is a friendly, editable name for the source.
3. format=json/tsv/excel (Yes, we can add it to the docs)
4. By default, a "test" deployment is created for a new project, so when deploying a project only "production" can be used. Other Language services offer custom deployment names, but CQA only allows one fixed deployment name.
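On the query side, the choice between the two deployment names could be made explicit with a small wrapper. This is only a sketch: the function name `ask_project` is hypothetical, `client` is assumed to be a QuestionAnsweringClient, and restricting the names to "test" and "production" reflects this thread rather than a documented contract.

```python
VALID_DEPLOYMENTS = ("test", "production")


def ask_project(client, question, project_name, deployment_name="production"):
    """Query a project, allowing only the two deployment names discussed here."""
    if deployment_name not in VALID_DEPLOYMENTS:
        raise ValueError(
            f"deployment_name must be one of {VALID_DEPLOYMENTS}, got {deployment_name!r}"
        )
    return client.get_answers(
        question=question,
        project_name=project_name,
        deployment_name=deployment_name,
    )
```

Passing `deployment_name="test"` targets the default test deployment, while "production" targets whatever was last deployed with begin_deploy_project.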

kristapratico commented 1 year ago

Thanks for the clarification, @Mukundahsd.

> format=json/tsv/excel (Yes we can add it to the docs)

Quick question: according to the docs, these formats (json/tsv/excel) are for importing/exporting a project. For just adding a source, the docs show these formats: ".tsv", ".pdf", ".txt", ".docx", ".xlsx". We were seeing some errors with trying to upload a .txt source so want to confirm.

auroramariatumminello commented 1 year ago

@Mukundahsd Thank you for the answers, it's clear now. I was wondering if it is possible to provide alternate questions when making a request, as suggested by your article.

Moreover, how would you suggest providing textual content to the QnA service when deploying a project, to avoid loading heavy PDFs in Language Studio, given that .txt is not an acceptable format?

auroramariatumminello commented 1 year ago

@kristapratico @Mukundahsd I've tried again with .txt, this time with your UI on Language Studio, but I get this error:

BadArgument
Failed to extract QnAs from the source
 - ExtractionFailure: Unsupported / Invalid url(s). Failed to extract Q&A from the source

kristapratico commented 1 year ago

@auroramariatumminello would you be able to share the txt file that reproduces the error?

auroramariatumminello commented 1 year ago

@kristapratico I can't share the actual file I used since it contains sensitive data, but I've tried with a sample text and I still have the same issue. Sometimes when loading a .txt file, Language Studio doesn't return any error, but the file is not visible in the list of sources. Here's an example of a .txt file I've tried:

Behold, thou art fair, my love; behold, thou art fair; thou hast doves' eyes within thy locks: thy hair is as a flock of goats, that appear from mount Gilead.
Thy teeth are like a flock of sheep that are even shorn, which came up from the washing; whereof every one bear twins, and none is barren among them.
Thy lips are like a thread of scarlet, and thy speech is comely: thy temples are like a piece of a pome-granate within thy locks.
Thy neck is like the tower of David builded for an armoury, whereon there hang a thousand bucklers, all shields of mighty men.
Thy two breasts are like two young roes that are twins, which feed among the lilies.
Until the day break, and the shadows flee away, I will get me to the mountain of myrrh, and to the hill of frankincense.
Thou art all fair, my love; there is no spot in thee.

Moreover, I've tried with .docx, since it's the text format most similar to PDF, but for any question I ask, the answer is the same (the first line of the text), while this doesn't happen for the PDF with the same content. Can you explain why this happens?

kristapratico commented 1 year ago

@auroramariatumminello thanks for sharing. I can reproduce it in the Language Studio where no error is returned when adding a source and it doesn't show up in the list of sources. I've reached out to @Mukundahsd to look into it.

Mukundahsd commented 1 year ago

This is not a known issue. In fact, text files are supported. Can you please check with the customer on the formats? You can refer to the documentation here for the structuring.

regards

Mukunda

auroramariatumminello commented 1 year ago

@Mukundahsd @kristapratico looking at the suggested documentation, it seems like .txt files are supported only if structured. In my case I have semi-structured documents (with newlines after the title of each paragraph, for instance). Could this create issues when loading the file as a knowledge base?

Mukundahsd commented 1 year ago

The type of document (structured, semi-structured, or unstructured) determines how question-and-answer pairs are extracted, since each type is formatted differently. Unstructured documents are supported, but the way pairs are extracted and answers are determined is different.

auroramariatumminello commented 1 year ago

Thank you for your help. In the end it seems like Azure QA is not an appropriate choice, because:

- it does not extract any answers from plain text files;
- uploading PDFs does not work in all cases, such as for scanned documents.

In the future it would probably be more useful to update the QA functionality so that, for unstructured documents, a GPT model steps in and still returns the most valuable answer possible given the question. Again, thank you for your patience and your time.

Azure / azure-sdk-for-python

Project creation: update sources in a different way #27998