TIF File Support

jrohnerx commented 3 years ago

Please provide us with the following information:

This issue is for a: (mark with an `x`)

- [ ] bug report -> please search issues before submitting
- [x] feature request
- [ ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)

Minimal steps to reproduce

Add .tif to JSON for Indexer

Any log messages given by the failure

N/A

Expected/desired behavior

Being able to OCR and search/index .TIF images

OS and Version?

Windows 10.

Versions

Mention any other details that might be useful

I was hoping to add .tif support to this, as we have thousands of TIF files with text that could easily be OCR'd and searched. Is this possible? I added the .tif file format to the Indexer JSON as mentioned in the guide/instructions, but those files still don't show up in the search results, even after waiting a while.

Thanks! We'll be in touch soon.

jennifermarsman commented 3 years ago

There are two separate things that I want to address here: the feature request and the .tif files not being captured in the search results that you mentioned at the end.

TIF files not being captured in the search results: It sounds like you added the .tif file format to the list of indexedFileNameExtensions, but you still aren't seeing those files in your search results. Here's what I suspect happened. If you have already run the indexer once on your blob storage, the indexer keeps track of a high water mark (the timestamp of when it ran and what it has processed, so it knows that everything created before that point is already processed and in the index). I'm guessing that the high water mark is already past the time when your TIFs were added to blob storage, which is why the indexer isn't picking them up. There are two ways to deal with this:

1. You could reset your indexer and then rerun it. That will re-process all files. If your collection of files is pretty small, this is a simple, easy solution.
2. However, if you have a large collection of files already in the index, you likely don't want to pay to reprocess them. That is why we introduced a new feature, called Reset Docs, that lets you "touch" only certain files so that the indexer reruns only those files. You would call the ResetDocs API and specify the document keys for the TIF files only. Then run your indexer again and it should pull only those TIF files through.
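A minimal sketch of both options, written in Python as a set of request builders for the Azure Cognitive Search REST API. The service name, indexer name, and document keys are hypothetical placeholders, and the `api-version` value is an assumption (Reset Docs was a preview feature, so check which version your service supports). The builders only construct the requests; `send` is a helper for actually issuing them with an admin key.

```python
import json
import urllib.request
from urllib.parse import quote

# Assumption: a preview api-version that exposes Reset Docs on your service.
API_VERSION = "2020-06-30-Preview"


def _indexer_url(service, indexer, action, api_version=API_VERSION):
    """Build the URL for an indexer action such as reset, resetdocs, or run."""
    return (
        f"https://{service}.search.windows.net/indexers/"
        f"{quote(indexer)}/{action}?api-version={api_version}"
    )


def reset_indexer_request(service, indexer):
    """Option 1: reset the whole indexer so the next run re-processes every file."""
    return "POST", _indexer_url(service, indexer, "reset"), None


def reset_docs_request(service, indexer, document_keys):
    """Option 2: reset only the listed documents (for example, the TIF files)."""
    body = json.dumps({"documentKeys": list(document_keys)})
    return "POST", _indexer_url(service, indexer, "resetdocs"), body


def run_indexer_request(service, indexer):
    """Run the indexer after either kind of reset."""
    return "POST", _indexer_url(service, indexer, "run"), None


def send(method, url, body, api_key):
    """Issue one of the requests above using an admin API key."""
    request = urllib.request.Request(
        url,
        data=(body or "").encode("utf-8"),
        method=method,
        headers={"Content-Type": "application/json", "api-key": api_key},
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

With this sketch, the second option would be `send(*reset_docs_request(...), api_key)` followed by `send(*run_indexer_request(...), api_key)`. The document keys must match the key field of your index, so look them up in the index rather than guessing them from file names.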

Feature request: I'm not sure whether this is doable, and it will require a little work either way. I won't have time to get to this in the short term, so let me explain how I would do it and you can take a stab at it.

1. Add .tif to the file formats in the indexedFileNameExtensions in Indexer.json. It sounds like you did this already.
2. Modify your skillset to include OCR.
3. The QnA Maker service does not accept the .tif file format, so the current code path won't work, as we are just uploading the files directly. In QnAIntegrationCustomSkill.cs, you can see that in the method RunUploadToQnaMaker we put the list of files on a queue to be processed, and in the method Run we actually do the upload to QnA Maker, uploading the files directly. So you will need to write the output of the OCR Skill (which is held in memory) out to a file, and then upload this file. Some things that might help here:
   - The Conditional Skill could help you get just .tif files if needed.
   - The Knowledge Store and Shaper Skill could help you write out the results of the OCR Skill to blob storage.
   - Then you will need to modify our code to ensure that these new files in blob storage are added to the queue to be ingested into QnA Maker.
   - The documents need to be in a certain format and file type to be properly processed by QnA Maker. See here and here for more information. So you might also need a Custom Skill hosted in an Azure Function to properly parse and format the OCR Skill output into something that QnA Maker can process, and then perhaps write it out to blob storage in one of the supported file formats directly from there (instead of using the Shaper Skill).
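The first two steps above can be sketched as changes to the indexer and skillset definitions. This is a Python illustration under stated assumptions, not the accelerator's actual configuration: the sample indexer contents are hypothetical, and it assumes image extraction has to be switched on with `imageAction` so that the OCR skill has normalized images to read. The skill type and paths follow the built-in OCR skill's documented shape.

```python
import copy


def add_extension(indexer, extension=".tif"):
    """Return a copy of an indexer definition that also indexes `extension`
    and turns on image extraction for the OCR skill."""
    updated = copy.deepcopy(indexer)
    config = updated.setdefault("parameters", {}).setdefault("configuration", {})

    # indexedFileNameExtensions is a comma-separated string such as ".pdf,.docx".
    current = config.get("indexedFileNameExtensions", "")
    extensions = [e.strip() for e in current.split(",") if e.strip()]
    if extension not in extensions:
        extensions.append(extension)
    config["indexedFileNameExtensions"] = ",".join(extensions)

    # Assumption: the indexer is not already extracting images.
    config["imageAction"] = "generateNormalizedImages"
    return updated


# Built-in OCR skill to append to the "skills" array of the skillset.
OCR_SKILL = {
    "@odata.type": "#Microsoft.Skills.Vision.OcrSkill",
    "context": "/document/normalized_images/*",
    "inputs": [{"name": "image", "source": "/document/normalized_images/*"}],
    "outputs": [{"name": "text", "targetName": "text"}],
}


def add_ocr_skill(skillset):
    """Return a copy of a skillset definition with the OCR skill appended once."""
    updated = copy.deepcopy(skillset)
    skills = updated.setdefault("skills", [])
    if not any(s.get("@odata.type") == OCR_SKILL["@odata.type"] for s in skills):
        skills.append(copy.deepcopy(OCR_SKILL))
    return updated
```

With these definitions, the OCR text for each image would land at `/document/normalized_images/*/text`, which is the value a later skill (Shaper, Conditional, or a custom skill) would take as input.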

Now, the part that I need to check with the QnA Maker team on is whether that output can be processed by QnA Maker correctly. Their old model needed structure around the question and answer pairs, and I'm not sure if the OCR skill output would have that structure. I think they have a newer model where less structure is needed, but let me verify.
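For the custom skill idea mentioned in the list above, here is a rough Python sketch of the request/response envelope that a custom Web API skill exchanges with the indexer. The accelerator itself is written in C#, so this only illustrates the shape of the logic. The input name `ocrText` and output name `formattedText` are hypothetical, and joining pages with blank lines is a placeholder for whatever structure QnA Maker turns out to need.

```python
def format_ocr_records(request):
    """Turn per-page OCR text into one text body per document, using the
    custom skill envelope: {"values": [{"recordId": ..., "data": {...}}]}."""
    results = []
    for record in request.get("values", []):
        data = record.get("data", {})

        # Hypothetical input name; it would be mapped from the OCR skill output.
        pages = data.get("ocrText") or []
        if isinstance(pages, str):
            pages = [pages]

        # Placeholder formatting: one paragraph per page, blank pages dropped.
        text = "\n\n".join(p.strip() for p in pages if p and p.strip())

        result = {
            "recordId": record["recordId"],
            "data": {"formattedText": text},
            "errors": [],
            "warnings": [],
        }
        if not text:
            result["warnings"].append(
                {"message": "No OCR text found for this document."}
            )
        results.append(result)
    return {"values": results}
```

In an Azure Function, the HTTP handler would parse the request JSON, pass it to a function like this, and return the result as the JSON response. Writing the text out to blob storage in a file type QnA Maker supports would be an extra step in that handler.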

jrohnerx commented 3 years ago

Thank you, @jennifermarsman, for the response and information. I think given the circumstances, my best bet is to run scripts to convert our TIF files to OCR'd PDFs (done via our document management system) and output them to a share where the indexer can pick them up.

Right now I have the legwork for this done, but unfortunately the files are in an Azure File Share, which isn't a publicly supported index location at this time. I've heard that support for it is in some form of closed beta/test. Do you know who is in charge of allowing customers to participate in that?

jennifermarsman commented 3 years ago

@jrohnerx yes, it's in preview. Drop me an email at jennmar@microsoft.com and I can hook you up.

Azure-Samples / search-qna-maker-accelerator

TIF File Support #16
