Azure-Samples / search-qna-maker-accelerator

Cognitive Search Question Answering Solution Accelerator
Other

TIF File Support #16

Open jrohnerx opened 3 years ago

jrohnerx commented 3 years ago

Please provide us with the following information:

This issue is for a: (mark with an x)

- [ ] bug report -> please search issues before submitting
- [x] feature request
- [ ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)

Minimal steps to reproduce

Add .tif to JSON for Indexer

Any log messages given by the failure

N/A

Expected/desired behavior

Being able to OCR and search/index .TIF images

OS and Version?

Windows 10.

Versions

Mention any other details that might be useful

I was hoping to add .tif support to this, as we have thousands of TIF files with text that could easily be OCR'd and searched. Is this possible? I added the .tif file format to the indexer JSON as mentioned in the guide/instructions, but those files still don't show up in the search results after waiting a while.


Thanks! We'll be in touch soon.

jennifermarsman commented 3 years ago

There are two separate things that I want to address here: the feature request and the .tif files not being captured in the search results that you mentioned at the end.

.tif files not being captured in the search results: It sounds like you added the .tif file format to the list of indexedFileNameExtensions, but you still aren't seeing those files in your search results. Here's what I suspect happened. If you have already run the indexer once on your blob storage, the indexer keeps track of a high water mark (a timestamp of when it last ran and what it has processed, so it knows that everything created before that point is already processed and in the index). My guess is that the high water mark is already later than the time your TIFs were added to blob storage, and that is why the indexer isn't picking them up. There are two ways to deal with this:

  1. You could reset your indexer and then rerun it. That will re-process all files. If your collection of files is pretty small, this is a simple, easy solution.
  2. However, if you have a large collection of files already in the index, you likely don't want to pay to reprocess them. That is why we introduced a new feature, called Reset Docs, that lets you "touch" specific files so that only those files are reprocessed. You would call the ResetDocs API and specify the document keys for the TIF files only. Then run your indexer again and it should pull only those TIF files through.
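
For illustration, here is a minimal Python sketch of both options against the Azure Cognitive Search REST API. It only builds the requests; sending them is left to the caller. The service name, indexer name, API key, and document keys are placeholders, and the API version shown is an assumption (Reset Docs was a preview feature, so check the current documentation for the version that applies to your service).

```python
import json
import urllib.request

# Assumption: a preview API version that exposes Reset Docs; verify against the docs.
API_VERSION = "2020-06-30-Preview"


def build_indexer_request(service, indexer, action, api_key, body=None):
    """Build (but do not send) a POST request for an indexer action.

    `action` is one of "reset", "resetdocs", or "run".
    """
    url = (
        f"https://{service}.search.windows.net"
        f"/indexers/{indexer}/{action}?api-version={API_VERSION}"
    )
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"api-key": api_key, "Content-Type": "application/json"}
    return urllib.request.Request(url, data=data, headers=headers, method="POST")


def full_reset_requests(service, indexer, api_key):
    """Option 1: reset the whole indexer, then run it (reprocesses every file)."""
    return [
        build_indexer_request(service, indexer, "reset", api_key),
        build_indexer_request(service, indexer, "run", api_key),
    ]


def tif_only_requests(service, indexer, api_key, document_keys):
    """Option 2: reset only the given document keys, then run the indexer."""
    body = {"documentKeys": list(document_keys)}
    return [
        build_indexer_request(service, indexer, "resetdocs", api_key, body),
        build_indexer_request(service, indexer, "run", api_key),
    ]


def send_all(requests):
    """Send each request in order; raises urllib.error.HTTPError on failure."""
    for req in requests:
        with urllib.request.urlopen(req) as resp:
            print(req.full_url, resp.status)
```

To use it, pass the document keys of the TIF files to `tif_only_requests(...)` and hand the result to `send_all(...)`. The document keys are whatever your index uses as its key field, so they have to come from your own index.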

Feature request: I'm not sure whether this is doable, and if it is, it will require a little work. I won't have time to get to this in the short term, so let me explain how I would do it and you can take a stab at it.

Now, the part that I need to check with the QnA Maker team on is whether that output can be processed by QnA Maker correctly. Their old model needed structure around the question and answer pairs, and I'm not sure if the OCR skill output would have that structure. I think they have a newer model where less structure is needed, but let me verify.

jrohnerx commented 3 years ago

Thank you, @jennifermarsman, for the response and information. I think given the circumstances, my best bet is to run scripts to convert our TIF files to OCR'd PDFs (done via our document management system) and output them to a share where the indexer can pick them up.

Right now I have the legwork for this done, but unfortunately the files are in an Azure File Share, which isn't a publicly supported index location at this time. I've heard that it's in some form of closed beta/test. Do you know who is in charge of allowing customers to participate in that?

jennifermarsman commented 3 years ago

@jrohnerx yes, it's in preview. Drop me an email at jennmar@microsoft.com and I can hook you up.