ExamineX search not returning pdf results

robertr8 commented 1 year ago

I am using ExamineX with Umbraco version 10.1.1. I am using Azure Blob Storage, and my search currently works fine for content. However, it does not search my media files, so I don't get any PDF or document results. I installed the ExamineX.AzureSearch.Umbraco.BlobMedia package (Version="4.0.0-beta.1") as described in the documentation.

Is there something else I am missing to search media and get results?

Shazwazza commented 1 year ago

Hi,

I assume you have configured the Umbraco.StorageProviders.AzureBlob package for Umbraco already?

The way this works is when a media item is saved, ExamineX will populate the Blob with some extra metadata which is required for the Azure Search indexer to associate the analyzed file content with the index item for that node Id.

So you will need to re-save your media items. It would be possible to create a utility that goes and re-adds all the missing Blob metadata for your media too but that hasn't been created. I can share the code for the media item saved handler which you could use to write such a utility if you wanted to avoid having to re-save all of your media?

Also be sure to check your logs, since you may encounter this issue: https://github.com/SDKits/ExamineX/issues/72 (there is a workaround posted there too). Unfortunately, due to the way Azure Search works, if any of your media items don't contain the NodeId/xNodeId metadata, the Azure Search indexer will fail to run because it thinks the data source (the /media Blob folder) is not a valid data source.

robertr8 commented 1 year ago

Hi,

Yes, I have the Umbraco.StorageProviders.AzureBlob configured.

I checked my Azure Blob Storage and the media items are populated with the xNodeId and NodeId metadata. However, my search doesn't turn up any media results.

Do I need two search queries? My Search Query is:

IEnumerable<string> ids = Array.Empty<string>();
if (!string.IsNullOrEmpty(query) && _examineManager.TryGetIndex(Constants.UmbracoIndexes.ExternalIndexName, out IIndex? index))
{
    ids = index
        .Searcher
        .CreateQuery("content")
        .GroupedNot(new string[] { "umbracoNaviHide" }, new string[] { "1" })
        .Or().GroupedOr(new[] { "headerTitle" }, terms.Boost(10))
        .Or().GroupedOr(new[] { "textArea" }, terms.Boost(8))
        .Or().GroupedOr(new[] { "textArea2" }, terms.Boost(8))
        .Or().GroupedOr(new[] { "contactUsSection" }, terms.Boost(6))
        .And().GroupedOr(new string[] { "nodeName", "headerTitle", "textArea", "contactUsSection", "carousel", "carousel2", "textArea2", "headerTitle", "contactInformation", "contactForm", "contentSideDepartmentHead", "contentSideContactUsSection", "contentMainTextArea", "contentMainCarousel1", "contentMainCarousel2", "contentMainCodeRedRegistration", "table", "accordionContent", "submitFormSection" }, terms.Fuzzy())
        .Execute()
        .Select(x => x.Id);
}

foreach (var id in ids)
{
    yield return _umbracoHelper.Content(id);
}

Shazwazza commented 1 year ago

Your query above is only searching on 'content', not 'media' (i.e. .CreateQuery("content")). Your media item's PDF/Office doc content will get added to the media index for that item. You can test that this is working in the Examine Management dashboard by searching for a media Id that has a document. If it's working, the result will show a content field for that media item containing the analyzed document information.

Also note that you can limit the fields returned from Examine using the SelectField or SelectFields method on the IOrdering interface to improve performance (since you are only requiring the ID).

Your query has a lot of boolean logic which I think is not doing what you intend. Also note that any Not queries must come at the very end of your query. Not logic is a filter in Lucene/Examine: it reduces whatever you are searching on (Lucene doesn't work the same as SQL). I'd suggest doing a ToString() on your IBooleanOperation before you execute it to see what the actual Lucene query is. Doing several Or's with an And and a Not might lead to some interesting results.
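As a rough sketch of the points above (not a drop-in replacement for the original query): the positive clause comes first, the umbracoNaviHide exclusion is moved to the end, the generated query is printed before executing, and only the Id field is requested. The class and method names are hypothetical, the field list is shortened for illustration, and selecting `__NodeId` as the only returned field is an assumption about what is needed to read the result Id.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Examine;
using Examine.Search;

public static class ContentSearchSketch
{
    // Hypothetical helper: 'index' is the IIndex obtained from IExamineManager.TryGetIndex(...)
    public static IReadOnlyList<string> FindContentIds(IIndex index, string[] terms)
    {
        IBooleanOperation operation = index.Searcher
            .CreateQuery("content")
            // positive clause first: any term in any of these fields
            .GroupedOr(new[] { "nodeName", "headerTitle", "textArea" }, terms)
            // the Not filter goes last so it narrows the results matched above
            .Not().Field("umbracoNaviHide", "1");

        // inspect the generated query before running it
        Console.WriteLine(operation.ToString());

        return operation
            // assumption: only the Id is needed, so load just that field
            .SelectField("__NodeId")
            .Execute()
            .Select(result => result.Id)
            .ToList();
    }
}
```

Comparing the printed query with what you expected is usually the quickest way to spot clauses that have been combined in an unintended way.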

Please also check your logs to ensure you are not getting the error mentioned in https://github.com/SDKits/ExamineX/issues/72

robertr8 commented 1 year ago

Is there a way to search both "content" and "media" in one search query? Currently our query is searching content. Is there a way to include media in that query?

In the Examine Management dashboard there is no content field. Is this because of my search query, my azure blob storage, or something else?

I checked my logs and didn't have the error mentioned in #72

Shazwazza commented 1 year ago

> Is there a way to search both "content" and "media" in one search query? Currently our query is searching content. Is there a way to include media in that query?

Sure, just don't specify a category, so do .CreateQuery() instead of .CreateQuery("content").

> In the Examine Management dashboard there is no content field. Is this because of my search query, my azure blob storage, or something else?
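A minimal sketch of a category-less query, assuming the same external index as in the original post. The class and method names are hypothetical, and the two fields searched are only an illustration: `nodeName` is taken from the original query and `content` is the field where the extracted document text is added.

```csharp
using Examine;
using Examine.Search;

public static class CombinedSearchSketch
{
    // Hypothetical helper: searches content and media items in a single query
    public static ISearchResults SearchContentAndMedia(IIndex index, string[] terms)
    {
        return index.Searcher
            // no category argument, so results are not restricted to "content"
            .CreateQuery()
            .GroupedOr(new[] { "nodeName", "content" }, terms)
            .Execute();
    }
}
```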

I'm not sure what you are searching on but if you search for a media item by Id with Lucene syntax like:

__NodeId: 123

where 123 is a media item with a file like a PDF, then you can click the result and it will show you all fields for that item.

> I checked my logs and didn't have the error mentioned in https://github.com/SDKits/ExamineX/issues/72

That's good, do you have any other errors that might relate to this?

Can you try creating a new media item with a PDF and see if it works? It can take up to 5 minutes for Azure to process the file and add it to the index. In some cases PDFs cannot be read, so please be sure it's a valid PDF with text and not just images. You can also check in the Azure Portal, in your search service under 'Indexers', to see if they are running. There should be a green check mark and a value for Last run. You can click on an indexer too and it will show you the run schedule so you can verify it is running. This is what analyzes the files in the background, extracts text from them and updates the index.

robertr8 commented 1 year ago

Yes, in the Umbraco backoffice in Examine Management I searched for 1070, which is the __NodeId of one of my PDFs. I clicked on the fields for that item and there is no content field.

Shazwazza commented 1 year ago

Hi,

Yes this sounds exactly like the issue in https://github.com/SDKits/ExamineX/issues/72.

If ANY of the files in your blob container don't have the correct metadata, the indexer won't execute (as described in #72). You will probably have errors in your logs like the one shown there:

{"error":{"code":"","message":"Data source does not contain column '__NodeId', which is required because it maps to the document key field 'x__NodeId' in the index 'test-external'. Ensure that the '__NodeId' column is present in the data source, or add a field mapping that maps one of the existing column names to 'x__NodeId'."}}

You can always edit the JSON value of your indexer in the Azure portal and change disabled to false. When you save, it will show you why it cannot be enabled, which I'm assuming is the above error.

There's a workaround described in #72: change your data source to include the media prefix. But this still requires that all of your files under /media in your blob storage have the metadata applied.

Please also see the related bug with additional info https://github.com/SDKits/ExamineX/issues/68

robertr8 commented 1 year ago

Hi,

Yes sounds like you're right. I changed my blob folder to media last week when you linked #72 but I did not realize my indexers were disabled until your above comment.

After I enabled my indexers I was able to see the content field and it contained the extracted text from the files.

I appreciate all your help!

I have one last question - now that I am getting the content field in my search query how do I return that field so I can display it in my search results?

Per umbraco's documentation before I was returning my search results as foreach (var id in ids) { yield return _umbracoHelper.Content(id); }

I realize this won't work with media files, but haven't found a way to return the content field which contains my pdf text. Please let me know if there is a recommended way of returning this so I can display it on my search page.

Thanks!

Shazwazza commented 1 year ago

Good news :) Also note that whenever you re-save a media item, it will attempt to re-enable your indexers if they are disabled (after about a minute).

> I have one last question - now that I am getting the content field in my search query how do I return that field so I can display it in my search results?
>
> Per umbraco's documentation before I was returning my search results as foreach (var id in ids) { yield return _umbracoHelper.Content(id); }
>
> I realize this won't work with media files, but haven't found a way to return the content field which contains my pdf text. Please let me know if there is a recommended way of returning this so I can display it on my search page.

I guess I need a little context about what you are doing. If you are searching media items and want to convert them to IPublishedContent, which is what _umbracoHelper.Content(id) does for content, then you would have to use _umbracoHelper.Media(id) instead. You can determine from the Examine search result which category the item belongs to, either 'content' or 'media', and then pick which method to use.

robertr8 commented 1 year ago

Yes, thank you :)

We have a search page and want to be able to search content and media and display those results. For our content we display the title of a page and some of the text on that page. That works.

We want to do something similar with our media results and display the name of the pdf and some of the text on the pdf.

We did try the _umbracoHelper.Media(id) but this did not return the content field that contains pdf text

Shazwazza commented 1 year ago

Right, I see what you mean. That is because content isn't a media type field and the content of the PDF isn't saved back to Umbraco, it's just added to the index to be searched on. What you can do is map your media search results to a Tuple or custom object so that you have both the examine search result and the IPublishedContent for the content/media item to use to display information.

For example (pseudo code):

var searchResults = GetSearchResults(...);

var resultsWithPublishedContent = searchResults
    .Select(searchResult =>
        // "__IndexType" is assumed to be the Examine category field ("content" or "media")
        searchResult.Values.TryGetValue("__IndexType", out var category) && category == "media"
            // return a tuple of both the search result and the IPublishedContent
            ? (searchResult, _umbracoHelper.Media(searchResult.Id))
            : (searchResult, _umbracoHelper.Content(searchResult.Id)));

foreach (var searchResultWithPublishedContent in resultsWithPublishedContent)
{
    var examineResult = searchResultWithPublishedContent.searchResult;
    var publishedContent = searchResultWithPublishedContent.Item2;

    // now you can do stuff with both the examine result and its IPublishedContent
}
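To show some of the PDF text next to the media name, one option is to read the `content` field from the Examine result and trim it to a short excerpt. The helper below is only a sketch: the `SearchExcerpt` class, its method name and the 200-character default are hypothetical, and the dictionary parameter stands in for the `Values` collection of a search result.

```csharp
using System;
using System.Collections.Generic;

public static class SearchExcerpt
{
    // 'values' stands in for ISearchResult.Values; "content" is the field that
    // holds the text extracted from the document (see the discussion above).
    public static string FromValues(IReadOnlyDictionary<string, string> values, int maxLength = 200)
    {
        if (!values.TryGetValue("content", out var text) || string.IsNullOrWhiteSpace(text))
        {
            return string.Empty; // no extracted text for this item
        }

        text = text.Trim();
        if (text.Length <= maxLength)
        {
            return text;
        }

        // cut at the last space at or before the limit so words are not split
        int cut = text.LastIndexOf(' ', maxLength);
        if (cut <= 0)
        {
            cut = maxLength;
        }

        return text.Substring(0, cut).TrimEnd() + "...";
    }
}
```

Inside the loop above this could be called as `SearchExcerpt.FromValues(examineResult.Values)`, with `publishedContent.Name` used for the title.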

robertr8 commented 1 year ago

Thank you for all your help! :)

My team and I got our search page working displaying media results.

Shazwazza commented 1 year ago

That's great :)

SDKits / ExamineX

ExamineX search not returning pdf results #73