Index PDF files on Azure

slacto commented 2 years ago

When rebuilding the search index on a site that I recently moved from IIS to Azure, I get a lot of warnings regarding PDF files. "Failed to parse the content of the media file 'x.pdf'. IFilter not found for the given file extension."

IFilter is not supported on Azure web apps. Googling the error mostly turns up Sitecore results, and it seems Sitecore has moved away from IFilter for the same reason. The question is whether Orckestra has a solution, or whether we should build our own, e.g. by doing what Sitecore does: use PdfSharp to extract the text from PDF documents and then index it.

burningice2866 commented 2 years ago

You can use this class as a drop-in solution to index PDF files. It uses PdfSharp and PdfSharpTextExtractor:

public class PdfContentSearchExtension : ISearchDocumentBuilderExtension
{
    public void Populate(SearchDocumentBuilder searchDocumentBuilder, IData data)
    {
        // Only media files are of interest here.
        if (!(data is IMediaFile mediaFile))
        {
            return;
        }

        // Skip documents that already have text and a URL.
        if (searchDocumentBuilder.TextParts.Any() && !String.IsNullOrEmpty(searchDocumentBuilder.Url))
        {
            return;
        }

        var mimeType = MimeTypeInfo.GetCanonical(mediaFile.MimeType);
        if (!IsIndexableMimeType(mimeType))
        {
            return;
        }

        var text = GetText(mediaFile);
        if (String.IsNullOrWhiteSpace(text))
        {
            return;
        }

        searchDocumentBuilder.TextParts.Add(text);
        searchDocumentBuilder.Url = MediaUrls.BuildUrl(mediaFile, UrlKind.Internal);

        Log.LogInformation("PdfContentSearchExtension", $"{mediaFile.FileName} indexed successfully");
    }

    private static string GetText(IMediaFile mediaFile)
    {
        var sb = new StringBuilder();

        // Dispose the media stream as well as the PDF document.
        using (var stream = mediaFile.GetReadStream())
        using (var pdfDocument = PdfReader.Open(stream, PdfDocumentOpenMode.ReadOnly))
        {
            var extractor = new Extractor(pdfDocument);
            foreach (var page in pdfDocument.Pages)
            {
                // Append the text of each page, separated by a line break.
                extractor.ExtractText(page, sb);

                sb.AppendLine();
            }
        }

        return sb.ToString();
    }

    private static bool IsIndexableMimeType(string mimeType)
    {
        return mimeType == "application/pdf";
    }
}

Just register it in your startup handler like this:

public static void ConfigureServices(IServiceCollection serviceCollection)
{
    serviceCollection.AddSingleton<ISearchDocumentBuilderExtension>(new PdfContentSearchExtension());

    Log.LogInformation("Searching", "PdfContentSearchExtension registered");
}

slacto commented 2 years ago

Fantastic... Thanks, it works!

It sounds like Orckestra.Search.MediaContentIndexing cannot be used at all on Azure. Is that correct?

burningice2866 commented 2 years ago

Since all indexing in MediaContentIndexing relies on the IFilter interface, it should be safe to assume that it can't index anything on Azure.

The code above was actually written for a regular Windows server. I believe IFilter has been deprecated for many years, not only on Azure.

That leaves docx and other non-PDF file types unindexed and unsearchable unless you write custom code for those too.

burningice2866 commented 2 years ago

It should be fairly easy, though, to replace PdfSharpTextExtractor with TikaOnDotnet.TextExtractor, a library that on paper supports a variety of formats:

https://kevm.github.io/tikaondotnet/
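
As a rough, untested sketch of that swap: the helper below reads the media stream into memory and hands the bytes to Tika. The class name `TikaTextHelper` is hypothetical, and the `TikaOnDotNet.TextExtraction` namespace, the `TextExtractor.Extract(byte[])` overload and the `Text` property are assumed from the TikaOnDotnet documentation, so check them against the package version you install.

```csharp
using System.IO;
using TikaOnDotNet.TextExtraction; // assumed namespace of the TikaOnDotNet.TextExtractor package

// Hypothetical helper, not part of C1 CMS or of the class posted above.
public static class TikaTextHelper
{
    // Reads the whole stream into memory and lets Tika detect the format
    // and extract the plain text.
    public static string GetText(Stream stream)
    {
        using (var buffer = new MemoryStream())
        {
            stream.CopyTo(buffer);

            // Assumed API: Extract(byte[]) returns a result exposing the text as Text.
            var result = new TextExtractor().Extract(buffer.ToArray());
            return result.Text;
        }
    }
}
```

To use it, GetText in the class above would open mediaFile.GetReadStream() in a using block and pass the stream to this helper, and IsIndexableMimeType would need to accept the additional MIME types you want indexed (for example the docx type).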

Orckestra / C1-CMS-Foundation

Index PDF files on Azure #813