Open slacto opened 2 years ago
You can use this class as a drop-in solution to index PDF files. It uses PdfSharp and PdfSharpTextExtractor:
public class PdfContentSearchExtension : ISearchDocumentBuilderExtension
{
    public void Populate(SearchDocumentBuilder searchDocumentBuilder, IData data)
    {
        // Only media files are handled here.
        if (!(data is IMediaFile mediaFile))
        {
            return;
        }

        // Skip documents that already have text and a URL.
        if (searchDocumentBuilder.TextParts.Any() && !String.IsNullOrEmpty(searchDocumentBuilder.Url))
        {
            return;
        }

        var mimeType = MimeTypeInfo.GetCanonical(mediaFile.MimeType);
        if (!IsIndexableMimeType(mimeType))
        {
            return;
        }

        var text = GetText(mediaFile);
        if (String.IsNullOrWhiteSpace(text))
        {
            return;
        }

        searchDocumentBuilder.TextParts.Add(text);
        searchDocumentBuilder.Url = MediaUrls.BuildUrl(mediaFile, UrlKind.Internal);
        Log.LogInformation("PdfContentSearchExtension", $"{mediaFile.FileName} indexed successfully");
    }

    // Extracts the text of every page, one page per line block.
    private static string GetText(IMediaFile mediaFile)
    {
        var sb = new StringBuilder();
        // Dispose the media stream as well as the PDF document.
        using (var stream = mediaFile.GetReadStream())
        using (var pdfDocument = PdfReader.Open(stream, PdfDocumentOpenMode.ReadOnly))
        {
            var extractor = new Extractor(pdfDocument);
            foreach (var page in pdfDocument.Pages)
            {
                extractor.ExtractText(page, sb);
                sb.AppendLine();
            }
        }
        return sb.ToString();
    }

    private static bool IsIndexableMimeType(string mimeType)
    {
        return mimeType == "application/pdf";
    }
}
Just register it in your startup handler like this:
public static void ConfigureServices(IServiceCollection serviceCollection)
{
    serviceCollection.AddSingleton<ISearchDocumentBuilderExtension>(new PdfContentSearchExtension());
    Log.LogInformation("Searching", "PdfContentSearchExtension registered");
}
Fantastic... Thanks, it works!
It sounds like Orckestra.Search.MediaContentIndexing cannot be used at all on Azure. Is that correct?
Since all indexing in MediaContentIndexing relies on the IFilter interface, it must be safe to assume that it can't do any indexing on Azure.
The above code was actually written for a regular Windows server. I believe IFilter has been deprecated for many years, not only on Azure.
That leaves docx and other non-PDF file types unindexed and unsearchable unless you write custom code for those too.
It should be fairly easy, though, to replace PdfSharpTextExtractor with TikaOnDotnet.TextExtractor, a library that on paper supports a variety of formats.
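As a rough illustration of that swap, here is a minimal, untested sketch. It assumes the TikaOnDotnet.TextExtractor NuGet package, its TextExtractor class in the TikaOnDotNet.TextExtraction namespace, and its Extract(byte[]) overload; please check these against the version you install. The helper class name TikaTextHelper and the list of MIME types are hypothetical and only for illustration.

```csharp
using System.Collections.Generic;
using System.IO;
using TikaOnDotNet.TextExtraction; // assumption: namespace of the TikaOnDotnet.TextExtractor package

// Hypothetical helper, not part of Orckestra.Search or C1 CMS.
public static class TikaTextHelper
{
    // Illustrative list of MIME types to index; extend as needed.
    private static readonly HashSet<string> IndexableMimeTypes = new HashSet<string>
    {
        "application/pdf",
        "application/msword", // .doc
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document" // .docx
    };

    public static bool IsIndexableMimeType(string mimeType)
    {
        return mimeType != null && IndexableMimeTypes.Contains(mimeType);
    }

    // Reads the whole stream into memory and lets Tika detect the format.
    public static string GetText(Stream stream)
    {
        using (var buffer = new MemoryStream())
        {
            stream.CopyTo(buffer);
            var result = new TextExtractor().Extract(buffer.ToArray());
            return result.Text;
        }
    }
}
```

In the class above, GetText(mediaFile) could then pass mediaFile.GetReadStream() to TikaTextHelper.GetText, and IsIndexableMimeType could delegate to the helper. Whether Tika runs in a given hosting environment is something you would need to verify yourself.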
When rebuilding the search index on a site that I recently moved from IIS to Azure, I get a lot of warnings regarding PDF files: "Failed to parse the content of the media file 'x.pdf'. IFilter not found for the given file extension."
IFilter is not supported on Azure web apps. If I google it, I get a lot of Sitecore results. It seems Sitecore has moved away from IFilter for the same reason. The question is whether Orckestra has a solution, or whether we should create our own, e.g. by doing the same as Sitecore, which uses PdfSharp to extract text from PDF documents and then indexes it.