KevM / tikaondotnet

Use the Java Tika text extraction library on the .NET platform
http://kevm.github.io/tikaondotnet/
Apache License 2.0

Add TextExtractor.Extract overloads for custom extraction results #119

Closed: KevM closed this 6 years ago

KevM commented 6 years ago

There was a request to change the shape of the current TextExtractionResults. Rather than break backwards compatibility, I've introduced new overloads that let callers supply their own extraction results assembler: a function that takes the extracted text and the Tika metadata and returns whatever result type the caller wants.

public class CustomResult
{
    public string Text { get; set; }
    public IDictionary<string, string[]> Metadata { get; set; }
}

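// Assembler passed to Extract: receives the extracted text and the Tika Metadata.
public static CustomResult CreateCustomResult(string text, Metadata metadata)
{
    // Map each metadata name to all of its values (a name can carry several, e.g. multiple authors).
    var metaDataDictionary = metadata.names().ToDictionary(name => name, metadata.getValues);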

    return new CustomResult
    {
        Metadata = metaDataDictionary,
        Text = text,
    };
}

[Test]
public void should_extract_author_list_from_pdf()
{
    var textExtractionResult = new TextExtractor().Extract("file_with_authors.pdf", CreateCustomResult);

    textExtractionResult.Metadata["meta:author"].Should().ContainInOrder("Fred Jones, M. D.", "Donald Evans D. M.");
}
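
For reviewers who want to see the pattern in isolation, here is a minimal sketch of how an overload that accepts an assembler delegate might be shaped. It does not use Tika at all: `FakeMetadata` and `SketchExtractor` are hypothetical stand-ins invented for illustration, the generic signature is an assumption rather than the actual code in this PR, and the extractor returns canned values instead of parsing a file.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for Tika's Metadata type (illustration only).
// The method names mirror the Java-style names used in the example above.
public class FakeMetadata
{
    private readonly Dictionary<string, string[]> _values = new Dictionary<string, string[]>();

    public void Add(string name, params string[] values) => _values[name] = values;
    public string[] names() => _values.Keys.ToArray();
    public string[] getValues(string name) => _values[name];
}

// Hypothetical extractor; the real TextExtractor is not reproduced here.
public class SketchExtractor
{
    // Assumed shape of the new overload: the caller decides the result type
    // by supplying a function from (text, metadata) to TResult.
    public TResult Extract<TResult>(string filePath, Func<string, FakeMetadata, TResult> assembler)
    {
        // A real extractor would parse filePath; this sketch uses canned values.
        var metadata = new FakeMetadata();
        metadata.Add("meta:author", "Fred Jones, M. D.", "Donald Evans D. M.");

        return assembler("extracted text for " + filePath, metadata);
    }
}

public static class Program
{
    public static void Main()
    {
        // The assembler here builds a plain dictionary of all metadata values.
        var result = new SketchExtractor().Extract(
            "file_with_authors.pdf",
            (text, metadata) => metadata.names().ToDictionary(name => name, name => metadata.getValues(name)));

        Console.WriteLine(string.Join("; ", result["meta:author"]));
    }
}
```

Because the result type is a generic parameter inferred from the delegate, existing callers of the non-generic overloads would be unaffected, which is the backwards-compatibility point made above.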

This will close #117 when it is merged.

KevM commented 6 years ago

Also closes #115. I needed to get the CI updated for our brave new world of 2018.

KevM commented 6 years ago

@bouletator, let me know if this works for you.

@TechnikEmpire care to also do a code review?

KevM commented 6 years ago

If no one objects, I'll merge this PR later today.