KevM / tikaondotnet

Use the Java Tika text extraction library on the .NET platform
http://kevm.github.io/tikaondotnet/
Apache License 2.0
195 stars, 73 forks

Returning metadata extracted as a list in place of a joined string #117

Closed bouletator closed 6 years ago

bouletator commented 6 years ago

Hi,

I'm using your wrapper to extract metadata from multiple types of files. It's working great (so thank you for that :slightly_smiling_face:) for single-valued metadata. But when a metadata entry has multiple values, these values are joined with ", ", which may be the same separator the document's writer used between each author's last name and first name (e.g. "Bernal, M. A., deAlmeida, C. E." for the authors "Bernal, M. A." and "deAlmeida, C. E.").

In the source code of your wrapper, I found that you retrieve metadata from the Tika lib as a list of strings and then join them to return a single string.

Could you consider changing the Metadata property of TikaOnDotNet.TextExtraction.TextExtractionResult from a Dictionary<string, string> to a Dictionary<string, List<string>>?

I understand that it is a breaking change, but I think it would give users many more possibilities.

I have attached an example of one of our files with multiple authors (each author is written as last name and first name separated by ", ").

Regards, Clement

file_author.pdf https://github.com/KevM/tikaondotnet/files/1855514/file_author.pdf

KevM commented 6 years ago

Can you transform the TextExtractionResult into your own representation? Not every user will have comma-delimited data for their metadata values.

(Sent by email in reply to bouletator's message of Wed, Mar 28, 2018 at 7:24 AM.)

bouletator commented 6 years ago

Hi KevM, thanks for your quick reply. Sorry, I was not clear about what my problem is. I did some tests with Tika server and found that the result of the metadata extraction is a list of strings, whereas with TikaOnDotNet it is a single string (even though the Java Tika lib used by TikaOnDotNet returns a list of strings). This is because you do:

var metaDataResult = metadata.names()
    .ToDictionary(name => name, name => string.Join(", ", metadata.getValues(name)));

Do you think it could be a good enhancement to remove the join and let the users deal with the list of strings?
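To make the ambiguity concrete, here is a minimal sketch that does not touch Tika at all. The `Raw` dictionary is a hypothetical stand-in for the multi-valued metadata, and the key name "Author" is only an assumption for illustration. It compares the joined form with a list-preserving form:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class MetadataJoinDemo
{
    // Hypothetical stand-in for multi-valued metadata: name -> values.
    // The "Author" key is an assumption; the real key depends on the file.
    public static readonly Dictionary<string, string[]> Raw =
        new Dictionary<string, string[]>
        {
            ["Author"] = new[] { "Bernal, M. A.", "deAlmeida, C. E." }
        };

    // Behavior described above: all values collapsed into one string with ", ".
    public static Dictionary<string, string> Joined() =>
        Raw.ToDictionary(kv => kv.Key, kv => string.Join(", ", kv.Value));

    // Requested behavior: every value stays a separate list entry.
    public static Dictionary<string, List<string>> AsLists() =>
        Raw.ToDictionary(kv => kv.Key, kv => kv.Value.ToList());

    public static void Main()
    {
        string joined = Joined()["Author"];
        string[] pieces = joined.Split(new[] { ", " }, StringSplitOptions.None);

        // Splitting the joined string yields four fragments, not two authors.
        Console.WriteLine(pieces.Length);              // prints 4
        // The list form keeps the two authors intact.
        Console.WriteLine(AsLists()["Author"].Count);  // prints 2
    }
}
```

Once the values are joined, no split on ", " can tell a separator between authors from a comma inside a name, so the information has to be kept before the join.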

Let me know your thoughts.

Thanks again :)

regards, Clément

KevM commented 6 years ago

Ok, sorry, I had it backwards: you want a list, not a string. Hmm, I'll investigate a bit and look at extracting the file you posted.

KevM commented 6 years ago

You are correct that this would be a breaking change. If possible, I'd rather avoid that change at the moment as it may affect others.

It looks like you are running into the problem of commas nested within the author names, which in turn prevents you from easily splitting out the authors.

I feel like the best solution here is to allow you to provide a hook for the creation of your own ExtractionResult. Something like this:

TResult AssembleExtractionResult<TResult>(Metadata metadata) { 
   // transform metadata into a new TResult
}

Do you have the capacity to do this? If not I can take a look sometime in the future.
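For illustration only, a rough sketch of how such a hook might look from the caller's side. Everything here is hypothetical: `IMetadataSource` and `InMemoryMetadata` are stand-ins that mirror only the `names()` and `getValues()` calls quoted earlier in this thread, and the two-argument `AssembleExtractionResult` is an assumed shape, not the library's actual API or the change that was eventually made.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the metadata object; it mirrors only the two
// calls quoted in this thread (names, getValues).
public interface IMetadataSource
{
    string[] names();
    string[] getValues(string name);
}

// In-memory implementation so the sketch runs without Tika.
public sealed class InMemoryMetadata : IMetadataSource
{
    private readonly Dictionary<string, string[]> _values;

    public InMemoryMetadata(Dictionary<string, string[]> values)
    {
        _values = values;
    }

    public string[] names() => _values.Keys.ToArray();
    public string[] getValues(string name) => _values[name];
}

public static class ExtractionHook
{
    // Assumed hook shape: the caller decides how metadata becomes a TResult.
    public static TResult AssembleExtractionResult<TResult>(
        IMetadataSource metadata,
        Func<IMetadataSource, TResult> assemble)
    {
        return assemble(metadata);
    }

    // One possible assembler: keep every value as its own list entry.
    public static Dictionary<string, List<string>> ToLists(IMetadataSource m) =>
        m.names().ToDictionary(n => n, n => m.getValues(n).ToList());

    public static void Main()
    {
        var metadata = new InMemoryMetadata(new Dictionary<string, string[]>
        {
            // "Author" is an assumed key name, used only for this example.
            ["Author"] = new[] { "Bernal, M. A.", "deAlmeida, C. E." }
        });

        var result = AssembleExtractionResult(metadata, m => ToLists(m));
        Console.WriteLine(result["Author"].Count); // prints 2
    }
}
```

The appeal of a hook like this is that the existing string-valued result can stay the default, so current users are not broken, while callers who need the raw lists supply their own assembler.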

KevM commented 6 years ago

Never mind. I got to thinking about this and took a pass at a fix for you. #119

bouletator commented 6 years ago

Thanks a lot for your good work!

KevM commented 6 years ago

@bouletator Let me know how it goes and if you are able to use this as desired.

bouletator commented 6 years ago

Yes, it's working perfectly. Thanks a lot again for this good work! 🥇