Closed bouletator closed 6 years ago
Can you transform the TextExtractionResult into your own representation? Not every user will have comma-delimited data for their metadata values.
On Wed, Mar 28, 2018 at 7:24 AM bouletator notifications@github.com wrote:
Hi,
I'm using your wrapper to extract metadata from multiple types of files. It works great for single-valued metadata (so thank you for that :slightly_smiling_face:). But when a metadata field has multiple values, these values are joined with ", ", which may be the same separator the document's author used between each author's last name and first name (e.g. "Bernal, M. A., deAlmeida, C. E." for the authors "Bernal, M. A." and "deAlmeida, C. E.").
In the source code of your wrapper, I found that you retrieve metadata from the Tika lib as a list of strings and then join them to return a single string.
Could you consider changing the Metadata property of TikaOnDotNet.TextExtraction.TextExtractionResult from a Dictionary<string, string> to a Dictionary<string, List<string>>?
I understand that it is a breaking change, but I think it would give users many more possibilities.
I attach an example of one of our files with multiple authors (each author is written with last name and first name separated by ", ").
Regards, Clement file_author.pdf https://github.com/KevM/tikaondotnet/files/1855514/file_author.pdf
Hi KevM,
Thanks for your quick reply.
Sorry, I was not clear about what my problem is.
I did some tests with Tika server and found that the result of the metadata extraction is a list of strings, whereas with TikaOnDotNet it is a single string (even though the Java Tika library used by TikaOnDotNet returns a list of strings). This is because you do:
var metaDataResult = metadata.names()
    .ToDictionary(name => name, name => string.Join(", ", metadata.getValues(name)));
Do you think it could be a good enhancement to remove the join and let the users deal with the list of strings?
Let me know your thoughts.
Thanks again :)
regards, Clément
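To make the ambiguity concrete, here is a minimal sketch of what happens when the join quoted above is applied to values that themselves contain ", ". The class and method names are made up for illustration, and the author strings are sample data modelled on the names mentioned in this thread, not output from the attached PDF.

```csharp
using System;
using System.Collections.Generic;

public static class JoinAmbiguityDemo
{
    // Joins values with ", ", the same separator used in the quoted snippet.
    public static string JoinValues(IEnumerable<string> values) =>
        string.Join(", ", values);

    // Naive attempt to recover the original values from the joined string.
    public static string[] SplitValues(string joined) =>
        joined.Split(new[] { ", " }, StringSplitOptions.None);

    public static void Main()
    {
        // Hypothetical sample data: two authors, each written "Lastname, Initials".
        var authors = new List<string> { "Bernal, M. A.", "deAlmeida, C. E." };

        string joined = JoinValues(authors);
        string[] fragments = SplitValues(joined);

        Console.WriteLine(joined);
        // The two original values come back as four fragments.
        Console.WriteLine($"{authors.Count} authors in, {fragments.Length} fragments out");
    }
}
```

Once the values are joined, the boundary between authors cannot be told apart from the comma inside a name, which is why access to the raw list matters here.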
Ok, sorry, I had it backwards: you want a list, not a string. Hmm, I'll investigate a bit and look at extracting the file you posted.
You are correct that this would be a breaking change. If possible, I'd rather avoid that change at the moment as it may affect others.
It looks like you are running into the problem of commas nested within the author names, which in turn prevents you from easily splitting out the authors.
I feel like the best solution here is to allow you to provide a hook for the creation of your own ExtractionResult. Something like this:
TResult AssembleExtractionResult<TResult>(Metadata metadata) {
    // transform metadata into a new TResult
}
Do you have the capacity to do this? If not I can take a look sometime in the future.
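As a rough illustration of the hook idea, the sketch below lets the caller pass in a function that builds the result, so multi-valued fields can be kept as lists without changing the default behaviour. Everything here is an assumption made for the example: FakeMetadata is a stand-in exposing only the names() and getValues(name) calls seen earlier in the thread, and Extract and ToMultiValued are hypothetical names, not the API that was eventually merged.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the Tika Metadata type, limited to the two
// members used in this thread: names() and getValues(name).
public class FakeMetadata
{
    private readonly Dictionary<string, string[]> _values;

    public FakeMetadata(Dictionary<string, string[]> values) { _values = values; }

    public string[] names() => _values.Keys.ToArray();
    public string[] getValues(string name) => _values[name];
}

public static class AssemblerSketch
{
    // Hypothetical entry point: the caller decides how metadata becomes a result.
    public static TResult Extract<TResult>(
        FakeMetadata metadata, Func<FakeMetadata, TResult> assemble) =>
        assemble(metadata);

    // One possible assembler: keep every value instead of joining them.
    public static Dictionary<string, List<string>> ToMultiValued(FakeMetadata metadata) =>
        metadata.names().ToDictionary(
            name => name,
            name => metadata.getValues(name).ToList());

    public static void Main()
    {
        // Sample data modelled on the authors mentioned in this thread.
        var metadata = new FakeMetadata(new Dictionary<string, string[]>
        {
            ["Author"] = new[] { "Bernal, M. A.", "deAlmeida, C. E." }
        });

        var result = Extract(metadata, m => ToMultiValued(m));

        foreach (string author in result["Author"])
        {
            Console.WriteLine(author);
        }
    }
}
```

The appeal of this design is that existing callers keep the joined-string dictionary, while callers who need the raw values opt in by supplying their own assembler.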
Never mind. I got to thinking about this and took a pass at a fix for you. #119
Thanks a lot for your good work!
@bouletator Let me know how it goes and if you are able to use this as desired.
Yes, it's working perfectly. Thanks a lot again for this good work! 🥇