Closed: mullala1 closed this issue 7 years ago
Sorry you are having problems. Do you think you could start a PR with a broken test with that .msg file behaving poorly?
On Fri, Sep 9, 2016 at 1:55 PM mullala1 notifications@github.com wrote:
Hi,
Tikaondotnet seems to have an issue parsing .msg files. At first it seemed like it was stuck in an infinite loop somewhere, but it does usually return after a while. I downloaded the code and stepped into the TextExtractor class in an attempt to figure it out -- the parse method does return; the issue appears to be with closing the inputStream:
using (var inputStream = streamFactory(metadata))
{
    try
    {
        parser.parse(inputStream, getTransformerHandler(outputWriter), metadata, parseContext);
    }
    finally
    {
        inputStream.close();
    }
}
I've tried several different .msg files, with the same result. Those same files do work via the Tika GUI, which is using the same tika-app-1.13.jar under the hood.
Any idea if there is a way around this? Tikaondotnet is awesome and I'd love to make use of it if there's a way to get the .msg files working. I've attached a (zipped) .msg sample to help reproduce the issue.
Many thanks,
Andrew
emailTest.zip https://github.com/KevM/tikaondotnet/files/464616/emailTest.zip
I'll have to go figure out how to do that -- I'm fairly new to git/github. In my cloned copy, I added the test below -- strangely enough, it seems to work ok if you actually run the test, but if you step into it using the debugger, it hangs for up to a minute or so. There must be some underlying issue though, as that was the same behavior I initially experienced when trying it out via the nuget package.
[Test]
public void should_extract_from_msg()
{
    var textExtractionResult = _cut.Extract("files/emailTest.msg");

    // Deliberately fail the test for now, since it does eventually return with data
    textExtractionResult.Text.Should().Contain("DELIBERATELY FAILED");
}
Think I've figured out how to do it...I'll give it a shot.
If you want to speed up reading MSG files, I made an MSG reader a while back. You can find it here: https://github.com/Sicos1977/MSGReader
I think we should edit this test so that it uses a stopwatch to fail properly when extraction takes too long. Maybe I could throw a profiler on this to see what is taking so long?
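A minimal sketch of the stopwatch idea, assuming nothing beyond the .NET base class library. The TimingAssert class and its method names are hypothetical and are not part of TikaOnDotNet or the project's test suite. The check runs only after the action finishes, so it reports a slow extraction but does not interrupt one that hangs.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper (not part of TikaOnDotNet): times an action
// and fails when it takes longer than an allowed limit.
public static class TimingAssert
{
    // Runs the action once and returns how long it took.
    public static TimeSpan Measure(Action action)
    {
        var stopwatch = Stopwatch.StartNew();
        action();
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }

    // Throws after the action completes if it exceeded the limit.
    public static void CompletesWithin(TimeSpan limit, Action action)
    {
        var elapsed = Measure(action);
        if (elapsed > limit)
        {
            throw new TimeoutException(
                $"Expected a value less than {limit.TotalSeconds}s, but found {elapsed.TotalSeconds:F3}s.");
        }
    }
}
```

In the test from this thread, the call might look like TimingAssert.CompletesWithin(TimeSpan.FromSeconds(2), () => _cut.Extract("files/emailTest.msg")); the two-second limit is only an illustrative value and would need tuning per machine.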
I finally tried out this test. It is not super fast, but it takes only about 2 seconds on my laptop:
Test 'TikaOnDotNet.Tests.sad_text_extraction.issue_63' failed: Expected a value less than 2s, but found 2.224s.
I'll try to take a look at this in a profiler. But my expectation is that I won't be able to make it any faster.
Thanks for looking at it. That's probably as good as we're going to get. I haven't looked at this in a while, but I think the extreme slowness (i.e. 30-60 seconds or more) was caused by some interaction between Tika when run via IKVM and the Visual Studio debugger process. When running without the debugger, it seems to perform normally (as in your test above). We could probably close this issue since at runtime it seems to work ok.