KevM / tikaondotnet

Use the Java Tika text extraction library on the .NET platform
http://kevm.github.io/tikaondotnet/
Apache License 2.0
197 stars, 72 forks

Newb - 16 down to 5 much smaller deployment? #111

Closed (ghost closed this issue 6 years ago)

ghost commented 6 years ago

Newb here, using TikaOnDotNet to parse PDF files.

I took the blog post to imply that I could delete all the IKVM assemblies except the 5 listed.

I did so, but now TextExtractor.Extract() fails. I also see that the blog post had 16 assemblies originally, but I started with 28.

Any suggestions to reduce deployment size?

KevM commented 6 years ago

I think that post is no longer correct: Tika now has a lot of new code that needs more dependencies. I'm not sure exactly which ones are required. Maybe you could watch the assembly loader or profile the app and see which dependencies actually get loaded for your workloads.
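One way to "watch the assembly loader" is to subscribe to the standard `AppDomain.AssemblyLoad` event while a representative workload runs. The following is a minimal sketch, not something from this project: the `AssemblyLoadWatcher` class and its `Record` method are hypothetical names introduced for illustration. Note that the event only reports assemblies loaded after the handler is attached, so it should be hooked up before the first extraction call.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper (not part of TikaOnDotNet): records which assembly
// files get loaded from disk while a given workload runs.
public static class AssemblyLoadWatcher
{
    public static SortedSet<string> Record(Action workload)
    {
        var loaded = new SortedSet<string>(StringComparer.OrdinalIgnoreCase);

        AssemblyLoadEventHandler handler = (sender, args) =>
        {
            var asm = args.LoadedAssembly;
            // Dynamic assemblies have no file on disk, so skip them.
            if (!asm.IsDynamic && !string.IsNullOrEmpty(asm.Location))
            {
                lock (loaded)
                {
                    loaded.Add(Path.GetFileName(asm.Location));
                }
            }
        };

        AppDomain.CurrentDomain.AssemblyLoad += handler;
        try
        {
            workload();
        }
        finally
        {
            // Always detach so later loads are not attributed to this workload.
            AppDomain.CurrentDomain.AssemblyLoad -= handler;
        }
        return loaded;
    }
}
```

The workload passed to `Record` would be whatever call you normally make, for example the `TextExtractor.Extract(path)` call mentioned in this thread. The result only covers the code paths that workload exercised, so other file formats may pull in assemblies that do not show up here.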

ghost commented 6 years ago

OK, thanks for the quick response. I'll see what I can find out with a profiler.

ghost commented 6 years ago

Hey Kevin, I'm using TikaOnDotNet.TextExtractor (v.1.16.0) to extract from a PDF file.

I ran TextExtractor.Extract(string path), and saw that 17 assemblies loaded (see below) compared to the 28 in the package. I then removed all but these 17, and the extraction results were the same.

I did the same with an Excel (.xlsx) file and a Word (.docx) file, and it worked as usual with these 17.

IKVM.OpenJDK.Beans.dll
IKVM.OpenJDK.Charsets.dll
IKVM.OpenJDK.Core.dll
IKVM.OpenJDK.Localedata.dll
IKVM.OpenJDK.Media.dll
IKVM.OpenJDK.Security.dll
IKVM.OpenJDK.SwingAWT.dll
IKVM.OpenJDK.Text.dll
IKVM.OpenJDK.Util.dll
IKVM.OpenJDK.XML.API.dll
IKVM.OpenJDK.XML.Bind.dll
IKVM.OpenJDK.XML.Parse.dll
IKVM.OpenJDK.XML.Transform.dll
IKVM.OpenJDK.XML.XPath.dll
IKVM.Runtime.dll
TikaOnDotNet.dll
TikaOnDotNet.TextExtraction.dll
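For anyone wanting to repeat this comparison, here is a rough sketch of one way to do it, assuming the check runs in the same process right after the extraction calls. `DeploymentTrim` and `UnloadedDlls` are hypothetical names, not part of TikaOnDotNet; the code only uses the standard `AppDomain.GetAssemblies` and `Directory.EnumerateFiles` calls to list the DLLs in a folder that no loaded assembly came from.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper (not part of TikaOnDotNet): lists DLL files in a
// deployment folder that the current process has not loaded.
public static class DeploymentTrim
{
    public static List<string> UnloadedDlls(string deployDir)
    {
        // File names of every assembly currently loaded from disk.
        var loaded = new HashSet<string>(
            AppDomain.CurrentDomain.GetAssemblies()
                .Where(a => !a.IsDynamic && !string.IsNullOrEmpty(a.Location))
                .Select(a => Path.GetFileName(a.Location)),
            StringComparer.OrdinalIgnoreCase);

        // DLLs on disk whose file name matches no loaded assembly.
        return Directory.EnumerateFiles(deployDir, "*.dll")
            .Select(f => Path.GetFileName(f))
            .Where(name => !loaded.Contains(name))
            .OrderBy(name => name, StringComparer.OrdinalIgnoreCase)
            .ToList();
    }
}
```

Calling `DeploymentTrim.UnloadedDlls(AppDomain.CurrentDomain.BaseDirectory)` after extracting one sample of each file type you care about gives a list of removal candidates. It is only a candidate list: an assembly that was not loaded for these samples may still be needed for other formats, so re-test after removing anything.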

KevM commented 6 years ago

Thanks for letting us know. This may help out other users deploying TikaOnDotnet.

I don't think there is anything we can do in this project to cut down the IKVM assemblies that get pulled in, as that is managed by NuGet.

HolisticMystic commented 6 years ago

Keep it up, Kevin. This is a very cool project.