Open KevM opened 2 years ago
TBD: TikaOnDotnet.TextExtraction should use a nuspec or have csproj properties to make the listing as nice as TikaOnDotnet.
Hey I'm not a designer, but if you like it I can add in a commit to this branch.
Thank you!
Hey KevM,
> Do we need to target .Net 6?
Yes, we do need Tika on .Net to target .Net 6 at my organization for a couple of projects.
When do you think we could expect a new release?
There is one failing test for rtf files. No idea why it is not working. I was going to work on getting a pre-release out and then let people try it out for a bit before committing to a release.
Note: I’d be willing to take a short contract to get this release out quicker. I am self-employed.
On Wed, Sep 14, 2022, at 7:01 AM, Smiechowski Nathanael wrote:
> Hey KevM,
> > Do we need to target .Net 6?
> Yes we do need Tika on .Net to target .Net 6 at my organization for a couple of projects,
> When do you think we could expect a new release?
I was able to pretty easily target net6.0 throughout and pass tests (except the RTF one) without changing any other dependency versions by incrementing version numbers and adding net6.0 to the .csproj targets.
Regarding the RTF test - it seems to pass when the RTF file doesn't contain an image. Without digging into the Java side of things I can't provide much feedback beyond that.
If you'd like, I can submit a PR for the net6 support but that'll take a bit of approval on my end as I'm using this for an internal project.
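As a rough sketch of the change described above (adding net6.0 to the .csproj targets): the existing target shown here, net462, is an assumption standing in for whatever each project already lists. Only the net6.0 entry comes from this comment.

```xml
<PropertyGroup>
  <!-- net462 is a hypothetical placeholder for the project's existing targets;
       net6.0 is the addition described in the comment above. -->
  <TargetFrameworks>net462;net6.0</TargetFrameworks>
</PropertyGroup>
```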
Hey ya'll. I fell into this thread while following links blindly. I revived the IKVM project.
To get Core out, and because nobody really wanted to fix it, we didn't pay any attention to AWT. So, no AWT in IKVM. My guess is this is killing your attempted usage of Java2D. I don't really know though, since I didn't do any more investigation yet besides read this thread.
The previous AWT default toolkit was IKVM.AWT.WinForms. An attempt to map the AWT stuff to WinForms. As ya'll know, WinForms is quite different in Core. And it's not cross platform anyways. So we just didn't get it building, and probably aren't going to spend any time on it.
Instead though, you can probably configure IKVM to run in headless mode, just as you would configure OpenJDK to do so. Some System property you can set.
8.2.2 will end up with headless mode enabled by default.
Somebody try that.
Also, while I'm here, I want to make ya'll aware of IKVM.Maven.Sdk.
This is a new strategy for Java-Libraries-on-DotNet.
Instead of actually distributing cross compiled JAR -> DLL files on NuGet, which is error prone, as you don't own the assembly name or code, we allow users to directly add references to Maven packages in their C# projects. That way only one authoritative source for Java libraries exists: Maven. And the author of the actual product owns those artifacts. And the end developer is the one downloading and doing the conversion, and no middle man is redistributing licensed code.
We also embed some fancy information inside produced NuGet packages describing these references to Maven. So, say you write some .NET code that uses a library from Maven. And then you Pack that. Your produced NuGet file actually has a .pom file embedded into it. When somebody installs your NuGet package, and builds their own library, IKVM.Maven.Sdk downloads the Maven artifacts and cross compiles them on the fly.
It kind of obsoletes projects like TikaOnDotNet, unless you guys provide some functionality beyond the Java library. Like extension methods, etc.
This is great! However, it's building extremely slowly for me. Is that expected for IKVM.Maven.Sdk? Any recommendations for how the Maven build process can be sped up?
Depends. The first build is definitely going to be a thing. Likely it has to download two dozen jars and convert them all. But that information is cached for subsequent builds.
Can you describe what it looks like it's doing?
> Also, while I'm here, I want to make ya'll aware of IKVM.Maven.Sdk. It kind of obsoletes projects like TikaOnDotNet, unless you guys provide some functionality beyond the Java library. Like extension methods, etc.
Thanks for the information and suggestion of IKVM.Maven.SDK. Unfortunately upon digging deeper I realized the IKVM Nuget package is licensed under GPL and as such won't work for me. In addition this project likely needs to have a change of license to accommodate the requirements of GPL.
> Depends. The first build is definitely going to be a thing. Likely it has to download two dozen jars and convert them all. But that information is cached for subsequent builds.
> Can you describe what it looks like it's doing?
Here's a configuration snippet for a .NET 6 console app with no other packages or code that was slow for me:
```xml
<ItemGroup>
  <PackageReference Include="IKVM.Maven.Sdk" Version="1.0.2" />
  <MavenReference Include="org.apache.tika:tika-app" Version="2.5.0" />
</ItemGroup>
```
Here's another configuration that was slow for me:
```xml
<ItemGroup>
  <PackageReference Include="IKVM.Maven.Sdk" Version="1.0.2" />
  <MavenReference Include="org.apache.tika:tika-core" Version="2.5.0" />
  <MavenReference Include="org.apache.tika:tika-parsers-standard-package" Version="2.5.0" />
</ItemGroup>
```
Building with these configurations is extremely slow even after the initial build. If I only include tika-core, then the build is fast, but that library doesn't have the parsers I need. I skimmed through the source code for IKVM.Maven.Sdk, and as near as I can tell, the Java artifacts will be downloaded and built every single time. It may just be the case that Maven is inherently slow when there are many/large dependencies.
After searching on Google for "how to speed up maven builds", one of the suggestions is to build the artifacts in parallel using multiple threads (e.g. "mvn -T 4 install"), and another suggestion is to use "offline" mode after the initial build so that maven doesn't check the internet again.
Unfortunately, the best solution for me might be simply manually building the tika dlls and adding them to source control.
> and as near as I can tell, the Java artifacts will be downloaded and built every single time
Nope. The dependency graph is cached until it's changed.
> After searching on Google for "how to speed up maven builds", one of the suggestions is to build the artifacts
We don't build artifacts.
I need to know where in the process you are experiencing a slowdown. What is the output when it's slow?
On tika-app, my understanding is that's not a library you're actually supposed to use as a dependency, but a JAR file with all of the dependencies embedded into it, and a main entry point, for running it as an app. Like, all the logging stuff is copied into it. Which would break trying to use tika-app along with other Java libraries that use the same logging libraries.
Like imagine if a user tried to use both tika-app and also, say, I don't know, Apache Foobar. And both depend on commons-logging. The tika-app JAR has commons-logging copied into it. While Apache Foobar uses the version from commons-logging. They'd be using different classes and assemblies, and configuration for one wouldn't work right for the other. Which is really the reason to favor using a single source in the first place.
There is probably documentation about how Tika users in Java make use of tika-core and the parsers properly.
Yeah, from the documentation:
> tika-app/target/tika-app-0.7.jar Tika application. Combines the above libraries and all the external parser libraries into a single runnable jar with a GUI and a command line interface.
They then go into details about which packages you're supposed to add for development purposes.
The first thing I tried was tika-core with tika-parsers-standard-package (as suggested here). I tried tika-app for a comparison out of curiosity mostly. In both cases, building the solution or project is painfully slow, with a couple minutes being spent at "Build started..." and a couple more minutes at "1>------ Build started: Project: IKVM_Testing, Configuration: Debug Any CPU ------" in the Output window.
I appreciate you taking a look at it for me, but this may be a case where nothing can be done about it.
Heh. Yeah. Okay, I got it reproduced on my end. Tika, with all the parsers, has 87 different dependencies. That's 87 JARs that need to be downloaded and individually converted to assemblies.
It looks like the holdup on subsequent builds is checking the cache itself. The cache is organized by hash of the transpiler information per-JAR. For instance, a JAR built with 20 different dependencies, is cached as long as the 20 dependencies themselves are cached, etc. Because any change in the graph could produce different results.
It's just taking a long time trying to even figure out if they've changed. Let alone dealing with them if they have.
There are probably some optimizations I can put in here. Will look.
So it sounds like there is still a benefit of our project doing once what people would otherwise need to do on every build.
On Mon, Oct 24, 2022, at 1:54 PM, Jerome Haltom wrote:
> Heh. Yeah. Okay, I got it reproduced on my end. Tika, with all the parsers, has 87 different dependencies. That's 87 JARs that need to be downloaded and individually converted to assemblies.
> It looks like the holdup on subsequent builds is checking the cache itself. The cache is organized by hash of the transpiler information per-JAR. For instance, a JAR built with 20 different dependencies, is cached as long as the 20 dependencies themselves are cached, etc. Because any change in the graph could produce different results.
> It's just taking a long time trying to even figure out if they've changed. Let alone dealing with them if they have.
> There are probably some optimizations I can put in here. Will look.
Sure. For the next couple hours or so, maybe. Heh. A speed problem around a cache is hardly a terminal bug in software.
IKVM 8.2.3 much improves the cache lookup speed.
I confirm that the build time is much faster. For tika-core with parsers, it was taking about 5 minutes to build an empty .NET 6 console project before, and now it's taking about 20 seconds. Thanks for fixing it so quickly!
It's at about 3 seconds for me. And I think I can get it down more. But this should be at least usable. Remember though: jar -> dll conversions are globally cached. So the same JAR, and input options (frameworks, references, etc) will always return from the cache. Even between solutions and projects. So, if you create a new Console project, from scratch, after already having used IkvmReference, it will return from the cache, even though it's a new project.
For somebody with no cache, it's still going to have to do the JAR -> DLL conversion.
@gomep342, I was having problems getting the Tika core to find the parsers (probably due to missing CLASSPATH), so I just built everything into one dll targeting .NET 6, and that works for me. I did not make a pull request as my solution felt like a one-off for my own situation.
Hey, this is a great project! I was trying to add some Word document processing to my latest .NET program and came across two options. The first is this project, which is great but unfortunately doesn't support .NET Core, and I found it difficult to get this pull request branch to compile and work in a project. The second is GroupDocs.Parser, unfortunately a commercial piece of software with a very limited trial, so not very useful for a project I was only doing for fun (what a party pooper).
I noticed IKVM has been revived recently too, and I threw together a small proof of concept that fits my needs, where I am able to use IKVM and Tika to parse doc, docx, and pdf files - https://github.com/souramoo/TikaOnDotNet - along with examples of how to use it in C#, for anyone who needs this functionality until this pull request gets merged!
I initially tried using IKVM.Maven.Sdk as suggested above, but this actually optimised a bit too much, leaving out the Office and OOXML parsers. So I basically just made a quick Java app that drags these dependencies into a main function, and then adding a reference to this jar in my csproj file (along with the IKVM dependency) got everything working :)
@souramoo On the references. You need to include them exactly how you would in Maven. If, for example, Maven lists them as optional dependencies, MavenReference won't pull them in. However, if Maven does list them as optional dependencies, you can add them as MavenReferences, and they'll be ordered correctly.
If they're optional references but upstream forgot to actually add them as optional references, though, a bug should be filed upstream.
@wasabii thanks for the advice and great job on everything with ikvm-revived - it's very exciting stuff!
I was including both tika-core and tika-parsers-standard-package using MavenReference directly in my project, but for some reason IntelliSense was pointing out that org.apache.tika.parsers.microsoft.OfficeParser was not available (despite being present in the jar file, I opened it up and checked!), even though other parsers such as AutoDetectParser were available.
On top of that, AutoDetectParser does some weird stuff to detect which Parser classes are included in the classpath, which doesn't seem to quite work in IKVM, so I had to specify the parser manually (by building a quick and dirty function based on file extension).
By making another jar that drags in the parsers I want in the main function I think I convinced IKVM not to optimise the parser classes away (presumably because the main classes in the original package jar did not use the OfficeParser or OOXMLParser class directly)
Well, it'd be good to resolve some of those issues. The goal is for it to work exactly like Maven would, as far as possible.
For instance, IkvmReference defaults to the "app domain assembly class loader" feature, where the static assemblies "believe" that they live inside a ClassLoader where their direct references are available first, followed by any other assemblies loaded into the current app domain (or available in the DependencyContext on Core).
So, if they do stuff like read resources (classLoader.getResource('')), it should scan properly: first directly referenced assemblies; second the entire loaded app domain.
Making sure the compiler knows about the direct assemblies is what's important for them to be found. Since JAR files themselves do not contain any dependency information like that (pre JDK9), IkvmReference correctly listing
@wasabii Any news on this? we would like to transition to .Net Core
@Arextion I don't see any remaining issues for IKVM on this thread that prevent Tika from running. Is there any I'm not aware of? It's been months since I tried.
@wasabii I haven't tried running this, I was just curious if this PR would soon be merged, or if there would be a release to try out.
Oh, no idea. I don't work on the TikaOnDotNet project. I'm the IKVM guy.
I am pretty sure you can just install the appropriate Tika libraries out of Maven inside of your C# project now, and it should work (barring new bugs in the last few months heh). Looks like I tried last year and it was fine.
Just tried. I don't know really how Tika works, but adding a MavenReference:
<MavenReference Include="org.apache.tika:tika-core" Version="2.8.0" />
Resulted in Tika and its dependencies being compiled, added, and seeming to work. Though I didn't get too deep into it.
@KevM Any news on this?
@Arextion, using MavenReference is something you can just do in your own project.
> @KevM Any news on this?
Sounds like people are getting it working using native features of ikvm `<MavenReference .../>`. So. I'd love to get a proof of concept from someone or better yet a PR I can take action on to demonstrate how to do this.
> @Arextion, using MavenReference is something you can just do in your own project.
Oh you mean without using TikaOnDotnet at all?
Uh huh.
> > @KevM Any news on this?
> Sounds like people are getting it working using native features of ikvm `<MavenReference .../>`. So. I'd love to get a proof of concept from someone or better yet a PR I can take action on to demonstrate how to do this.
YES please.
@Arextion Install https://www.nuget.org/packages/IKVM.Maven.Sdk/1.4.1, then add MavenReference to your project like I just showed.
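For anyone following along, a minimal sketch of what such a project file might look like, assuming an otherwise empty .NET 6 console app. The package and Maven versions are the ones mentioned in this thread; everything else is ordinary SDK-style project boilerplate and not something confirmed here.

```xml
<Project Sdk="Microsoft.NET.Sdk">

  <PropertyGroup>
    <!-- Assumption: a plain .NET 6 console app, as used elsewhere in this thread -->
    <OutputType>Exe</OutputType>
    <TargetFramework>net6.0</TargetFramework>
  </PropertyGroup>

  <ItemGroup>
    <!-- Enables MavenReference items in this project -->
    <PackageReference Include="IKVM.Maven.Sdk" Version="1.4.1" />
    <!-- Maven coordinates are groupId:artifactId, with the version as a separate attribute -->
    <MavenReference Include="org.apache.tika:tika-core" Version="2.8.0" />
  </ItemGroup>

</Project>
```

Note that tika-core alone does not include the parsers; which additional artifacts are needed depends on the formats you want to handle.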
> > @KevM Any news on this?
> Sounds like people are getting it working using native features of ikvm `<MavenReference .../>`. So. I'd love to get a proof of concept from someone or better yet a PR I can take action on to demonstrate how to do this.
I think this is what I did in my proof of concept repo at https://github.com/souramoo/TikaOnDotNet :) (which despite its name does not use anything from this project and directly uses maven references)
@souramoo Looks like you used IkvmReference. So you have a Tika .JAR file sitting in your source repo, and you use IkvmReference to refer to that. So, similar. MavenReference just fronts all that for you by automatically managing artifact dependencies from Maven.
Actually, your poc is kinda weird. It installs IKVM.Maven.Sdk, but never uses it.
@KevM @wasabii I just tried with a .Net 6 project with MavenReference using:
```xml
<MavenReference Include="org.apache.tika:tika-core" Version="2.8.0" />
<MavenReference Include="org.apache.tika:tika-parsers-standard-package" Version="2.8.0" />
```
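But it won't load any parsers it seems. I'm getting an empty result.
```csharp
using System.IO;
using ikvm.io;                   // InputStreamWrapper
using org.apache.tika.metadata;  // Metadata
using org.apache.tika.parser;    // Parser, AutoDetectParser, ParseContext
using org.apache.tika.sax;       // BodyContentHandler

using var fs = new FileStream(filePath, FileMode.Open);
using var stream = new InputStreamWrapper(fs);
BodyContentHandler handler = new BodyContentHandler();
Parser parser = new AutoDetectParser();
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
parser.parse(stream, handler, metadata, context);
string text = handler.toString();
```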
Any clue why?
I got the same result - it's why I had to make a wrapper piece of code (https://github.com/souramoo/TikaOnDotNet/blob/main/java_src/src/main/java/Main.java) to generate a jar file containing all the dependencies and prevent the parsers being optimised away, hence why I wasn't able to figure out how to use it directly from Maven
Nothing is optimized away. At a minimum you've got to know what Tika expects, though. I don't, since I don't know Tika. How does it locate parsers?
I got the same error when trying to load a parser directly. It just never appeared in the namespace and I got a class not found error, as if the standard parsers package was never loaded.
So creating a new OfficeParser instance in C#, for example, was generating a class not found exception. This went away when I combined everything into a single jar to load into IKVM.
So, I don't know how Tika is laid out. But, if the upstream authors have published a set of artifacts, and described those dependencies correctly, MavenReference will use them. Except for POM bundles. Those don't work.
If the upstream authors have missing dependencies in their packages, it won't, and you'd have to manually specify those. And open a bug with them, preferably.
Here's what I added:
```xml
<MavenReference Include="org.apache.tika:tika-core" Version="2.8.0" />
<MavenReference Include="org.apache.tika:tika-serialization" Version="2.8.0" />
<MavenReference Include="org.apache.tika:tika-parsers-standard-package" Version="2.8.0" />
<MavenReference Include="org.apache.tika:tika-parser-zip-commons" Version="2.8.0" />
<MavenReference Include="org.apache.tika:tika-parser-text-module" Version="2.8.0" />
<MavenReference Include="org.apache.tika:tika-parser-pdf-module" Version="2.8.0" />
<MavenReference Include="org.apache.tika:tika-parser-image-module" Version="2.8.0" />
<MavenReference Include="org.apache.tika:tika-parser-xml-module" Version="2.8.0" />
<MavenReference Include="org.apache.tika:tika-parser-microsoft-module" Version="2.8.0" />
```
Then it worked.
I have no idea what contains what or why.
So, advice here:
Yes, you can build your own JAR files. This works fine with IKVM as it stands today. Probably won't work when we eventually get JDK9 support (since JAR files can be named modules in JDK9+).
You should prefer to look over the dependencies for the library you want to use, and understand the parts, and add a MavenReference to only those parts you actually need. Same as any Java user would do. Same as you'd do for NuGet packages.
You should NOT publish resulting IKVM assemblies to NuGet. Please. Ever. There's a big notice at the end of the IKVM README.md file for why not. If you are using MavenReference, it is safe to publish your packages to NuGet.org.
I'd love it if someone wrote up a Readme with all of this knowledge so that others can leverage it. It sounds like this project needs to go on hiatus if y'all get things figured out. This is fine with me but I'd like to update the docs to educate people on the correct way to IKVM in the modern era.
In response to an email sent specifically by Erik Gavriluk, I have uncovered two issues in IKVM.Maven.SDK which could impact ya'll. Both of which have patches incoming.
1) Transitive dependencies between projects were broken in 1.5.0. This means if you have a MavenReference in ProjectB, referenced by ProjectA, that dependency wasn't making it over. This was trivial to work around by just specifying the MavenReference in both projects. But a hotfix is being published shortly that solves it.
2) Incorrectly culling all but one unified artifact version. This showed up in a few places with slf4j-api. If LibraryA referenced slf4j-api:2.6.0, and LibraryB referenced slf4j-api:2.7.0, only the first reference to slf4j-api was preserved and unified (to 2.7.0). So LibraryB would compile without the reference added. This will be fixed in the same hotfix.
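The workaround for the first issue (specifying the MavenReference in both projects) might look roughly like the sketch below. The project names and relative path are hypothetical, and tika-core 2.8.0 is used only because it appears earlier in this thread. Repeating the IKVM.Maven.Sdk PackageReference in ProjectA is an assumption about what is needed for the item to be processed there.

```xml
<!-- ProjectB.csproj (hypothetical): the library that actually uses the Java code -->
<ItemGroup>
  <PackageReference Include="IKVM.Maven.Sdk" Version="1.5.0" />
  <MavenReference Include="org.apache.tika:tika-core" Version="2.8.0" />
</ItemGroup>

<!-- ProjectA.csproj (hypothetical): references ProjectB and repeats the same
     MavenReference so the dependency is present on 1.5.0 -->
<ItemGroup>
  <ProjectReference Include="..\ProjectB\ProjectB.csproj" />
  <PackageReference Include="IKVM.Maven.Sdk" Version="1.5.0" />
  <MavenReference Include="org.apache.tika:tika-core" Version="2.8.0" />
</ItemGroup>
```

Once the hotfix is installed, the duplicate entry in ProjectA should no longer be necessary.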
There are other things ya'll need to be fully aware of when using IKVM.Maven.Sdk: we are just a wrapper for Maven. If the author of a library published something broken in Maven, it's going to break IKVM.Maven.Sdk. And it might break it slightly differently than it would in Java.
One common example is underspecified dependencies. IKVM.Maven.Sdk relies on dependencies specified in Maven being correct, so that we can generate assemblies which properly reference each other. But, this might not break Java users, as Java JAR files have no actual dependencies: only the classes might depend on other classes, and this is only discoverable at runtime if the code is accessed. For instance, if an author forgets to depend on, like, commons-logging, but nobody ever runs the code path that needs commons-logging, it'll work just fine in Java. It will also work if the users end up having commons-logging for some other reason, like they added it explicitly, or are using some other library which depends on it. In those cases, on Java, the .jar files will be added to the CLASSPATH, and it'll work fine. But, IKVM needs that information to generate assemblies.
In this case, the only true fix is to report the problem to the upstream authors of the library you are using, and have them properly fix their dependencies.
Second, assembly name generation. IKVM.Maven.Sdk relies on the "automatic module name" specification of JDK9+ in order to choose assembly names. But, the situation with modules in Java is a bit weird. First, they don't exist in JDK8. Second, even in JDK9+ they are "sort of optional". That is, you can load a JAR file by specifying it on the MODULEPATH, or by adding the JAR to the CLASSPATH. This opens the situation where an upstream author may provide invalid module information: but nobody notices because all the users are using the CLASSPATH.
I have discovered at least one instance of this in Tika: `tika-parser-crypto-module-2.8.0.jar`.
Notice the file name is `tika-parser-crypto-module-2.8.0.jar`. However, if you open the JAR, and look in `META-INF/MANIFEST.MF` at the `Automatic-Module-Name` line, you'll notice the value is `org.apache.tika.parser.code`. This value is incorrect. The crypto-module JAR has a module name for the parser code JAR. That's wrong. Each JAR file should have its own unique module name.
As a consequence, IKVM attempts to name the assembly for `tika-parser-crypto-module` as `org.apache.tika.parser.code.dll`. But, it also attempts to name the assembly for `org.apache.tika.parser.code` as `org.apache.tika.parser.code.dll`. Resulting in two assemblies with the same name. IKVM then adds a reference to both to your .NET project. Except they have the same name. So they get copied into bin/ and clobber each other.
This is a bug in Tika upstream. It needs to be reported to Tika upstream and fixed there. There's not much I can do about it.
@wasabii thanks for the writeup and time spent on this!
.Net Core Support
We have long wanted to add support for .Net Core and earlier this year IKVM was finally "revived" to have support for .Net Core. At first, I gave up because `ikvmc.exe` didn't seem to work at all (and still does not for our use case). But @dylanlangston created a proof of concept using IKVMReference and msbuild to extract dotnet assemblies from the tika .jar file.
Nugets
Do we need to target .Net 6?
Tests
All tests but one are passing. For some reason parsing our test .rtf file throws a java `UnsatisfiedLinkError` exception:
If anyone has an idea what this might be related to please help!😖
Build / Deployment Automation
We are going to move away from Paket and the F# build automation to use GitHub Actions to build/test and deploy nugets. I'd like updating the version of Tika to be a simple update of a version file. We are close with what @dylanlangston started for us.
Tests are "mostly" passing with plain msbuild and me hammering out this at the command line to produce a tika nuget.
Nuget Packaging
The nuget has been updated to better represent the license, readme location, project url, and finally I've added an icon.
Icon
I spent 30 seconds creating an icon to prettify the Nuget presentation. If anyone would like to improve on what I started, please do. I am not a design person.