KevM / tikaondotnet

Use the Java Tika text extraction library on the .NET platform
http://kevm.github.io/tikaondotnet/
Apache License 2.0

Build fails when source was downloaded from GitHub and not cloned via Git #52

Closed aahock closed 8 years ago

aahock commented 8 years ago

And it is very frustrating, to have 9 people who have spent so much time in order to publish something like this, only to have it wasted because no one can use the library as it stands. There is no way to download libraries, so one is left to attempt to compile. And in 30+ years of coding, I've never seen so much difficulty just to get something to compile.

Just to start with:

  1. The build references an old version of the tikalib jar files. This needs to be updated.
  2. When one downloads git on a windows 8.1 machine, git is inexplicably installed in a users\username\appdata\local\github\ (???????) folder! What the heck is up with that? To make matters even more dubious, git attaches a folder named 9w8pwoeropweiurpowisdsoiufs[o or something like this, where git is actually installed. So adding it to a path is not a simple task, especially when the user name has spaces in it. It's just a bad idea to hide an app in a user folder. I have no clue what the thinking is behind this.
  3. After 3 days of trying all kinds of things, including verifying that git IS, in fact, in my path, I can STILL not get this to compile. All I receive is the following error, no matter what I do:

Checking Paket version (downloading latest stable)...
Paket.exe 3.3.6 is up to date.
Paket version 3.3.6.0
0 seconds - ready.
Building project with version: LocalBuild
Shortened DependencyGraph for Target RunTests:
<== RunTests
<== Build
<== CompileTikaLib
<== SetVersions
<== Clean

The resulting target order is:


Build Time Report

Target    Duration
Clean     00:00:00.0022169
Total:    00:00:00.0875770

Status: Failure

1) System.Exception: Could not run "git rev-parse HEAD".
Error: Start of process git.exe failed. The system cannot find the file specified
   at Fake.Git.CommandHelper.runSimpleGitCommand@89-2.Invoke(String message) in C:\code\fake\src\app\FakeLib\Git\CommandHelper.fs:line 89
   at Fake.Git.CommandHelper.runSimpleGitCommand(String repositoryDir, String command) in C:\code\fake\src\app\FakeLib\Git\CommandHelper.fs:line 89
   at Fake.Git.Branches.getSHA1(String repositoryDir, String commit) in C:\code\fake\src\app\FakeLib\Git\Branches.fs:line 32
   at Fake.Git.Information.getCurrentSHA1(String repositoryDir) in C:\code\fake\src\app\FakeLib\Git\Information.fs:line 60
   at FSI_0005.Build.clo@74-2.Invoke(Unit _arg2)
   at Fake.TargetHelper.runSingleTarget(TargetTemplate`1 target) in C:\code\fake\src\app\FakeLib\TargetHelper.fs:line 492

C:\downloads\tikaondotnet-master\tikaondotnet-master>git rev-parse HEAD
fatal: Not a git repository (or any of the parent directories): .git

C:\downloads\tikaondotnet-master\tikaondotnet-master>

Can someone please help? I notice that the last comment was made over 3 months ago, referencing, it seems very similar issues, and in that time, nothing has been resolved.

I am NOT compiling this in visual studio, as was specified in your readme. From the command line, I type 'build' in the tika download folder. And no matter if git is in my path or not in my path, I still get the above error.

  4. Also, it seems there are files trying to get compiled that don't even exist in the sources. I'll get nailed for saying this, but it's a complete mess. How can someone feel good about the quality of the conversion, if there are so many errors in just getting it to compile? I realize the work you folks have done in this, but what's the point if no one can actually use it? Isn't THAT the whole point?

aporquez commented 8 years ago

I didn't encounter the errors you mentioned, except for the outdated version of Tika.

This is what I did on my machine:

[Image: tika]

TechnikEmpire commented 8 years ago

@aahock Are you aware that you can use a precompiled version of this project via NuGet? Unless you absolutely need some super innovative shiny new feature of a newer version of Tika, just right click your project->Manage Nuget Packages->Search "TikaOnDotNet"->Click Install->Fin.

@KevM Thanks for putting this project together btw, came in mighty useful for me in a crunch when another PDF lib failed me (I'm using the built-in PDFBox of course).

KevM commented 8 years ago

@TechnikEmpire, @akenonakamura 💯 Thanks for your help!

@aahock I hope your day is going better. I worked really hard to make this build automated. I am sorry you had some problems with the automation around Git on your machine. It seems you downloaded the source and didn't clone the repo. Easy to do when GitHub makes it so easy.

Git is only used to inject the current commit's SHA into the generated assembly's metadata. Just an accounting technique if someone forgets to rev the version number. If you'd like to submit a PR to make Git optional in the build script this is the place to start:

https://github.com/KevM/tikaondotnet/blob/master/build.fsx#L74-L80
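For anyone picking up that PR, the general idea of "make Git optional" can be sketched as below. This is only an illustration in Java, not the F# in build.fsx and not FAKE's API: the class name `GitSha`, the method `currentShaOrDefault`, and the `"unknown"` placeholder are all assumptions made up for this example. The sketch runs `git rev-parse HEAD` and falls back to a placeholder when git.exe cannot be started or the folder is not a Git repository (the two failures shown in the logs above).

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: the names here are illustrative and do not exist in the project.
final class GitSha {

    // Returns the SHA printed by "git rev-parse HEAD", or the fallback value
    // when git cannot be started or the command exits with an error.
    static String currentShaOrDefault(File repoDir, String fallback) {
        ProcessBuilder pb = new ProcessBuilder("git", "rev-parse", "HEAD");
        pb.directory(repoDir);
        pb.redirectErrorStream(true); // read stdout and stderr through one stream
        try {
            Process process = pb.start(); // throws IOException if git is not on the PATH
            String firstLine;
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
                firstLine = reader.readLine();
                while (reader.readLine() != null) {
                    // drain any remaining output so the process can exit
                }
            }
            int exitCode = process.waitFor();
            // A non-zero exit code covers "fatal: Not a git repository".
            if (exitCode == 0 && firstLine != null && !firstLine.trim().isEmpty()) {
                return firstLine.trim();
            }
            return fallback;
        } catch (IOException e) {
            return fallback; // git missing, or the working directory does not exist
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return fallback;
        }
    }

    public static void main(String[] args) {
        // Prints a commit SHA inside a clone, or "unknown" in an unpacked zip download.
        System.out.println(currentShaOrDefault(new File("."), "unknown"));
    }
}
```

With a fallback like this, a build started from a zip download would stamp the assembly with the placeholder instead of failing in the version step.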

KevM commented 8 years ago

"Thanks for putting this project together btw, came in mighty useful for me in a crunch when another PDF lib failed me (I'm using the built-in PDFBox of course)."

@TechnikEmpire 😻 So glad it helped you out in a pinch. You made my day.

aahock commented 8 years ago

Hi KevM. Yeah, sorry about the obviously frustrated programmer message. ;-) Glad you didn't take it personally. And yes, I downloaded the zip file, as I've always done before when accessing a library on git, and it failed. I am currently trying to clone it. I have a slow connection (I live on a boat, so have to use wifi and it's only fast for the first 8gb of the month), so it will take a while.

As to where git is installed, the version I downloaded (git gui and command line for windows) didn't ask me where to install it! It just 'did it'. No option for 'custom install' that I could see. And as I stated above, it installed it in a folder similar to: c:\users\a* h*\data\libraries\opiuop34u53o4u53o4ui53o4u53pou53o4u5\github\ I kid you not.

And even finding that was a journey into never-never land. I come from the days of DOS, and have kept up with the changes in windows since, oh, win 2.x (yes, that buggy pos). But I had to go to my start menu folder on the hard disk, see where git had installed the start menu items, and then start investigating from there. I do finally have it in my path, but I hope the path doesn't explode from all the spaces and crap in the git path.

Anyway, thanks for responding guys. And also about the libraries. NO I DIDN'T KNOW I COULD DO THAT! That was my first desire...just to get the libraries. Wish that was explained somewhere in the git page...

aahock commented 8 years ago

Ack. I spoke too soon. So, I want to download the library using NuGet. I go to VS 2012 (all I got...times is tough!), go to NuGet, and search for Tika. Here's what I get:

  1. Tika App. (Well, sounds good, but what's the diff between this and...)
  2. TikaOnDotNet, or:
  3. TikaOnDotNetTextExtractor or:
  4. Tika Text/Content Extraction Library????

I mean, geez. Four freaking downloads that, to this fool, all look like the same freaking thing...except for the ONE (2) that actually READS like it's what I would want from the title, but it states it is 'barebones'...whatever that means.

I really don't mean to be a thorn in everyone's side, but this whole process should be a LOT easier than what I have experienced...and I can't be the only one who's gone through this. I may be the loudest, though...;-)

Can anyone point me to the correct package? I'd rather just have the whole library with all functionality. Then I can decide what I can get rid of. I'm going to be using this as a text extraction tool for searching files of all possible types (doc, docx, xlsx, ppt, pdf, all kinds of compressed files, etc.)...

aahock commented 8 years ago

Okay. Someone needs to change the title of this, because I just finished TRYING to clone this and compile tika and I REC'D THE SAME ERROR I DID when I downloaded it!

Checking Paket version (downloading latest stable)...
Paket.exe 3.4.0 is up to date.
Paket version 3.4.0.0
0 seconds - ready.
Building project with version: LocalBuild
Shortened DependencyGraph for Target RunTests:
<== RunTests
<== Build
<== CompileTikaLib
<== SetVersions
<== Clean

The resulting target order is:


Build Time Report

Target    Duration
Clean     00:00:00.0031846
Total:    00:00:00.0947874

Status: Failure

1) System.Exception: Could not run "git rev-parse HEAD".
Error: Start of process git.exe failed. The system cannot find the file specified
   at Fake.Git.CommandHelper.runSimpleGitCommand@89-2.Invoke(String message) in C:\code\fake\src\app\FakeLib\Git\CommandHelper.fs:line 89
   at Fake.Git.CommandHelper.runSimpleGitCommand(String repositoryDir, String command) in C:\code\fake\src\app\FakeLib\Git\CommandHelper.fs:line 89
   at Fake.Git.Branches.getSHA1(String repositoryDir, String commit) in C:\code\fake\src\app\FakeLib\Git\Branches.fs:line 32
   at Fake.Git.Information.getCurrentSHA1(String repositoryDir) in C:\code\fake\src\app\FakeLib\Git\Information.fs:line 60
   at FSI_0005.Build.clo@74-2.Invoke(Unit _arg2)
   at Fake.TargetHelper.runSingleTarget(TargetTemplate`1 target) in C:\code\fake\src\app\FakeLib\TargetHelper.fs:line 492

C:\downloads\tikaondotnet\tikaondotnet>

I give up. Geez. When I tried to download using the package manager console, it added tika libs to the wrong project, and no matter WHAT I DO, I cannot get it attached to the correct project in my solution. What a mess.

aahock commented 8 years ago

Also, when one tries to UNINSTALL tikaondotnet from a VS Project, it does NOT uninstall all the IKVM packages. Just the tika library. SO I've tried BOTH ways of doing this, and not one will work. Heck, because there were 4 options from nuGet I don't even know if I downloaded the correct one.

aahock commented 8 years ago

Geez. I JUST wanted, 4 days ago, to download tikaondotnet library so I could use it. What a freaking nightmare. And waste of 4 days of my life I want back!

TechnikEmpire commented 8 years ago

@aahock Okay, let me explain a few things. First, you only need the TikaOnDotNet package.

Second, IKVM is required for TikaOnDotNet, and here is why. IKVM is a port of the full Java OpenJDK made for .NET. What this does is makes it so you can run any Java code directly inside of a .NET application. Apache Tika isn't just a Java library, it's a massive collection of Java libraries.

This brings me to my next point. This is a huge collection of Java libs, and it's not a trivial task to get such a collection compiled and running in IKVM. I have a fair bit of experience using IKVM to port over Java libs, so I'm speaking from experience. This is also why I was so thankful to @KevM for putting this project together, because it's not simple.

That comes to the next point. This is 100% free and open source; @KevM has done this entirely for free and without restriction. Free work, basically. I can tell you, and not trying to be rude at all here, the difficulties you're having are a result of your own inexperience and not a deficiency with this project. I realize you said you're not trying to be a thorn, and you're frustrated, granted, but your language is pretty negative towards this project; it's putting down the free work that the repo owner has done.

Keep calm, don't let your frustrations turn into rants against the people you need to help you and you're going to get help!

Another thing I'd like to mention is that whenever you're commenting on a repository, absolutely everyone watching the repository will get an email with your remarks. On this repo there's 14 people, one of them me. Just for your info.

As far as GIT goes, uninstall whatever you installed right away and remove it from your path. You should not be manually adding anything to your path. If you install the Github Desktop App, it will take care of installing the proper GIT and will create shortcuts to a "Git Shell", which you can open and use whenever you need GIT in your path.

Lastly, uninstall Visual Studio 2012 and download Visual Studio 2015 community edition. It's a 100% free version of a fully functional, professional version of Visual Studio. It's free for independent developers and open source software development teams. I abandoned my paid version of Visual Studio 2013 Pro in exchange for this, so I can assure you it is as I've advertised.

Stick to Nuget packages. Manually compiling projects from source, especially large projects, is incredibly difficult if you don't have experience with it. I have projects of my own that are extreme pains in the arse to build, taking more than an hour to fully build from scratch when you factor in all the configuration you need to do, and that's with my experience of working on those projects for two years full time.

Install the TikaOnDotNet package in Visual Studio 2015 and then start writing some code by looking at the examples here. The examples given are in Java, but remember that thanks to IKVM, you're now running Java inside .NET. You will need to change things such as removing throws declarations from function definitions if you simply copy and paste Java code; the IDE will whine about it and guide you into making these minor fixes.

Oh and that's one more point, if Nuget doesn't remove IKVM or any other package, you can just select that package and remove it as well inside the Nuget Package Manager. The reason it didn't get removed is because it was installed as a dependency, and removing dependent packages doesn't remove their dependencies, you need to do it manually.

Some resources:

https://www.youtube.com/watch?v=8pVrkbgyqgg

https://www.youtube.com/watch?v=F8sx49NdCNk

If you care about what IKVM is and how it works:

https://channel9.msdn.com/Events/Lang-NEXT/Lang-NEXT-2012/IKVM-NET-Building-a-Java-VM-on-the-NET-Framework

aahock commented 8 years ago

Look. I know very well about open-source projects. I was working on open source back in the 90s and early 00s. And unfortunately, I'm not a member of the PC crowd, so when something is messed up, I point it out.

Not experienced? You must mean with github. And you're right. About that. But I am very experienced with getting projects to build from little information. I've been doing it for decades. However, there are projects that fail miserably, and they do because no one can actually do anything with the code.

Just as an example. All most people would want to do with this package (if they knew about it, that is) is to do text extracting from files. But there isn't one example that shows a user how to do this. Not one complete file. I haven't even downloaded anything successful yet, and I went to look at those examples, and would have no idea where to find any of those functions. From what I saw of the ikvm and tika download, (which btw, was one of four tika nuget downloads...I still don't even know if it was the correct one), there are hundreds of potential import files you could include in your apps. How would anyone know which one holds the functions they want to use? Do we have to download ILSpy or something and navigate through the libraries until we find what we're looking for?

Just having one example that shows a complete .net file doing something, would be a large improvement over what you have now...which is a complete mess. You might not want to hear it, but as an outside observer, with no bone in this battle, I can assure you that what you have presented to us, the public, is unsafe for use in anyone's project that expects any kind of success.

As to the java, I've been working with java since 1996. It was announced in 10/95, and as usual, I was an early adopter. Worked in the advanced technology group for a year at fidelity investments working with folks from sun, including the dude that wrote the original 'java threads'. So I'm not unfamiliar with any of the other tools here, have probably been working with java longer than some of you have been alive.

Of course, that doesn't make me any smarter than any of you, or stupider. It just means I have lots of experience with nascent projects. Of course, it hasn't helped at all that I downloaded git as you suggested I needed to, and now you are telling me that the version of git I downloaded is somehow not right? I downloaded it from the git download page! The version for windows 8.1 that included the gui and the command line interface. Why would I need to download yet another version? And VS 2012 has absolutely nothing to do with this issue. If I attempted to download via nuget, I would still see four different tika nuget packages. I would still end up with tika downloading into the wrong project. The only thing tika cares about from vs is msbuild, which I can assure you is there.

The bottom line is a user has spent 4+ days attempting to get this library just to install. I now see that even if I did get it to install, I would have to spend another week at least, figuring out where the metadata, parser and extraction functions/classes are. It just isn't worth it. If this project ever gets worked on by someone who understands the minimum level of expertise and familiarity that people should have to have is miles less than what is showing today. Sorry if you don't like the truth, but trying to blame the user for the sloppiness of the presentation is a waste of time, because anyone who independently is able to read the discourse back and forth here (of course, if you don't delete it forever), will see that there are issues involved in this that have nothing to do with whether git is installed correctly or not (i.e. git is installed correctly. I've been installing things long enough to know when something isn't installed right). Asking me to download yet another copy of git is mind-boggling, considering the lack of success I've had downloading anything else you've recommended.

KevM commented 8 years ago

@aahock You do not need to build this project. You can simply install the Nuget package and move on. 👼 The build automation is in place to make updating to newer versions of Tika turn key. 🔑

I think @TechnikEmpire was very helpful to you and the guidance given was spot 🎯 on.

I am sure you have a depth of technical experience. For the last 4 days 📆 it seems like you've been a bit out of your comfort zone when it comes to consuming Nuget packages. We live in a different package 📦 ecosystem now from the old days of COM and .dll hell (I was there too in the trenches). Please do take time to come up-to-speed on how things are done now if you want to consume other's work. Please do this before ranting 😮 in public spaces.

Anyone can install our Nuget package and be up extracting text from documents 📃 in moments. I am sorry that this was not your experience. Take a moment to reflect and learn from your failure here. I will do the same to see what I can do to improve things for future users. 🙇

TechnikEmpire commented 8 years ago

@aahock "If this project ever gets worked on by someone who understands the minimum level of expertise and familiarity that people should have to have is miles less than what is showing today."

Now we're slinging insults. Let me settle this pretty easily. Using this project, I've written in a matter of days a custom piece of software that is actively being used by a massive book publisher to automatically analyze and suggest edits in books before hitting the press. Given that fact, who doesn't have expertise here?

The old "I've been doing this for a super duper long time" assertion is a huge tell of one fact: you're deficient and instead of wanting to admit it, you're trying to put others down and prop yourself up. You literally only forgot to start off with "now listen here, boy". People here have tried to help you and instead of learning, you're throwing a tantrum and screaming at everyone.

There very well might be some improvements to be made to this repository. However, given the fact you don't seem to possess the observational and deductive reasoning skills to see that installing any one of the other three NuGet packages installs "TikaOnDotNet" as a dependency, meaning that it's obviously the core project, or that this project isn't written at all by @KevM, but is rather a light wrapper around a Java project called "Apache Tika", which again by using deductive reasoning skills means you need to look at that project for examples and documentation..... those issues are definitely not going to be resolved by you.

Thanks for crapping on everyone's day. With this attitude it's no wonder that you're incapable of being taught.

Oh and you should check out https://www.logicallyfallacious.com/tools/lp/Bo/LogicalFallacies/50/Argument_from_Age. Age doesn't make anything better, and in fact probably means it's worse, unless we're talking about cheese and wine.

KevM commented 8 years ago

Locking down this issue as it is no longer productive. Thanks to all participants for weighing in. We'll use the wisdom gained to make things better here. @aahock please feel free to contribute down the road. It would mean a lot to me if we could find a positive outlet for your skills and energies.