dkpro / dkpro-tc

UIMA-based text classification framework built on top of DKPro Core and DKPro Lab.
https://dkpro.github.io/dkpro-tc/

Add XGBoost as MLA #460

Closed Horsmann closed 6 years ago

Horsmann commented 6 years ago

http://xgboost.readthedocs.io/en/latest/

Horsmann commented 6 years ago

@reckart XGBoost has no Maven release. There is a third-party fork, which doesn't seem to work, and I don't think there will be a working release in the future.

Building and using the binary directly does work. Furthermore, the binary has something like a version through the release on GitHub (0.7). So I am thinking of integrating this tool as a self-built binary.

This of course introduces the problem of linking to third-party libraries. I am not entirely sure, but I think gcc dependencies were acceptable?

I have these dependencies in the binary atm:

    /usr/local/opt/gcc/lib/gcc/7/libstdc++.6.dylib (compatibility version 7.0.0, current version 7.24.0)
    /usr/local/opt/gcc/lib/gcc/7/libgomp.1.dylib (compatibility version 2.0.0, current version 2.0.0)
    /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1252.0.0)
    /usr/local/lib/gcc/7/libgcc_s.1.dylib (compatibility version 1.0.0, current version 1.0.0)
Horsmann commented 6 years ago

@reckart

On Linux, I have the following dependencies when not building statically:

    linux-vdso.so.1 =>  (0x00007ffdbff3c000)
    libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007fa1cbdb2000)
    libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fa1cbaa9000)
    libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1 (0x00007fa1cb887000)
    libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fa1cb671000)
    libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fa1cb454000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fa1cb08a000)
    /lib64/ld-linux-x86-64.so.2 (0x00007fa1cc134000)
    libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fa1cae86000)

From what I have read so far, you probably do not want to statically link all these dependencies. As a matter of fact, when compiling with the -static linker flag, the compiled binary crashes.

I copied the compiled binary over to other Linux machines we have running, and it works there. This is of course dangerous to some extent, but in this case I would probably not link statically.

@reckart Any thoughts on this? This module would be easy to integrate from the Java/TC side; it's just getting the binary prepared that troubles me a bit.
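One way to check whether the dynamic dependencies listed above resolve on another machine is to scan the `ldd` output for entries marked `not found`. The following is only a minimal sketch: the class and method names are hypothetical, and the `libgomp.so.1 => not found` line in the sample is made up for illustration, not taken from an actual run.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of DKPro TC or XGBoost.
class LddOutputCheck {

    // Matches ldd lines such as "libgomp.so.1 => not found".
    private static final Pattern NOT_FOUND =
            Pattern.compile("^\\s*(\\S+)\\s+=>\\s+not found\\s*$");

    /** Returns the names of all libraries that ldd reported as unresolved. */
    static List<String> unresolved(List<String> lddLines) {
        List<String> missing = new ArrayList<>();
        for (String line : lddLines) {
            Matcher m = NOT_FOUND.matcher(line);
            if (m.matches()) {
                missing.add(m.group(1));
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // The second line is an invented example of a missing library.
        List<String> sample = Arrays.asList(
                "    libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fa1cbaa9000)",
                "    libgomp.so.1 => not found");
        System.out.println(unresolved(sample));
    }
}
```

An empty result only means that the libraries were located; it says nothing about whether their versions are new enough for the binary.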

reckart commented 6 years ago

Some basic POSIX libraries like the ones you mention above should not be linked statically - but then they seldom change their APIs, so that should be fine. I had a peek at the XGBoost site and saw that there is a script to create a JAR file with binaries for multiple OSes, which could be uploaded to a Maven repo (e.g. JCenter) - I didn't check whether you made any use of this in your integration. What solution did you end up going with?

Horsmann commented 6 years ago

I compiled the binaries for the respective OS platforms manually. I didn't see the JAR version you mentioned. Unless this JAR (do you have a link?) works as-is, I would continue with compiling the binaries.

At the moment I have as open issues here:

reckart commented 6 years ago

See https://github.com/dmlc/xgboost/issues/1807#issuecomment-342637198

Horsmann commented 6 years ago

Thx. This script downloads the dynamic libraries for the 3 platforms into the JAR, but the binaries are not included.

I think I will continue with the binary compiling. @reckart Do you want to have a look regarding the static compilation on Linux? I had no luck with creating a working version.

[Screenshot, 2018-03-15 20:08:28]

reckart commented 6 years ago

My understanding is that XGBoost has Java bindings (XGBoost4J) which make use of these shared libraries directly without having to go through other CLI binaries.

Horsmann commented 6 years ago

There are some packages on Maven Central, but from other developers: https://search.maven.org/#search%7Cga%7C1%7Cxgboost4j. This is one of those tools whose developers won't release to Maven.

I think it's more work to get the Java version onto Maven. It would be unfortunate if I don't get the Windows 32-bit version working, but I don't think there are that many 32-bit Windows systems left anyway. I still favor the binary-building way because of the lower workload. With respect to the dynamic dependencies, the Java version does not really seem to offer many advantages anyway?

reckart commented 6 years ago

I assume that you currently build binaries that you then call from Java like command line tools. This creates a new process every time and you have to parse the output of the tool from stdout or from a file. Calling the code directly via its native interface avoids the overhead of creating a new process and also the need to parse data as you can work directly with native objects.
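The process-based approach described above can be sketched roughly as follows. This is an illustration only, not the actual DKPro TC integration: the class name is hypothetical, and the example runs the JVM's own `java -version` as a stand-in for the external binary, simply because that command exists wherever the code runs.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of DKPro TC.
class ExternalToolRunner {

    /** Exit code plus the captured output lines of one run. */
    static final class Result {
        final int exitCode;
        final List<String> lines;

        Result(int exitCode, List<String> lines) {
            this.exitCode = exitCode;
            this.lines = lines;
        }
    }

    /** Starts a new OS process, collects its output and waits for termination. */
    static Result run(List<String> command) throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        builder.redirectErrorStream(true); // merge stderr into stdout
        Process process = builder.start();

        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line); // the caller still has to parse this text
            }
        }
        return new Result(process.waitFor(), lines);
    }

    public static void main(String[] args) throws Exception {
        // Stand-in command; a real integration would call the compiled binary instead.
        String java = System.getProperty("java.home") + File.separator + "bin" + File.separator + "java";
        Result result = run(Arrays.asList(java, "-version"));
        System.out.println("exit code: " + result.exitCode);
        result.lines.forEach(System.out::println);
    }
}
```

Every call pays for a process start and for turning text back into values, which is the overhead a native binding avoids.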

Horsmann commented 6 years ago

Yes, I start a new process and wait for its termination. This is some overhead, true. I think I am still in favor of the binary way, since I have done most of the work now. Furthermore, this third-party wrapper, which probably did something similar to this JAR with libs (http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/xgboost.html), says at the bottom of the page that it is not working for Windows. I haven't looked into it in detail, but I don't think this small merit is worth the additional effort of re-coding the interface to fit this JAR file.

Horsmann commented 6 years ago

@reckart The UKP Linux Jenkins is apparently missing the required GLIBC version. Is this something you could install on the UKP Jenkins?

/lib/x86_64-linux-gnu/libm.so.6: version `GLIBC_2.23' not found (required by /tmp/dkpro2774502877448388853runtime/xgboost?

reckart commented 6 years ago

The build server is still running Debian 8 (Jessie) which only has an older version of libc6 (2.19).
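The mismatch is a plain version comparison: the binary requires glibc 2.23, while the server provides 2.19. A minimal sketch of such a check is shown here; the class and method names are hypothetical, and the code only compares dotted version strings without detecting the installed glibc itself.

```java
// Hypothetical helper, not part of DKPro TC.
class GlibcVersionCheck {

    /** Compares dotted version strings numerically, component by component. */
    static int compareVersions(String a, String b) {
        String[] partsA = a.split("\\.");
        String[] partsB = b.split("\\.");
        int length = Math.max(partsA.length, partsB.length);
        for (int i = 0; i < length; i++) {
            // Missing components count as 0, so "2" equals "2.0".
            int x = i < partsA.length ? Integer.parseInt(partsA[i]) : 0;
            int y = i < partsB.length ? Integer.parseInt(partsB[i]) : 0;
            if (x != y) {
                return Integer.compare(x, y);
            }
        }
        return 0;
    }

    /** True if the installed version is at least the required one. */
    static boolean satisfies(String installed, String required) {
        return compareVersions(installed, required) >= 0;
    }

    public static void main(String[] args) {
        // Values from this thread: Debian 8 ships 2.19, the binary needs 2.23.
        System.out.println(satisfies("2.19", "2.23"));
    }
}
```

Comparing the components as numbers matters because a plain string comparison would rank 2.9 above 2.23.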

Horsmann commented 6 years ago

Would you update this lib? On our Jenkins the linux test case is passing.

reckart commented 6 years ago

It would require updating the entire VM. Cannot promise when that will be done.

Horsmann commented 6 years ago

I see. What is the best way to deal with the library issue? As-is, there won't be any stable builds in the near future. Removing the test cases leaves a bad taste, too.

Windows 32-bit doesn't seem to be supported; at least I cannot compile a working version. From what I have seen so far, all tutorials use 64-bit mode. The inception of the project is from 2016, so I am not surprised that this is not supported. The Linux 32-bit binary is in the package.
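Selecting the matching pre-built binary at runtime could look roughly like the sketch below. The directory names (`linux-x86_64` and so on) and the class name are assumptions made up for illustration, not the actual layout of the binary package. The sketch treats 32-bit Windows as unsupported, as described above; treating 32-bit macOS the same way is an additional assumption.

```java
import java.util.Locale;

// Hypothetical helper, not part of DKPro TC; directory names are invented.
class PlatformBinaryResolver {

    /** Maps the os.name / os.arch system properties to a binary directory name. */
    static String platformDir(String osName, String osArch) {
        String os = osName.toLowerCase(Locale.ROOT);
        String arch = osArch.toLowerCase(Locale.ROOT);
        boolean is64 = arch.equals("amd64") || arch.equals("x86_64");
        boolean is32 = arch.equals("x86") || arch.equals("i386") || arch.equals("i686");

        if (os.contains("linux")) {
            if (is64) return "linux-x86_64";
            if (is32) return "linux-x86_32";
        } else if (os.contains("mac")) {
            if (is64) return "osx-x86_64";
        } else if (os.contains("windows")) {
            if (is64) return "windows-x86_64"; // no 32-bit Windows binary available
        }
        throw new UnsupportedOperationException(
                "No bundled binary for " + osName + " / " + osArch);
    }

    public static void main(String[] args) {
        try {
            System.out.println(platformDir(
                    System.getProperty("os.name"), System.getProperty("os.arch")));
        } catch (UnsupportedOperationException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Failing early with a clear message on an unsupported platform is easier to diagnose than a crash when the process is started.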

reckart commented 6 years ago

Well, I see three options:

Horsmann commented 6 years ago

The Windows binary might have the same problem. I think I installed the Visual C++ 2015 redistributable to have a certain .dll available. According to the documentation, the Windows Jenkins has the 2010 package installed - I assume you can't update this either?

reckart commented 6 years ago

I think on Windows it is less of an issue, as it should be sufficient to just install the package (no OS upgrade necessary). Do you have a link handy?

Horsmann commented 6 years ago

Here: https://www.microsoft.com/de-de/download/details.aspx?id=48145

reckart commented 6 years ago

@Horsmann ok, I have installed the VC2015 redistributable package on the Windows build server.

Horsmann commented 6 years ago

Build passed :)

Horsmann commented 6 years ago

@reckart I uploaded a wrong artifact to the UKP repository. There is one binary, org.dkpro.tc.ml.xgboost-bin-20171230.2, that has org.dkpro.tc.ml as its group ID. Sorry - could you delete this please?

reckart commented 6 years ago

Deleted.

Horsmann commented 6 years ago

thx