Regarding (1) I think it is most reasonable to have a Java facade and do all the work on the Python side. We would have to agree on a data format for document/unit and sequence classification, i.e. we write a fixed output format to disk and the Python scripts would then transform it into an N-dimensional numpy array.
Any opinions on this matter?
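Since no concrete format has been agreed on yet, here is a minimal sketch of what the Python side of such a bridge could look like, assuming a hypothetical one-document-per-line layout (`gold_label<TAB>token token ...`); the layout, names, and padding choices are illustrative only, not something TC already implements.

```python
# Minimal sketch, assuming a hypothetical format written by the Java side:
# one document per line as "gold_label<TAB>token token token ...".
import numpy as np

def load_documents(path, max_len=100):
    vocab = {"<PAD>": 0, "<UNK>": 1}
    labels = {}
    x_rows, y_rows = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            label, text = line.rstrip("\n").split("\t", 1)
            ids = [vocab.setdefault(tok, len(vocab)) for tok in text.split()]
            ids = ids[:max_len] + [0] * max(0, max_len - len(ids))  # pad/truncate
            x_rows.append(ids)
            y_rows.append(labels.setdefault(label, len(labels)))
    return np.array(x_rows, dtype="int32"), np.array(y_rows, dtype="int32"), vocab, labels
```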
Can you break down "all" the things that could be done on the Java/Python side into specific steps? That would probably generate more feedback about which particular steps people would like to see either on the Java or Python side.
Hi Tobias,
I am just curious, and you've surely had many discussions on this topic already - discussions which I've missed - but why Keras? There is Deeplearning4J (http://deeplearning4j.org/), a Java-based DL framework, which is also open-source. So, I am wondering what the reasons were for going with Keras?
Cheers,
Martin
@mwunderlich
Just following the trend that other machine learning projects picked up Keras instead of DL4j. Apart from the conversation on the mailing list there has not been much discussion so far.
DL4j is still evolving strongly, with frequent releases of new versions; when I last checked, deeplearning4j-nlp jumped from version 0.4 in July to 0.6 in September. This is a bit problematic, as we would have to depend on a specific version when releasing, and it is more than likely that we would be outdated by then.
Going with Keras would essentially shield us from being bound to a specific version: we would just take what we find on the system, and Keras can be updated by the user if necessary. Furthermore, the work of the other ML projects allows peeking into their code, which might turn out to be helpful :)
DL4j is something for the future, but right now I would say it's too early.
"all" the things that could be done on the Java/Python side into specific steps?
For the time being I see essentially two big steps, which might break up into smaller ones over time. The only Java/Python choice is the data transformation step, i.e. how to bridge the two worlds.
@Horsmann I see, thanks a lot for the clarification, Tobias. I agree it makes sense to go with something that is more stable and then look into DL4J maybe later again.
I think the definition of the network should be in Python-land.
Generally, I would tend towards Java preparing the vectors and Keras just reading them. However, there can be very high redundancy in the vectors and a great waste of time and space. Thus, I think we would probably be better off having a tabular file format where we define the transformation of columns into vectors in a programming-language-independent way, e.g. "column 1 uses embeddings from file X", "column 2 uses one-hot encoding", etc. I don't know if there already is a suitable file format supporting this kind of binding of vector semantics to columns that we could use.
The format should allow us to create further backend implementations, e.g. using DL4J, Factorie, etc.
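To my knowledge there is no established format for this; as a thought experiment, such a binding could be a small declarative spec that both a Java and a Python backend interpret. The keys and file names below are made up purely for illustration.

```python
# Made-up illustration of a column-to-vectorization binding; none of these keys
# or file names correspond to an existing format.
import json

spec = {
    "columns": [
        {"index": 0, "encoding": "embedding", "resource": "glove.6B.50d.txt"},
        {"index": 1, "encoding": "one-hot"},
        {"index": 2, "encoding": "label"},
    ]
}

# Writing the spec as JSON keeps the column semantics outside any single
# programming language, so a DL4J or Keras backend could read the same file.
with open("vectorization-spec.json", "w", encoding="utf-8") as f:
    json.dump(spec, f, indent=2)
```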
Note that DL4J is working on a Keras-like wrapper for Scala. This is still in early alpha and I am not sure if any degree of compatibility with Keras/Tensorflow/Theano (e.g. with respect to models) is planned.
Here's the link to the aforementioned Keras-like wrapper for Deeplearning4j: https://github.com/deeplearning4j/ScalNet
I will continue this issue, too. The idea is to focus on the preprocessing part and provide tools that bring data into the typical data structures you need for deep learning.
For instance, seq-to-seq would translate a sequence of tokens and labels into integer arrays (e.g. PoS tagging). seq-to-label or document classification would create a fixed-size integer array of the words in a document with its corresponding gold label. This prepared data structure then has to be read and cast into the DL platform's format - numpy in the case of Keras. If the arrays are available in the right dimensions, this is probably easier than coding the mapping and conversion into vectors oneself. Another feature would be filtering a provided embedding to only include the words occurring in the train/test setup.
This is the rough functionality I would aim for in the first iteration.
The second step would then be prototyping a small round trip: calling some Keras code with data read in TC and receiving the output.
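To make the seq-to-seq data structure described above concrete, here is a rough sketch of that vectorization step; the padding convention, the `<PAD>` entry, and all names are assumptions rather than the final TC format.

```python
# Rough sketch of the seq-to-seq case (e.g. PoS tagging): aligned token/label
# sequences are mapped to integers and padded to a common length.
import numpy as np

def vectorize_sequences(sequences, max_len):
    """sequences: a list of sentences, each a list of (token, pos_tag) pairs."""
    tok2id, tag2id = {"<PAD>": 0}, {"<PAD>": 0}
    x = np.zeros((len(sequences), max_len), dtype="int32")
    y = np.zeros((len(sequences), max_len), dtype="int32")
    for i, sentence in enumerate(sequences):
        for j, (token, tag) in enumerate(sentence[:max_len]):
            x[i, j] = tok2id.setdefault(token, len(tok2id))
            y[i, j] = tag2id.setdefault(tag, len(tag2id))
    return x, y, tok2id, tag2id

# x and y then already have the dimensions a Keras sequence model expects.
```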
Is this a kind of a generic features -> numeric feature vector translation framework that could be used to connect to any type of DL algo?
I'm thinking specifically of DL4J.
Yes. Actually there is not much more that you can do on the TC side.
I will include a setup for sequence-2-label, such as review-sentiment classification, and sequence-2-sequence, e.g. PoS tagging.
The NN construction and how to make the dimensions fit the respective architecture are left to the user. I expect the user to stick to the contract of writing me a result file with the prediction/gold labels that I can read - the DL part is essentially a preprocessing help (only).
At the moment I work mostly with Keras, but the effort to add DL4J should be minimal. It's just defining a call-stub for TC to start the DL code.
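The exact layout of that result file is not spelled out in this thread; assuming a simple one-pair-per-line `gold<TAB>prediction` layout, the contract on both sides could be as small as the following sketch.

```python
# Hypothetical sketch of the contract: the user's Keras (or DL4J) code writes one
# "gold<TAB>prediction" pair per line, and TC reads it back for evaluation.
# The concrete layout is an assumption, not the specified TC format.
def write_result_file(path, gold_labels, predicted_labels):
    with open(path, "w", encoding="utf-8") as f:
        for gold, pred in zip(gold_labels, predicted_labels):
            f.write(f"{gold}\t{pred}\n")

def read_accuracy(path):
    with open(path, encoding="utf-8") as f:
        pairs = [line.rstrip("\n").split("\t") for line in f if line.strip()]
    return sum(1 for gold, pred in pairs if gold == pred) / len(pairs)
```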
@reckart @zesch @daxenberger I am stuck here with the same problem as in the other branch when trying to implement CV. I need information from the initTask in the inner tasks. What could somehow work is to hack around this and pass the file-system pointer from the initTask into the tasks that need this information. This would certainly leak through the entire code base, with some info being passed Lab-style and some via constructors (and countless ugly if-else checks). It would work now but would probably bite back on various other occasions.
Maybe a more basic question - is importing the key the way I do it in the previous commit of this post even correct?
I need help to get the CV implemented :|
As mentioned in #403, this should be fixed in DKPro Lab now.
@reckart super! thanks. looks good :)
@Horsmann My experience with embeddings was that it took always a lot of time at the beginning of an experiment to load the embeddings into memory. This was rather annoying, in particular when just running small experiments or trying things out.
We have nice classes - BinaryVectorizer/BinaryWordVectorUtils - in the DKPro Core embeddings module that can read/write embeddings in a format which doesn't require loading the whole embeddings file into memory. It uses memory-mapped access. How about using these in DKPro TC too for working with embeddings?
a lot of time at the beginning of an experiment to load the embeddings into memory
At the moment I have a task which filters the embeddings to only contain tokens occurring in the training data. Depending on the size of the data set this is in my experience usually not more than 30 MB (uncompressed plain text) for large data sets. This loads rather quickly?
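For illustration, such a filtering step amounts to roughly the following; the plain-text GloVe layout (word followed by space-separated floats per line) is assumed, and this is not the actual TC task code.

```python
# Sketch of the filtering idea: keep only embedding rows whose word occurs in
# the training/test vocabulary.
def filter_embeddings(embedding_path, vocabulary, output_path):
    kept = 0
    with open(embedding_path, encoding="utf-8") as src, \
         open(output_path, "w", encoding="utf-8") as dst:
        for line in src:
            word = line.split(" ", 1)[0]
            if word in vocabulary:
                dst.write(line)
                kept += 1
    return kept
```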
doesn't require to load the whole embeddings file into memory
I am a bit concerned about interfacing this data format with other DL frameworks. I would want to use the existing classes provided by DL4j or Keras (if available). Especially DL4j already provides a lot of this. I don't want to introduce a TC flavor of DL4j; otherwise I end up writing countless TC-to-DL-framework classes, which also makes it more difficult for users to integrate their code because they are forced to learn the TC flavor of this framework.
The glove vectors are quite a bit larger than 30 MB - more like 500 MB. That takes quite a while.
The code in DKPro Core is easily compatible with DL4J. Actually, historically it comes from the DL4J corner: I was annoyed by the slow reading of embeddings, so I asked on the DL4J gitter channel about faster alternatives. @treo, who hangs out a lot on DL4J, then provided the first draft, which was refined a bit by me and @carschno. It's small code and you already depend on DKPro Core anyway. It's not really a new framework, just an alternative way of loading embeddings.
@reckart I mean that I collect all occurring tokens, read the embeddings once, and throw out all vectors that do not occur in the data. I work with the Glove vectors and they are rarely bigger than 5MB after filtering (it certainly depends on how large the data set is - but I think this is still pretty fast).
The binary word vector loader that @reckart mentions uses a pretty simple format; most of it is just a single low-level float array.
You could easily use the same prefiltering approach with it at the start of an experiment and still never load more than a few pages' worth of data.
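This is not the DKPro Core binary layout itself, but the general idea of a flat, memory-mapped float array can be illustrated with numpy: only the pages that are actually touched get read from disk.

```python
# Illustration of the general idea only (not the DKPro Core binary layout): if the
# vectors are stored as one flat float array, a memory map lets you read just the
# rows you need instead of loading the whole file into memory.
import numpy as np

def open_vectors(path, vocab_size, dims):
    # Returns a (vocab_size, dims) view backed by the file on disk; only the
    # pages that are actually accessed are read.
    return np.memmap(path, dtype="float32", mode="r", shape=(vocab_size, dims))

# vectors = open_vectors("vectors.bin", vocab_size=400000, dims=50)
# row = vectors[word_index]  # touches only a few pages of the file
```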
@reckart Do you have an NLP example of a setup showing how you use DL4j? I am looking for rather simple and straightforward examples for integration/testing. I now have one working example of document classification based on one of the examples in the examples-project of DL4j, but I would like to have something sequence-classification-ish, too - do you have code lying around that you could give away?
@Horsmann check out this one: https://github.com/dkpro/dkpro-core-examples/commit/478eb2b6dc7d365c5f913f9a9d5c65a336f1e038
I am going to merge the deep learning changes into the master branch soon. I will tackle test cases for the frameworks in a separate issue.
We now essentially have two vectorization modes: (i) vectorize to words and (ii) vectorize to integers. You can tell TC to perform the integer mapping, saving you the step of mapping to integers and back yourself. (i) is probably what makes more sense for DL4j, since there is a lot of code available already that assumes you have words; (ii) makes more sense for Keras, where people tend to have to write the mapping themselves.
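To make the difference concrete, the same sentence in the two modes might look roughly like this; the layout is illustrative, not the exact TC output format.

```python
# Illustration of the two vectorization modes for the same sentence.
sentence = ["The", "cat", "sat"]
labels   = ["DET", "NOUN", "VERB"]

# (i) vectorize to words: tokens/labels stay as strings and the DL code
#     (e.g. a DL4j pipeline) does its own lookup.
mode_words = list(zip(sentence, labels))
# [('The', 'DET'), ('cat', 'NOUN'), ('sat', 'VERB')]

# (ii) vectorize to integers: TC performs the mapping (and writes the
#      dictionaries), so e.g. Keras code can map predictions back to labels.
word2id  = {"The": 1, "cat": 2, "sat": 3}
label2id = {"DET": 1, "NOUN": 2, "VERB": 3}
mode_ints = [(word2id[w], label2id[l]) for w, l in zip(sentence, labels)]
# [(1, 1), (2, 2), (3, 3)]
```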
I currently have two processing setups - document classification and sequence classification. I have omitted unit classification for the moment since it is essentially document classification.
You are free to execute the examples. The deep learning examples should run out of the box since this is Java/Maven; for the Keras/DyNet examples you need a setup installed locally.
I'm curious. Why does adding NN support require deleting a ton of unit tests?
I am brute-force unit-testing how to get Docker working when running in a Jenkins environment on Ubuntu. This is quite nasty, and I just don't want to wait 20 minutes every time I change something, so I temporarily deleted all other test cases. I intend to revert that if this Docker stuff ever works.
You could just @Ignore them. I worry you are losing the commit history on the deleted tests.
@reckart I re-added them already ;-). I think I found the issue with Docker. Getting this to run as a JUnit test in a Jenkins job is an extremely huge pain when the development system is OSX and the Jenkins runs on Linux.
Regarding the UKP Jenkins:
@reckart could you execute this script on your Jenkins https://github.com/moby/moby/blob/master/contrib/check-config.sh to see if the system has all mandatory requirements available for Docker?
I prepared a custom Docker image which additionally contains DyNet and is about 3GB in size, and which I use in my test case. Can I put this file onto temporary storage so you can download and install it on the UKP Jenkins? Or which procedure would you prefer?
I thought I had found the problem, but I didn't :/. It looks like I have some timing issues; sometimes the test passes. This looks extremely unstable. In the worst case there will be no unit tests for Keras and I will test the TC code via DL4J only.
Where do you think the timing issue comes from?
@reckart Not sure. After I deleted the other test cases, it originally seemed that tear-down was executed prematurely, or that the API which should block does not block. I removed/commented out the tear-down and the test passes - of course leaving active containers behind on Jenkins, which should not happen, but at least there are no crashes caused by the API.
Now, after re-adding the other test cases, I am getting exceptions again. This is strange. I think I am still far away from understanding what the underlying issue is. Testing against a Jenkins is just a huge pain. None of these issues come up on my OSX dev system; everything works just fine there.
I prepared a custom docker image which contains additionally DyNet (about 3GB) in size that I use in my test case. Can I put this file onto a temporary storage and you download and install it on the UKP Jenkins? Or which procedure would you prefer?
Can you put it on Docker Hub?
Add Keras as a deep learning framework. Keras (https://keras.io) is an easier-to-use deep learning project which uses Theano or Tensorflow in the background. Furthermore, it is a Python implementation.
The basic idea is to write an output file to disk that is suited for Keras, run training, retrieve the output, and feed it back into TC to evaluate it. At some point we have to bridge between the Java and Python worlds.
Open questions are: (1) How much Java do we want? We can do almost everything on the Python side, making TC a thin Java facade. The opposite is possible too: doing all preparation work for running Keras on the Java side and then only calling Keras.
(2) How to pass through resources, i.e. word embeddings (https://github.com/dkpro/dkpro-core/issues/805).
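A minimal sketch of what the Keras end of such a round trip could look like, assuming TC has already written padded integer matrices to disk and expects a `gold<TAB>prediction` result file back; the file names and the toy architecture are placeholders, not the actual integration.

```python
# Minimal sketch of the Keras end of the round trip. Assumes TC has already
# written padded integer matrices to disk; all file names and the toy
# architecture are placeholders.
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Embedding, GlobalAveragePooling1D

x_train = np.load("x_train.npy"); y_train = np.load("y_train.npy")
x_test  = np.load("x_test.npy");  y_test  = np.load("y_test.npy")

model = Sequential([
    Embedding(input_dim=int(x_train.max()) + 1, output_dim=50),
    GlobalAveragePooling1D(),
    Dense(int(y_train.max()) + 1, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(x_train, y_train, epochs=5, batch_size=32)

# Write gold<TAB>prediction pairs so TC can read them back for evaluation.
predictions = model.predict(x_test).argmax(axis=-1)
with open("predictions.txt", "w", encoding="utf-8") as f:
    for gold, pred in zip(y_test, predictions):
        f.write(f"{int(gold)}\t{int(pred)}\n")
```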
2017-05-20
Current structure (Train/Test):
- PreparationTask
- EmbeddingTask
- VectorizationTask
- DeepLearning Task

ToDo / Questions / Notes:
- DL Frameworks
- Jenkins
- Testing