Regarding (1) I think it is most reasonable to have a Java facade and do all the work on the Python side. We would have to agree on a data format for document/unit and sequence classification, i.e. we write a fixed output format to disk and the Python scripts would then transform it into an N-dimensional numpy array.
Any opinions on this matter?
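Since no concrete format has been agreed on yet, here is a minimal sketch of what the Python side of such a bridge could look like, assuming a hypothetical one-document-per-line layout (`gold_label<TAB>token token ...`); the layout, names, and padding choices are illustrative only, not something TC already implements.

```python
# Minimal sketch, assuming a hypothetical format written by the Java side:
# one document per line as "gold_label<TAB>token token token ...".
import numpy as np

def load_documents(path, max_len=100):
    vocab = {"<PAD>": 0, "<UNK>": 1}
    labels = {}
    x_rows, y_rows = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            label, text = line.rstrip("\n").split("\t", 1)
            ids = [vocab.setdefault(tok, len(vocab)) for tok in text.split()]
            ids = ids[:max_len] + [0] * max(0, max_len - len(ids))  # pad/truncate
            x_rows.append(ids)
            y_rows.append(labels.setdefault(label, len(labels)))
    return np.array(x_rows, dtype="int32"), np.array(y_rows, dtype="int32"), vocab, labels
```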
Can you break down "all" the things that could be done on the Java/Python side into specific steps? That would probably generate more feedback about which particular steps people would like to see either on the Java or Python side.
Hi Tobias,
I am just curious, and you've surely had many discussions on this topic already - discussions which I've missed - but why Keras? There is Deeplearning4J (http://deeplearning4j.org/), a Java-based DL framework, which is also open-source. So, I am wondering what the reasons were for going with Keras?
Cheers,
Martin
@mwunderlich
Just following the trend that other machine learning projects picked up Keras instead of DL4j. Apart from the conversation on the mailing list there has not been much discussion so far.
DL4j is still evolving strongly, with frequent releases of new versions; when I last checked, deeplearning4j-nlp jumped from version 0.4 in July to 0.6 in September. This is a bit problematic, as we would have to depend on a specific version when releasing, and it is more than likely that we would be outdated by then.
Going with Keras would essentially shield us from being bound to a specific version: we would just take what we find on the system, and Keras can be updated by the user if necessary. Furthermore, the work of the other ML projects allows peeking into their code, which might turn out to be helpful :)
DL4j is something for the future, but right now I would say it's too early.
"all" the things that could be done on the Java/Python side into specific steps?
For the time being I see essentially two big steps, which might break up into smaller ones over time. The only Java/Python choice is the data transformation step, i.e. how to bridge the two worlds.
@Horsmann I see, thanks a lot for the clarification, Tobias. I agree it makes sense to go with something that is more stable and then look into DL4J maybe later again.
I think the definition of the network should be in Python-land.
Generally, I would tend towards Java preparing the vectors and Keras just reading them. However, there can be very high redundancy in the vectors and a great waste of time and space. Thus, I think we would probably be better off having a tabular file format where we define the transformation of columns into vectors in a programming-language-independent way, e.g. "column 1 uses embeddings from file X", "column 2 uses one-hot encoding", etc. I don't know if there already is a suitable file format supporting this kind of binding of vector semantics to columns that we could use.
The format should allow us to create further backend implementations, e.g. using DL4J, Factorie, etc.
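To my knowledge there is no established format for this; as a thought experiment, such a binding could be a small declarative spec that both a Java and a Python backend interpret. The keys and file names below are made up purely for illustration.

```python
# Made-up illustration of a column-to-vectorization binding; none of these keys
# or file names correspond to an existing format.
import json

spec = {
    "columns": [
        {"index": 0, "encoding": "embedding", "resource": "glove.6B.50d.txt"},
        {"index": 1, "encoding": "one-hot"},
        {"index": 2, "encoding": "label"},
    ]
}

# Writing the spec as JSON keeps the column semantics outside any single
# programming language, so a DL4J or Keras backend could read the same file.
with open("vectorization-spec.json", "w", encoding="utf-8") as f:
    json.dump(spec, f, indent=2)
```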
Note that DL4J is working on a Keras-like wrapper for Scala. This is still in early alpha and I am not sure if any degree of compatibility with Keras/Tensorflow/Theano (e.g. with respect to models) is planned.
Here's the link to the aforementioned Keras-like wrapper for Deeplearning4j: https://github.com/deeplearning4j/ScalNet
I will continue this issue, too. The idea is to focus on the preprocessing part and provide tools that bring data into the typical data structures you need for deep learning.
For instance, seq-to-seq would translate a sequence of tokens and labels into integer arrays (e.g. PoS tagging). seq-to-label or document classification would create a fixed-size integer array of the words in a document with its corresponding gold label. This prepared data structure then has to be read and cast into the DL platform's format - numpy in the case of Keras. If the arrays are available in the right dimensions, this is probably easier than coding the mapping and conversion into vectors oneself. Another feature would be filtering a provided embedding to only include the words occurring in the train/test setup.
This is the rough functionality I would aim for in the first iteration.
The second step would then be prototyping a small round trip: calling some Keras code with data read in TC and receiving the output.
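To make the seq-to-seq data structure described above concrete, here is a rough sketch of that vectorization step; the padding convention, the `<PAD>` entry, and all names are assumptions rather than the final TC format.

```python
# Rough sketch of the seq-to-seq case (e.g. PoS tagging): aligned token/label
# sequences are mapped to integers and padded to a common length.
import numpy as np

def vectorize_sequences(sequences, max_len):
    """sequences: a list of sentences, each a list of (token, pos_tag) pairs."""
    tok2id, tag2id = {"<PAD>": 0}, {"<PAD>": 0}
    x = np.zeros((len(sequences), max_len), dtype="int32")
    y = np.zeros((len(sequences), max_len), dtype="int32")
    for i, sentence in enumerate(sequences):
        for j, (token, tag) in enumerate(sentence[:max_len]):
            x[i, j] = tok2id.setdefault(token, len(tok2id))
            y[i, j] = tag2id.setdefault(tag, len(tag2id))
    return x, y, tok2id, tag2id

# x and y then already have the dimensions a Keras sequence model expects.
```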
Is this a kind of a generic features -> numeric feature vector translation framework that could be used to connect to any type of DL algo?
I'm thinking specifically of DL4J.
Yes. Actually there is not much more that you can do on the TC side.
I will include a setup for sequence-2-label, such as review-sentiment classification, and sequence-2-sequence, e.g. PoS tagging.
The NN construction and how to make the dimensions fit the respective architecture are left to the user. I expect the user to stick to the contract of writing me a result file with the prediction/gold labels that I can read - the DL part is essentially a preprocessing help (only).
At the moment I work mostly with Keras, but the effort to add DL4J should be minimal. It's just defining a call-stub for TC to start the DL code.
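The exact layout of that result file is not spelled out in this thread; assuming a simple one-pair-per-line `gold<TAB>prediction` layout, the contract on both sides could be as small as the following sketch.

```python
# Hypothetical sketch of the contract: the user's Keras (or DL4J) code writes one
# "gold<TAB>prediction" pair per line, and TC reads it back for evaluation.
# The concrete layout is an assumption, not the specified TC format.
def write_result_file(path, gold_labels, predicted_labels):
    with open(path, "w", encoding="utf-8") as f:
        for gold, pred in zip(gold_labels, predicted_labels):
            f.write(f"{gold}\t{pred}\n")

def read_accuracy(path):
    with open(path, encoding="utf-8") as f:
        pairs = [line.rstrip("\n").split("\t") for line in f if line.strip()]
    return sum(1 for gold, pred in pairs if gold == pred) / len(pairs)
```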
@reckart @zesch @daxenberger I am stuck here with the same problem as in the other branch when trying to implement CV. I need information from the initTask in the inner tasks. What could somehow work is to hack around this and pass the file-system pointer from the initTask into the tasks that need this information. This would certainly leak through the entire code base, with some info being passed Lab-style and some via constructors (and countless ugly if-else checks). It would work now but would probably bite back on various other occasions.
Maybe a more basic question - is importing the key the way I do it in the previous commit of this post even correct?
I need help to get the CV implemented :|
As mentioned in #403, this should be fixed in DKPro Lab now.
@reckart super! thanks. looks good :)
@Horsmann My experience with embeddings was that it took always a lot of time at the beginning of an experiment to load the embeddings into memory. This was rather annoying, in particular when just running small experiments or trying things out.
We have nice classes - BinaryVectorizer/BinaryWordVectorUtils - in the DKPro Core embeddings module that can read/write embeddings in a format which doesn't require loading the whole embeddings file into memory. It uses memory-mapped access. How about using these in DKPro TC too for working with embeddings?
a lot of time at the beginning of an experiment to load the embeddings into memory
At the moment I have a task which filters the embeddings to only contain tokens occurring in the training data. Depending on the size of the data set this is in my experience usually not more than 30 MB (uncompressed plain text) for large data sets. This loads rather quickly?
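For illustration, such a filtering step amounts to roughly the following; the plain-text GloVe layout (word followed by space-separated floats per line) is assumed, and this is not the actual TC task code.

```python
# Sketch of the filtering idea: keep only embedding rows whose word occurs in
# the training/test vocabulary.
def filter_embeddings(embedding_path, vocabulary, output_path):
    kept = 0
    with open(embedding_path, encoding="utf-8") as src, \
         open(output_path, "w", encoding="utf-8") as dst:
        for line in src:
            word = line.split(" ", 1)[0]
            if word in vocabulary:
                dst.write(line)
                kept += 1
    return kept
```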
doesn't require to load the whole embeddings file into memory
I am a bit concerned about interfacing this data format with other DL frameworks. I would want to use the existing classes provided by DL4j or Keras (if available). Especially DL4j already provides a lot of this. I don't want to introduce a TC flavor of DL4j; otherwise I end up writing countless TC-to-DL-framework classes, which also makes it more difficult for users to integrate their code because they are forced to learn the TC flavor of this framework.
The glove vectors are quite a bit larger than 30 MB - more like 500 MB. That takes quite a while.
The code in DKPro Core is easily compatible with DL4J. Actually, historically it comes from the DL4J corner: I was annoyed by the slow reading of embeddings, so I asked on the DL4J gitter channel about faster alternatives. @treo, who hangs out a lot on DL4J, then provided the first draft, which was refined a bit by me and @carschno. It's small code and you already depend on DKPro Core anyway. It's not really a new framework, just an alternative way of loading embeddings.
@reckart I mean that I collect all occurring tokens, read the embeddings once, and throw out all vectors that do not occur in the data. I work with the Glove vectors and they are rarely bigger than 5MB after filtering (it certainly depends on how large the data set is - but I think this is still pretty fast).
The binary word vector loader that @reckart mentions uses a pretty simple format; most of it is just a single low-level float array.
You could easily use the same prefiltering approach with it at the start of an experiment and still never load more than a few pages' worth of data.
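This is not the DKPro Core binary layout itself, but the general idea of a flat, memory-mapped float array can be illustrated with numpy: only the pages that are actually touched get read from disk.

```python
# Illustration of the general idea only (not the DKPro Core binary layout): if the
# vectors are stored as one flat float array, a memory map lets you read just the
# rows you need instead of loading the whole file into memory.
import numpy as np

def open_vectors(path, vocab_size, dims):
    # Returns a (vocab_size, dims) view backed by the file on disk; only the
    # pages that are actually accessed are read.
    return np.memmap(path, dtype="float32", mode="r", shape=(vocab_size, dims))

# vectors = open_vectors("vectors.bin", vocab_size=400000, dims=50)
# row = vectors[word_index]  # touches only a few pages of the file
```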
@reckart Do you have an NLP example of a setup showing how you use DL4j? I am looking for rather simple and straightforward examples for integration/testing. I now have one working example of document classification based on one of the examples in the examples-project of DL4j, but I would like to have something sequence-classification-ish, too - do you have code lying around that you could give away?
@Horsmann check out this one: https://github.com/dkpro/dkpro-core-examples/commit/478eb2b6dc7d365c5f913f9a9d5c65a336f1e038
I am going to merge the deep learning changes into the master branch soon. I will tackle test cases for the frameworks in a separate issue.
We now essentially have two vectorization modes: (i) vectorize to words and (ii) vectorize to integers. You can tell TC to perform the integer mapping, saving you the step of mapping to integers and back yourself. (i) is probably what makes more sense for DL4j, since there is a lot of code available already that assumes you have words; (ii) makes more sense for Keras, where people tend to have to write the mapping themselves.
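To make the difference concrete, the same sentence in the two modes might look roughly like this; the layout is illustrative, not the exact TC output format.

```python
# Illustration of the two vectorization modes for the same sentence.
sentence = ["The", "cat", "sat"]
labels   = ["DET", "NOUN", "VERB"]

# (i) vectorize to words: tokens/labels stay as strings and the DL code
#     (e.g. a DL4j pipeline) does its own lookup.
mode_words = list(zip(sentence, labels))
# [('The', 'DET'), ('cat', 'NOUN'), ('sat', 'VERB')]

# (ii) vectorize to integers: TC performs the mapping (and writes the
#      dictionaries), so e.g. Keras code can map predictions back to labels.
word2id  = {"The": 1, "cat": 2, "sat": 3}
label2id = {"DET": 1, "NOUN": 2, "VERB": 3}
mode_ints = [(word2id[w], label2id[l]) for w, l in zip(sentence, labels)]
# [(1, 1), (2, 2), (3, 3)]
```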
I currently have two processing setups - document classification and sequence classification. I have omitted unit classification for the moment since it is essentially document classification.
You are free to execute the examples. The deep learning examples should run out of the box since this is Java/Maven; for the Keras/DyNet examples you need a setup installed locally.
I'm curious. Why does adding NN support require deleting a ton of unit tests?
I am brute-force unit-testing how to get Docker working when running in a Jenkins environment on Ubuntu. This is quite nasty, and I just don't want to wait 20 minutes every time I change something, so I temporarily deleted all other test cases. I intend to revert that if this Docker stuff ever works.
You could just @Ignore them. I worry you are losing the commit history on the deleted tests.
@reckart I re-added them already ;-). I think I found the issue with Docker. Getting this to run as a JUnit test in a Jenkins job is an extremely huge pain when the development system is OSX and the Jenkins runs on Linux.
Regarding the UKP Jenkins:
@reckart could you execute this script on your Jenkins https://github.com/moby/moby/blob/master/contrib/check-config.sh to see if the system has all mandatory requirements available for Docker?
I prepared a custom Docker image which additionally contains DyNet and is about 3GB in size, and which I use in my test case. Can I put this file onto temporary storage so you can download and install it on the UKP Jenkins? Or which procedure would you prefer?
I thought I had found the problem, but I didn't :/. It looks like I have some timing issues; sometimes the test passes. This looks extremely unstable. In the worst case there will be no unit tests for Keras and I will test the TC code via DL4J only.
Where do you think the timing issue comes from?
@reckart Not sure. After I deleted the other test cases, it originally seemed that tear-down was executed prematurely, or that the API which should block does not block. I removed/commented out the tear-down and the test passes - of course leaving active containers behind on Jenkins, which should not happen, but at least there are no crashes caused by the API.
Now, after re-adding the other test cases, I am getting exceptions again. This is strange. I think I am still far away from understanding what the underlying issue is. Testing against a Jenkins is just a huge pain. None of these issues come up on my OSX dev system; everything works just fine there.
I prepared a custom docker image which contains additionally DyNet (about 3GB) in size that I use in my test case. Can I put this file onto a temporary storage and you download and install it on the UKP Jenkins? Or which procedure would you prefer?
Can you put it on Docker Hub?
Add Keras as a deep learning framework. Keras (https://keras.io) is an easier-to-use deep learning project which uses Theano or Tensorflow in the background. Furthermore, it is a Python implementation.
The basic idea is to write an output file to disk that is suited for Keras, run training, retrieve the output, and feed it back into TC to evaluate it. At some point we have to bridge between the Java and Python worlds.
Open questions are: (1) How much Java do we want? We can do almost everything on the Python side, making TC a thin Java facade. The opposite is possible too: doing all preparation work for running Keras on the Java side and then only calling Keras.
(2) How to pass through resources, i.e. word embeddings (https://github.com/dkpro/dkpro-core/issues/805).
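A minimal sketch of what the Keras end of such a round trip could look like, assuming TC has already written padded integer matrices to disk and expects a `gold<TAB>prediction` result file back; the file names and the toy architecture are placeholders, not the actual integration.

```python
# Minimal sketch of the Keras end of the round trip. Assumes TC has already
# written padded integer matrices to disk; all file names and the toy
# architecture are placeholders.
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Embedding, GlobalAveragePooling1D

x_train = np.load("x_train.npy"); y_train = np.load("y_train.npy")
x_test  = np.load("x_test.npy");  y_test  = np.load("y_test.npy")

model = Sequential([
    Embedding(input_dim=int(x_train.max()) + 1, output_dim=50),
    GlobalAveragePooling1D(),
    Dense(int(y_train.max()) + 1, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(x_train, y_train, epochs=5, batch_size=32)

# Write gold<TAB>prediction pairs so TC can read them back for evaluation.
predictions = model.predict(x_test).argmax(axis=-1)
with open("predictions.txt", "w", encoding="utf-8") as f:
    for gold, pred in zip(y_test, predictions):
        f.write(f"{int(gold)}\t{int(pred)}\n")
```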
2017-05-20
Current structure (Train/Test):
- PreparationTask
- EmbeddingTask
- VectorizationTask
- DeepLearning Task

ToDo / Questions / Notes:
- DL Frameworks
- Jenkins
- Testing