Closed: aneesh-joshi closed this issue 5 years ago.
We're going to drop Python 2 support. What do you think?
By the end of this year, pandas, numpy, and a number of other libraries will stop supporting Python 2.* as well.
That's fine. In my case, I run MatchZoo with Python 3.5 without any problem :+1:
An example use case for 2.0:

```python
import matchzoo as mz

# Load and preprocess the training data.
queries, documents, labels = load_training_data()
queries, documents, labels = mz.fit_transform(queries, documents, labels)

# Initialize a Deep Structured Semantic Model (DSSM).
dssm_model = mz.DSSM()
dssm_model.compile(optimizer='sgd', learning_rate=0.01, metrics=['accuracy'])

# Train the model; **kwargs such as num_epochs are supported.
dssm_model.fit([queries, documents], labels)
dssm_model.save('your-model-save-path.h5')
```
```python
import matchzoo as mz

# Load the test data to be predicted.
queries, documents = load_test_data()
# Apply the same fit_transform as in training.
queries, documents = mz.fit_transform(queries, documents)

# Initialize a Deep Structured Semantic Model (DSSM),
# load the pre-trained model, and make predictions.
dssm_model = mz.DSSM()
dssm_model = dssm_model.load('your-model-save-path.h5')
dssm_model.predict([queries, documents])
```
Add docstrings for all functions and classes
That's already in our quality control pipeline.
Make MZ OS independent
Mostly Python's job. All we need to do is be aware of OS-dependent library calls.
Make MZ usable by providing custom data
1.0 already has it, just kinda difficult. I guess you want it easier, and that's our plan.
Allow External Benchmarking
Benchmarking what?
docker, conda, virtualenv support (wishlist)
I think this one is really important. Personally, I use pipenv while developing, but that's not the group's decision. The group's decision is just to satisfy the CI for now; we will come back to environments later.
> Mostly Python's job. All we need to do is be aware of OS-dependent library calls.
Bash files like run_data.sh do not work on Windows. All bash scripts (and there are quite a few) will have to be converted to Python.
Alternatively, we could provide both a bash and a .bat file, though that seems like too much.
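As a sketch of what the conversion could look like, a data-setup script like run_data.sh could become a small, cross-platform Python script; everything here (directory layout, file names) is hypothetical:

```python
# Hypothetical sketch of a run_data.sh replacement: pure-Python,
# OS-independent file handling via pathlib instead of bash.
import tempfile
from pathlib import Path

# Stand-in data directory (the real script would use the repo layout).
data_dir = Path(tempfile.mkdtemp()) / 'data'
data_dir.mkdir(parents=True, exist_ok=True)

# Stand-in for the download/unpack steps the bash script performs.
(data_dir / 'corpus.txt').write_text('query\tdocument\tlabel\n')

print(sorted(p.name for p in data_dir.iterdir()))
# ['corpus.txt']
```

Path separators, directory creation, and file writes all work identically on Windows, Linux, and macOS this way, which is exactly what the bash version cannot offer.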
@uduse
> 1.0 already has it, just kinda difficult. I guess you want it easier, and that's our plan.
This should be as easy as:

```python
query_term_vector = dssm_model.translate("When was Abraham Lincoln born?")
doc_term_vector = dssm_model.translate("AL was born in xxx")
# query_term_vector : [[10, 15, 75, ...]]
# doc_term_vector   : [[63, 35, 5, ...]]
print(dssm_model.predict_similarity(query_term_vector, doc_term_vector))
# 0.9
```
Propose new model: Learning Text Similarity with Siamese Recurrent Networks
A TensorFlow implementation: https://github.com/dhwajraj/deep-siamese-text-similarity
@aneesh-joshi It will be mostly Python; at the very least, users won't need to touch OS-dependent scripts. And yes, it will be as easy as that.
Update June 12, 2018:
What we have done for MZ 2.0
What we are doing now for MZ 2.0
Future Work
Update June 19, 2018
We decoupled the list of models into a separate repository: awaresome-neural-models-for-semantic-match. If you discover nice papers related to neural semantic matching, feel free to send a PR.
Hey guys,
Another suggestion:
We need save/load functionality in MZ like that of keras.
Using keras' model.save is a bit challenging because we have Lambda layers, which don't serialize well.
Although keras claims this has been resolved, I am still unable to save.
Maybe somebody can check whether keras can now save Lambda layers, or eliminate Lambdas altogether.
We also can't just pickle the MZ model, as the Keras documentation states:
> It is not recommended to use pickle or cPickle to save a Keras model.
Moreover, a lot of functions have unnecessary Lambda layers. For example, there is a Lambda layer that just adds a softmax activation to a Dense layer; this can be done by passing activation='softmax' to the Dense layer itself. Is there an obvious reason I am missing here? Let me know what you think.
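To make the point concrete, here is a minimal sketch with tensorflow.keras (layer sizes are arbitrary): the Lambda-wrapped softmax and the built-in activation parameter compute the same thing, but only the second form avoids the Lambda serialization issue:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(4,))

# Pattern in question: a Dense layer followed by a Lambda just for softmax.
with_lambda = layers.Lambda(lambda t: keras.activations.softmax(t))(
    layers.Dense(3)(inputs))

# Equivalent, serialization-friendly form: fold the activation into Dense.
with_activation = layers.Dense(3, activation='softmax')(inputs)

model = keras.Model(inputs, [with_lambda, with_activation])
out_a, out_b = model.predict(np.ones((2, 4)), verbose=0)

# Both outputs are valid softmax distributions: each row sums to 1.
print(np.allclose(out_a.sum(axis=1), 1.0), np.allclose(out_b.sum(axis=1), 1.0))
```

The two layers are untrained here, so their weights differ; the point is only that the activation parameter subsumes the Lambda wrapper.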
@aneesh-joshi Saving a complete keras model is tricky, and I had a lot of trouble doing so as well. However, there's a workaround. Saving model weights rarely fails. If we can make the model-building process deterministic based on matchzoo model parameters, and save those model parameters to a file successfully, then we can avoid the tricky part of saving a model: we just build a model with the exact same structure on the fly and then load the weights. The state of the optimizer will be lost in that process, but that's much less important than saving the model weights correctly.
@uduse I was hoping to avoid using parameters, config, etc from saved files in 2.0. In my opinion, it adds extra moving parts and reduces readability. (I might be wrong in this thought.)
Alas, I am also currently struggling with saving keras models with Lambdas, and it is a pain. If we can't find a better solution, we'll have to move to your solution.
@aneesh-joshi Reduces readability of what? If you mean the source code, then yes, it does a bit, but doing so only takes about 15 more lines. If you mean the saved file itself, then the readability of the file saved by model.save is essentially zero. However, if we separate weights and parameters, the weights file still has zero readability, but the parameter file can be very readable. Right now it's a dill file, but we can later wrap it in YAML and make it human friendly. Also, having a separate parameter file makes it easier for others to reproduce your work.
@uduse I meant source code readability. What you say makes sense, but I am not sure I completely understand you. Once the model is trained, it has:

1. The keras trained model, which includes a) weights, b) network topology, c) optimizer state.
2. MZ's metadata, like a) word-index dicts, b) ??
You are suggesting we save the keras model's weights as provided by keras' model.save, and then use dill for the network topology, later wrapping it in YAML to make it readable? (I am not too familiar with dill and YAML.)
What about the MZ metadata?
If a user wants to use the saved model, he'll probably have to run a script or function which collects these different saved entities, builds the network topology, puts in the weights, and adds the word-dict translation.
Do I understand correctly?
> The keras trained model, which includes a) weights, b) network topology, c) optimizer state.
This part is correct. What I mean is to save the weights to a file through keras.models.Model.save_weights and re-build the network topology (model structure) from the parameters. As long as the re-build process is deterministic, we can create a model that looks exactly the same as the original one, so it can load the weights file without any problem.
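A minimal sketch of that flow (names hypothetical, with a plain dict standing in for the keras graph and stdlib pickle standing in for dill):

```python
import os
import pickle
import tempfile

# Hypothetical parameter table: everything needed to rebuild the model.
params = {'model_class': 'DSSM', 'hidden_sizes': [300, 128], 'vocab_size': 10000}

def build_model(table):
    """Deterministic build: the same table always yields the same structure."""
    # Stand-in for the real keras graph construction.
    return {'structure': (table['model_class'], tuple(table['hidden_sizes']),
                          table['vocab_size'])}

workdir = tempfile.mkdtemp()

# Save: parameters and weights go to separate files.
with open(os.path.join(workdir, 'params.pkl'), 'wb') as f:
    pickle.dump(params, f)
# (In the real flow: model.save_weights(os.path.join(workdir, 'weights.h5')).)

# Load: rebuild the structure from the parameters, then attach the weights.
with open(os.path.join(workdir, 'params.pkl'), 'rb') as f:
    loaded = pickle.load(f)
rebuilt = build_model(loaded)
# (In the real flow: rebuilt.load_weights(...), which succeeds because the
# structure is identical to the saved model's.)

print(rebuilt['structure'] == build_model(params)['structure'])
# True
```

The key property is determinism of build_model: as long as the same parameters always yield the same structure, the separately saved weights can be loaded without the fragile full-model serialization.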
> MZ's metadata, like a) word-index dicts, b) ??
That's not part of the model. Either those get their own save methods, or the users have to handle it themselves.
> You are suggesting we save the keras model's weights as provided by keras' model.save and then use dill for the network topology and later wrap it in YAML to make it readable? (I am not too familiar with dill and YAML.)
Currently we use dill for the network topology. It's not really the topology itself, though; rather, we can re-build the exact same topology from the same parameters. The parameter table is like a simple dictionary. If it contains only primitive data types, it can be directly converted to a YAML file, and when you edit the YAML file, the change will be reflected in the loading process as well. The problem is that the parameter table doesn't contain only primitive data types; it can also hold things like an optimizer instance, lambdas, etc. It is definitely doable, it just needs more hacking.
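To illustrate the primitive-vs-non-primitive distinction (using the stdlib json module here in place of YAML; yaml.safe_dump/safe_load behave the same way for primitive types):

```python
import json

# A primitive-only parameter table round-trips through a human-readable
# text format without loss.
params = {'name': 'DSSM', 'learning_rate': 0.01, 'hidden_sizes': [300, 128]}
text = json.dumps(params, indent=2)

# A user can edit the text, and the change flows into the loading process.
edited = text.replace('0.01', '0.001')
print(json.loads(edited)['learning_rate'])
# 0.001

# Non-primitive entries (optimizer instances, lambdas) are where plain
# serialization breaks down and something like dill is needed instead.
try:
    json.dumps({'optimizer': object()})
except TypeError:
    print('non-primitive entry is not serializable')
```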
I recommend reading the code here.
Update June 25, 2018
What has been done:
- `base_model`.
- `ProcessorUnit` (including several child processors) for data pre-processing.
- `DataPack` implementation.
- `DSSM` model.

Update July 25, 2018
- `ProcessorUnits`.
- `base_preprocessor` class; all model-wise processors will inherit from this class.
- `base_generator`; all generators will inherit from this class.
- `point-generator`.

Update Aug 07, 2018
- `pip install matchzoo`.
- Try out MatchZoo 2.0 API HERE!
Update Aug 23, 2018
We're working on:
Update Sep 16 2018
We're working on:
Update Oct 07 2018
Working on:
Update Oct 12 2018
We're working on:
Closed due to inactivity.
Anybody wanting to make suggestions for MZ 2.0, please add them in this issue.
Here are my suggestions:
More details at https://github.com/faneshion/MatchZoo/issues/106