NTMC-Community / MatchZoo

Facilitating the design, comparison and sharing of deep text matching models.
Apache License 2.0

Suggestions for MatchZoo 2.0 #154

Closed aneesh-joshi closed 5 years ago

aneesh-joshi commented 6 years ago

Anybody wanting to make suggestions for MZ 2.0, please add them to this issue.

Here are my suggestions:

  1. Add docstrings for all functions and classes
  2. Make MZ OS independent
  3. Make MZ usable by providing custom data
  4. Allow External Benchmarking
  5. docker, conda, virtualenv support (wishlist)

More details at https://github.com/faneshion/MatchZoo/issues/106

bwanglzu commented 6 years ago

We're going to drop Python 2 support, what do you think?

By the end of this year, pandas, numpy, and a number of other libraries will stop supporting 2.* as well.

thiziri commented 6 years ago

That's fine by me; I run MatchZoo with Python 3.5 and there is no problem :+1:

bwanglzu commented 6 years ago

An example use case for 2.0:

import matchzoo as mz

# Train a network.
queries, documents, labels = load_training_data()
# Fit & Pre-processing
queries, documents, labels = mz.fit_transform(queries, documents, labels)
# Initialize a Deep Structured Semantic Model (DSSM).
dssm_model = mz.DSSM()
dssm_model.compile(optimizer='sgd', learning_rate=0.01, metrics=['accuracy'])
# Train model, support **kwargs such as num_epochs.
dssm_model.fit([queries, documents], labels)
dssm_model.save('your-model-save-path.h5')

import matchzoo as mz

# Make prediction.
# Load test data to be predicted.
queries, documents = load_test_data()
# Re-use the transformation fitted during training (do not re-fit on test data).
queries, documents = mz.transform(queries, documents)
# Initialize a Deep Structured Semantic Model (DSSM).
dssm_model = mz.DSSM()
# Load pre-trained model.
dssm_model = dssm_model.load('your-model-save-path.h5')
dssm_model.predict([queries, documents])

uduse commented 6 years ago

Add docstrings for all functions and classes

In our quality control pipeline already.

Make MZ OS independent

Mostly Python's job. All we need to do is be careful about using OS-dependent library calls.

Make MZ usable by providing custom data

1.0 already has it, it's just kind of difficult. I guess you want it to be easier, and that's our plan.

Allow External Benchmarking

Benchmarking what?

docker, conda, virtualenv support (wishlist)

I think this one is really important. Personally I use pipenv while developing, but that's not the group's decision. For now the group's decision is just to satisfy the CI; we will come back to environment support later.

aneesh-joshi commented 6 years ago

Mostly python's job. All we need to do is to be aware of using OS dependent lib calls.

Bash files like run_data.sh cannot work on Windows, so all bash scripts (and there are quite a few) will have to be converted to Python. Alternatively, we could provide both a bash and a .bat file, though that seems like too much. @uduse

aneesh-joshi commented 6 years ago

1.0 already has it, just kinda difficult. I guess you want it easier, and that's our plan.

This should be as easy as

query_term_vector = dssm_model.translate("When was Abraham Lincoln born?")
doc_term_vector = dssm_model.translate("AL was born in xxx")
# query_term_vector : [10, 15, 75, ...]
# doc_term_vector : [63, 35, 5, ...]

print(dssm_model.predict_similarity(query_term_vector, doc_term_vector))
# 0.9

bwanglzu commented 6 years ago

Proposed new model: Learning Text Similarity with Siamese Recurrent Networks

A TensorFlow implementation: https://github.com/dhwajraj/deep-siamese-text-similarity

uduse commented 6 years ago

@aneesh-joshi It will be mostly Python; at the very least, users won't need to touch OS-dependent scripts. And yes, it will be as easy as that.

bwanglzu commented 6 years ago

Update June 12, 2018:

What we have done for MZ 2.0

  1. We created CONTRIBUTING.md to guide users in making contributions.
  2. We integrated CodeCov for MZ 2.0 development. For each pull request, after Continuous Integration passes, CodeCov checks unit test coverage.
  3. We set up a documentation server on Read the Docs: https://matchzoo.readthedocs.io/en/2.0/ . In the future, documentation will be generated from Sphinx-style docstrings.
  4. We created the first version of base_model, which serves as the base class for all MZ 2.0 models.
  5. We created base_task with two child classes inheriting from it, Classification and Ranking; different tasks can use different metrics and loss functions (see the sketch after this list).
  6. We created processors.py, which serves as a toolkit library for text preprocessing.
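
To illustrate item 5, here is a minimal sketch of what the task hierarchy could look like; the names (BaseTask, list_available_losses, list_available_metrics) are assumptions for illustration, not the final MZ 2.0 API.

import abc

class BaseTask(abc.ABC):
    """Hypothetical base class: each task declares its own losses and metrics."""

    @classmethod
    @abc.abstractmethod
    def list_available_losses(cls) -> list:
        ...

    @classmethod
    @abc.abstractmethod
    def list_available_metrics(cls) -> list:
        ...

class Ranking(BaseTask):
    @classmethod
    def list_available_losses(cls) -> list:
        return ['rank_hinge', 'rank_crossentropy']

    @classmethod
    def list_available_metrics(cls) -> list:
        return ['map', 'ndcg']

class Classification(BaseTask):
    @classmethod
    def list_available_losses(cls) -> list:
        return ['categorical_crossentropy']

    @classmethod
    def list_available_metrics(cls) -> list:
        return ['accuracy']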

What we are doing now for MZ 2.0

  1. We are implementing three models in parallel: dssm (in review), cdssm, and matchpyramid.
  2. We are developing a base_transformer that transforms user input into the expected model input and fits parameters (in review).
  3. We're categorizing the hyper-parameters for each model to be developed; hyper-parameter names should follow the same convention, i.e. be unified.
  4. We're adding hyper-parameter optimization functions to MZ 2.0, similar to GridSearchCV in scikit-learn (see the sketch after this list).
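
A minimal sketch of what such a grid search could look like; build_model and evaluate are hypothetical user-supplied callables, not part of any existing API.

import itertools

def grid_search(build_model, evaluate, param_grid):
    """Exhaustively try every parameter combination and keep the best one."""
    best_score, best_params = float('-inf'), None
    keys = list(param_grid)
    for values in itertools.product(*(param_grid[k] for k in keys)):
        params = dict(zip(keys, values))
        model = build_model(**params)  # deterministic model factory
        score = evaluate(model)        # e.g. validation accuracy or MAP
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score

best_params, best_score = grid_search(
    build_model, evaluate,
    {'learning_rate': [0.1, 0.01], 'hidden_units': [64, 128]})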

Future Work

  1. Implement more models.
  2. Implement base_layer as abstract base layer, and child layers.
  3. Implement model-wise fit-transformers.
  4. Finalize documentation.
  5. Create website with Github Pages.

bwanglzu commented 6 years ago

Update June 19, 2018

We decoupled the list of models into a separate repository: awaresome-neural-models-for-semantic-match. If you discover nice papers related to neural semantic matching, feel free to send a PR.

aneesh-joshi commented 6 years ago

Hey guys, another suggestion: we need save/load functionality in MZ like that of Keras. This is a bit challenging because our models have Lambda layers, which don't serialize well with Keras' model.save. Although Keras claims the issue has been resolved, I am still unable to save such models. Maybe somebody can check whether Keras can save Lambdas, or we could eliminate Lambdas altogether. We also can't just pickle the MZ model, as the Keras documentation states:

It is not recommended to use pickle or cPickle to save a Keras model.

Moreover, a lot of our models use unnecessary Lambda layers, for example a Lambda layer that just adds a softmax activation to a dense layer. This can be done by simply setting the activation parameter of the Dense layer (see the snippet below). Is there an obvious reason I am missing here? Let me know what you think.
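
For reference, a minimal Keras sketch of the two variants (layer sizes are arbitrary placeholders); the second one serializes cleanly because it avoids the Lambda:

from keras import layers
import keras.backend as K

inputs = layers.Input(shape=(32,))
hidden = layers.Dense(64, activation='relu')(inputs)

# Redundant Lambda layer: this is the pattern that breaks serialization.
probs_lambda = layers.Lambda(lambda x: K.softmax(x))(layers.Dense(10)(hidden))

# Equivalent result with the activation passed to Dense directly.
probs_dense = layers.Dense(10, activation='softmax')(hidden)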

uduse commented 6 years ago

@aneesh-joshi Saving a complete Keras model is tricky, and I had a lot of trouble doing so as well. However, there's a workaround: saving model weights rarely fails. If we can make the model-building process deterministic based on MatchZoo model parameters, and save those parameters to a file successfully, then we can avoid the tricky part of saving a model: we just build a model with the exact same structure on the fly and then load the weights (sketched below). The state of the optimizer will be lost in that process, but that's much less important than saving the model weights correctly.
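
A minimal sketch of that workaround, assuming a deterministic build_model(params) factory (a hypothetical name) and dill for the parameter file:

import dill

def save(model, params, weights_path, params_path):
    # Weights almost always save cleanly, unlike the full model.
    model.save_weights(weights_path)
    with open(params_path, 'wb') as f:
        dill.dump(params, f)

def load(build_model, weights_path, params_path):
    with open(params_path, 'rb') as f:
        params = dill.load(f)
    # Re-build an identical structure, then pour the weights back in.
    model = build_model(params)
    model.load_weights(weights_path)  # optimizer state is not restored
    return model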

aneesh-joshi commented 6 years ago

@uduse I was hoping to avoid using parameters, config, etc. from saved files in 2.0. In my opinion, it adds extra moving parts and reduces readability. (I might be wrong about this.)

Alas, I am also currently struggling with saving Keras models with Lambdas, and it is a pain. If we can't find a better solution, we'll have to go with your approach.

uduse commented 6 years ago

@aneesh-joshi reduces readability of what? If you mean the source code, then yes, a bit, but doing so only takes about 15 more lines. If you mean the saved file itself, then the readability of the file saved by model.save is essentially zero. However, if we separate weights and parameters, the weights still have zero readability, but the parameter file can potentially be very readable. Right now it's a dill file, but we can later wrap it in YAML and make it human-friendly. Also, having a separate parameter file makes it easier for others to reproduce your work.

aneesh-joshi commented 6 years ago

@uduse I meant source code readability. What you say makes sense, though I am not sure I completely understand you. Once the model is trained, it has:

  1. The keras trained model which include a.) weights b.) network topology c.) optimizer state
  2. MZ's metadata like a.) word-index dicts b.) ??

You are suggesting we save the keras model's weights as pickle as provided by keras' model.save and then use dill for the network topology and later wrap it in YAML to make it readable? (I am not too familiar with dill and YAML.) What about the MZ metadata?

If a user wants to use the saved model, they'll probably have to run a script or a function that collects these different saved entities, builds the network topology, loads the weights, and applies the word-dict translation.

Do I understand correctly?

uduse commented 6 years ago

The keras trained model which include a.) weights b.) network topology c.) optimizer state

This part is correct. What I mean is to save the weights to a file through keras.models.Model.save_weights and re-build the network topology (model structure) from the parameters. As long as the re-build process is deterministic, we can create a model that looks exactly the same as the original one, so it can load the weights file without any problem.

MZ's metadata like a.) word-index dicts b.) ??

It's not part of the model. Either those get their own save methods or the users have to handle them themselves.

You are suggesting we save the keras model's weights as pickle as provided by keras' model.save and then use dill for the network topology and later wrap it in YAML to make it readable? (I am not too familiar with dill and YAML.)

Currently we use dill for the network topology; it's not really the topology itself, though, but we can re-build the exact same topology from the same parameters. The parameter table is like a simple dictionary. If it contains only primitive data types, then it can be directly converted to a YAML file without problems, and when you edit the YAML file, the change will be reflected in the loading process as well (see the sketch below). The problem is that the parameter table doesn't only contain primitive data types; it can also hold things like the optimizer instance, lambdas, etc. It is definitely doable, it just needs more hacking.
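
A sketch of the primitive-types case, assuming PyYAML; non-primitive entries (optimizer instances, lambdas) would need extra handling:

import yaml

params = {'name': 'DSSM', 'hidden_units': 300, 'learning_rate': 0.01}

# Dump: the file is human-readable and hand-editable.
with open('params.yaml', 'w') as f:
    yaml.safe_dump(params, f)

# Load: edits made to the file flow straight back into model re-building.
with open('params.yaml') as f:
    params = yaml.safe_load(f)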

I recommend reading the code here.

bwanglzu commented 6 years ago

Update June 25, 2018

What has been done:

  1. We finished the new base_model.
  2. We finished the hyper-parameter tuner.
  3. We created ProcessorUnit (including several child processors) for data pre-processing (see the sketch after this list).
  4. We finished the DataPack implementation.
  5. We finished the DSSM model.
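
A minimal sketch of how such chained processor units could look; everything except the ProcessorUnit name is an illustrative assumption:

import abc

class ProcessorUnit(abc.ABC):
    """Hypothetical base unit: transforms one piece of text."""

    @abc.abstractmethod
    def transform(self, input_):
        ...

class LowercaseUnit(ProcessorUnit):
    def transform(self, input_):
        return input_.lower()

class TokenizeUnit(ProcessorUnit):
    def transform(self, input_):
        return input_.split()

def run_units(input_, units):
    # Apply units in order, feeding each output into the next unit.
    for unit in units:
        input_ = unit.transform(input_)
    return input_

tokens = run_units("When was Abraham Lincoln born?",
                   [LowercaseUnit(), TokenizeUnit()])
# ['when', 'was', 'abraham', 'lincoln', 'born?']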

bwanglzu commented 6 years ago

Update July 25, 2018

  1. We finalized all the ProcessorUnits required for DSSM.
  2. We defined the base_preprocessor class; all model-wise preprocessors will inherit from it.
  3. We designed the first version of base_generator; all generators will inherit from it.
  4. We created our first point-generator (see the sketch after this list).
  5. We're working on the first integration test: connecting all the components together, from pre-processing to model evaluation.
  6. We solved several small bugs.
  7. We keep working on documentation (both English & Chinese versions).
  8. We're experimenting with creating our website with github.io.
  9. We'll release the first quick-start and ask users for feedback next week.
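
A minimal sketch of what a point-wise generator could look like, using keras.utils.Sequence; the class name and constructor are illustrative assumptions:

import math
import numpy as np
from keras.utils import Sequence

class PointGenerator(Sequence):
    """Hypothetical point-wise generator: one (query, document, label) per sample."""

    def __init__(self, queries, documents, labels, batch_size=32):
        self.queries = np.asarray(queries)
        self.documents = np.asarray(documents)
        self.labels = np.asarray(labels)
        self.batch_size = batch_size

    def __len__(self):
        # Number of batches per epoch.
        return math.ceil(len(self.labels) / self.batch_size)

    def __getitem__(self, idx):
        # Slice out one batch of query/document pairs and their labels.
        s = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        return [self.queries[s], self.documents[s]], self.labels[s]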

bwanglzu commented 6 years ago

Update Aug 07, 2018

  1. We finished the point-generator.
  2. We're working on the pair-wise generator.
  3. We finished the DSSM integration test and a lot of bug fixes.
  4. We published MatchZoo on PyPI; now you're able to install MatchZoo using pip install matchzoo.
  5. We created a quick-start guide for users.

Try out the MatchZoo 2.0 API HERE!

bwanglzu commented 6 years ago

Update Aug 23, 2018

  1. We finished CDSSM.
  2. We finished the logger and metrics.
  3. Code refactoring on DataPack and the DSSM preprocessor.
  4. Adjusted the base model to work with the generator.
  5. Some fixes to documentation, setup, and type hints.
  6. Code refactoring on tasks.
  7. Example Jupyter notebook for the DSSM model.

We're working on:

  1. Re-implementing the pair-wise generator.
  2. Adjusting metrics.
  3. CDSSM preprocessor.
  4. Arc-I model, processor units, and preprocessor.
  5. loss.py.

bwanglzu commented 6 years ago

Update Sep 16 2018

  1. We finished the Arc-I model.
  2. We finished the pair-wise generator.
  3. We finished the list-wise generator (in review).
  4. We finished the Embedding class for loading word embeddings.
  5. Some bug fixes and dependency updates.
  6. We finished the ranking losses (in review; see the sketch after this list).
  7. We finished the integration tests for the pair-wise & list-wise generators.
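
For context, a pairwise hinge is a common choice for such ranking losses; a minimal Keras sketch, assuming batches interleave one positive and one negative document score per query (the batch layout and the margin of 1.0 are illustrative assumptions, not necessarily what MZ implements):

import keras.backend as K

def rank_hinge_loss(y_true, y_pred, margin=1.0):
    # Even rows: scores of relevant documents; odd rows: irrelevant ones.
    y_pos = y_pred[0::2]
    y_neg = y_pred[1::2]
    # Penalize whenever the negative score comes within `margin` of the positive.
    return K.mean(K.maximum(0.0, margin + y_neg - y_pos))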

We're working on:

  1. CDSSM preprocessor.
  2. CDSSM integration test.
  3. Arc-I preprocessor.
  4. Arc-I integration test.

bwanglzu commented 6 years ago

Update Oct 07 2018

  1. CDSSM preprocessor done.
  2. Arc-I preprocessor done.
  3. Fixed CI issues.
  4. Fixed docstring issues.
  5. Loss function refactoring.
  6. Base preprocessor refactoring.

Working on:

  1. Refactoring the dssm/cdssm/arci preprocessors.
  2. Updating metrics.
  3. Finishing the integration tests.
  4. Implementing the BiMPM model.
  5. Refactoring the generator API.

bwanglzu commented 6 years ago

Update Oct 12 2018

  1. CDSSM preprocessor refactor done & merged.
  2. Arc-I preprocessor refactor done & merged.
  3. Arc-I integration test done & merged.
  4. CDSSM integration test done, not merged.
  5. DSSM tutorial update, merged.
  6. Bug fix on the generator, merged.
  7. Base preprocessor refactor, merged.
  8. Metrics refactor in progress.
  9. BiMPM multi-perspective layer in progress.
  10. Some organizational stuff.

We're working on:

  1. Adding a CDSSM tutorial.
  2. Adding an Arc-I tutorial.
  3. Refactoring metrics.
  4. Creating the website.
  5. Some organizational issues.
  6. Parameter tuning for existing models.

uduse commented 5 years ago

Closed due to inactivity.