dmlc / gluon-nlp

NLP made easy
https://nlp.gluon.ai/
Apache License 2.0

[Discussion] Roadmap #654

Closed: szha closed this issue 4 years ago

szha commented 5 years ago

Hi,

Let's start a discussion here about the roadmap towards 0.10 and 1.0. If you have any item that you'd like to propose for the roadmap, please share it in this thread.

Features

The following features have been proposed for inclusion in GluonNLP 0.7.0 (subject to change):

Models

- Language modeling
- Word embedding
- NER
- Memory networks and transformer
- Visualization
- Quantization
- Multi-task/transfer learning
- Machine translation
- Tokenization
- Text classification
- Topic modeling
- Knowledge distillation

APIs

Scripts

Documentation

Demos

cc @dmlc/gluon-nlp-team

Related Projects

haven-jeon commented 5 years ago

#633 MTL (multi-task learning) is one of the most important topics in NLP. Implementing BigBird would be a good starting point for developing MTL in gluon-nlp.

I can make room for this topic.

Ishitori commented 5 years ago

Shall we consider the LAMB optimizer (https://arxiv.org/abs/1904.00962)?

@szha: #677
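For reference, a minimal NumPy sketch of the layer-wise update LAMB applies, following the paper; the function name and defaults here are illustrative, not a GluonNLP API:

```python
import numpy as np

def lamb_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-6, weight_decay=0.01):
    """One LAMB step for a single parameter tensor w (sketch)."""
    m = beta1 * m + (1 - beta1) * grad            # first moment, as in Adam
    v = beta2 * v + (1 - beta2) * grad * grad     # second moment
    m_hat = m / (1 - beta1 ** t)                  # bias correction
    v_hat = v / (1 - beta2 ** t)
    update = m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w
    # Layer-wise trust ratio: scale the step by ||w|| / ||update||.
    w_norm, u_norm = np.linalg.norm(w), np.linalg.norm(update)
    trust = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0
    return w - lr * trust * update, m, v
```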

szha commented 5 years ago

#657 BERT for XNLI/NMT. @fhieber and team have expressed interest in this.

vanewu commented 5 years ago

By now, GluonNLP has a fairly complete set of components, and there are many task-specific models for particular NLP problems. Can we also provide some standard classic models? Anyone with a related task could then call such a model directly for a quick experiment.
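Something in the spirit of the existing `get_model` entry point could serve this. A hedged sketch; the model name and signature follow the GluonNLP 0.x style and may differ by version:

```python
import gluonnlp as nlp

# Sketch: fetch a standard pretrained language model plus its vocabulary
# in one call (gluonnlp 0.x-style API; exact names may vary by version).
model, vocab = nlp.model.get_model('standard_lstm_lm_200',
                                   dataset_name='wikitext-2',
                                   pretrained=True)
```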

szhengac commented 5 years ago

A distributed training module may be a good feature, as many current SOTA NLP models typically require a lot of GPUs. pytorch/fairseq also supports distributed training across multiple machines.
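One possible shape for this is a minimal sketch of multi-machine data parallelism using MXNet's parameter-server kvstore (Horovod would be another backing). This assumes the script is launched once per worker via MXNet's distributed launcher; it will not synchronize when run standalone:

```python
import mxnet as mx
from mxnet import gluon, autograd

net = gluon.nn.Dense(2)
net.initialize(ctx=mx.cpu())
# 'dist_sync' synchronizes gradients across machines on every step.
trainer = gluon.Trainer(net.collect_params(), 'adam',
                        {'learning_rate': 1e-3}, kvstore='dist_sync')
loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()

x, y = mx.nd.random.uniform(shape=(8, 4)), mx.nd.zeros((8,))
with autograd.record():
    loss = loss_fn(net(x), y)
loss.backward()
trainer.step(batch_size=8)
```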

fierceX commented 5 years ago

I think we can add some text matching models (#616).

sravanbabuiitm commented 5 years ago

  1. We can look at adding HAN (Hierarchical Attention Networks for document classification: https://www.cs.cmu.edu/~./hovy/papers/16HLT-hierarchical-attention-networks.pdf) to the text classification scripts; a sketch of the attention pooling follows this list.
  2. We can add the HAN model and fastText models to the d2l.ai book.
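A minimal NumPy sketch of HAN-style attention pooling over word encodings for one sentence, following Yang et al. (2016); shapes and names are illustrative, not an existing GluonNLP API:

```python
import numpy as np

def attention_pool(H, W, b, u_w):
    """H: (T, d) word encodings; W: (d, d); b: (d,); u_w: (d,) context vector."""
    U = np.tanh(H @ W + b)               # (T, d) hidden representations
    scores = U @ u_w                     # (T,) similarity with context vector
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                 # softmax over words
    return alpha @ H                     # (d,) attention-weighted sentence vector

T, d = 5, 8
rng = np.random.default_rng(0)
s = attention_pool(rng.normal(size=(T, d)), rng.normal(size=(d, d)),
                   np.zeros(d), rng.normal(size=d))
```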
markusdr commented 5 years ago

JSON/YAML config files that specify models and experiments would be great. That's a very popular feature of AllenNLP and other toolkits.

https://github.com/dmlc/gluon-nlp/issues/392

For an example AllenNLP config file, see https://github.com/allenai/allennlp/blob/master/tutorials/tagger/experiment.jsonnet

See also their tutorial: https://github.com/allenai/allennlp/blob/master/tutorials/tagger/README.md#using-config-files

Quote:

This means that most of your experiment can be specified declaratively in a separate configuration file, which serves as a record of exactly what experiments you ran with which parameters. Now you can change various aspects of your model without writing any code. For instance, if you wanted to use a GRU instead of an LSTM, you'd just need to change the appropriate entry in the configuration file.
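A minimal sketch of what config-driven construction could look like in GluonNLP; the config schema below is invented for illustration, not an existing format:

```python
import json
from mxnet import gluon

config = json.loads("""
{
  "encoder": {"type": "lstm", "hidden_size": 256, "num_layers": 2}
}
""")

def build_encoder(cfg):
    # Swapping "lstm" for "gru" in the config file changes the model
    # without touching any code.
    rnn_types = {"lstm": gluon.rnn.LSTM, "gru": gluon.rnn.GRU}
    cls = rnn_types[cfg["type"]]
    return cls(cfg["hidden_size"], num_layers=cfg["num_layers"])

encoder = build_encoder(config["encoder"])
```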

eric-haibin-lin commented 5 years ago

Add "Deep Relevance Matching Model" for relevance matching Here is one implementation in keras: https://github.com/sebastian-hofstaetter/neural-ranking-drmm

eric-haibin-lin commented 5 years ago

Add Simple Recurrent Unit (SRU) https://arxiv.org/pdf/1709.02755.pdf
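For reference, a minimal NumPy sketch of the single-layer SRU recurrence from the paper; names and shapes are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sru_forward(X, W, Wf, bf, Wr, br):
    """X: (T, d) inputs; W, Wf, Wr: (d, d) weights; bf, br: (d,) biases."""
    T, d = X.shape
    c = np.zeros(d)
    H = np.empty((T, d))
    for t in range(T):
        x = X[t]
        x_tilde = W @ x
        f = sigmoid(Wf @ x + bf)             # forget gate: no h_{t-1} term, so
        r = sigmoid(Wr @ x + br)             # all matmuls can batch over time
        c = f * c + (1 - f) * x_tilde        # light recurrence (elementwise only)
        H[t] = r * np.tanh(c) + (1 - r) * x  # highway connection to the input
    return H
```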

eric-haibin-lin commented 5 years ago

ERNIE: BERT with Knowledge pretraining: https://arxiv.org/pdf/1904.09223.pdf

eric-haibin-lin commented 4 years ago

- Topic models like LDA
- Lemmatization and stemming for preprocessing
- Expose a sparse n-gram or BOW representation for sentences/documents to the user (sketch below)

https://github.com/dmlc/gluon-nlp/issues/822
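For the last item, a minimal sketch of a sparse n-gram/BOW representation using a plain `Counter` as the sparse container; illustrative, not a GluonNLP API:

```python
from collections import Counter

def bow_ngrams(tokens, n_max=2):
    """Count all n-grams up to length n_max as a sparse bag of words."""
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

print(bow_ngrams("the cat sat on the mat".split()))
```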

eric-haibin-lin commented 4 years ago

MASS: Masked Sequence to Sequence Pre-training for Language Generation https://arxiv.org/pdf/1905.02450.pdf

eric-haibin-lin commented 4 years ago

Pay Less Attention With Lightweight and Dynamic Convolutions https://arxiv.org/pdf/1901.10430.pdf

eric-haibin-lin commented 4 years ago

Poincaré Word Embedding: https://arxiv.org/pdf/1705.08039.pdf
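For reference, the Poincaré distance these embeddings optimize (Nickel & Kiela), as a minimal NumPy sketch; points live in the open unit ball, and distances grow rapidly near the boundary:

```python
import numpy as np

def poincare_distance(u, v):
    """Hyperbolic distance between points u, v in the open unit ball."""
    diff = np.sum((u - v) ** 2)
    uu, vv = np.sum(u ** 2), np.sum(v ** 2)
    x = 1.0 + 2.0 * diff / ((1.0 - uu) * (1.0 - vv))
    return np.arccosh(x)

print(poincare_distance(np.array([0.1, 0.0]), np.array([0.0, 0.5])))
```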

eric-haibin-lin commented 4 years ago

GPT-2 Training from Scratch

eric-haibin-lin commented 4 years ago

Transformer/BERT visualization:

- [arXiv 2019] Visualizing and Measuring the Geometry of BERT
- [ICML 2017] Understanding Black-box Predictions via Influence Functions

eric-haibin-lin commented 4 years ago

Memory networks:

sundeepteki commented 4 years ago

eric-haibin-lin commented 4 years ago

Adaptive softmax and adaptive embedding https://arxiv.org/pdf/1809.10853.pdf

leezu commented 4 years ago

Adaptive softmax and adaptive embedding are already part of scripts/language_model. Let's clarify the roadmap item as: move this support into the main API.
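For context, a minimal NumPy sketch of the adaptive embedding idea from the paper: frequent tokens get full-size vectors, rarer bands get smaller ones projected up to d. Band cutoffs and names are illustrative; the scripts/language_model version is the reference here:

```python
import numpy as np

class AdaptiveEmbedding:
    def __init__(self, cutoffs, d, factor=4, seed=0):
        rng = np.random.default_rng(seed)
        self.cutoffs = cutoffs                  # e.g. [0, 100, 1000, vocab_size]
        self.tables, self.projs = [], []
        for i in range(len(cutoffs) - 1):
            size = cutoffs[i + 1] - cutoffs[i]
            d_i = d // (factor ** i)            # shrink dims for rarer bands
            self.tables.append(rng.normal(scale=0.02, size=(size, d_i)))
            self.projs.append(rng.normal(scale=0.02, size=(d_i, d)))

    def __call__(self, ids):
        out = np.zeros((len(ids), self.projs[0].shape[1]))
        for i in range(len(self.tables)):
            lo, hi = self.cutoffs[i], self.cutoffs[i + 1]
            mask = (ids >= lo) & (ids < hi)
            if mask.any():
                # Look up in the band's small table, then project to d dims.
                out[mask] = self.tables[i][ids[mask] - lo] @ self.projs[i]
        return out

emb = AdaptiveEmbedding([0, 100, 1000, 10000], d=64)
vecs = emb(np.array([3, 250, 9000]))
```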

eric-haibin-lin commented 4 years ago

Pointer mechanism for attention: https://github.com/dmlc/gluon-nlp/issues/951
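A minimal sketch of a pointer/copy mixture in the pointer-generator style (See et al., 2017), assuming precomputed attention weights and a generation gate; names are illustrative:

```python
import numpy as np

def pointer_mixture(p_vocab, attn, src_ids, p_gen):
    """p_vocab: (V,) softmax over vocab; attn: (S,) attention over source;
    src_ids: (S,) vocab ids of source tokens; p_gen: generation gate in [0, 1]."""
    final = p_gen * p_vocab
    # Scatter-add copy mass onto the vocab ids of the source tokens
    # (np.add.at handles repeated source ids correctly).
    np.add.at(final, src_ids, (1.0 - p_gen) * attn)
    return final

V = 10
dist = pointer_mixture(np.full(V, 1.0 / V),
                       np.array([0.7, 0.2, 0.1]),
                       np.array([2, 5, 2]), p_gen=0.8)
assert np.isclose(dist.sum(), 1.0)
```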

eric-haibin-lin commented 4 years ago

ALBERT: https://github.com/google-research/google-research/tree/master/albert

eric-haibin-lin commented 4 years ago

XLM-Roberta: https://github.com/pytorch/fairseq/tree/master/examples/xlmr

fierceX commented 4 years ago

I can try TinyBERT.

szha commented 4 years ago

We are working on the NumPy version of GluonNLP and will adjust the positioning of this package accordingly. We will consider the roadmap items here in the related areas, as well as more recent research.