Hetling / NLP-second-year-project


Compare ways of chaining two models together #1

Closed: kparocki closed this issue 1 year ago

kparocki commented 1 year ago

We have three approaches to compare:

  1. Training two separate models: the first predicts, given a sentence with one masked word, whether the masked word is a named entity. The second, given the same input, runs only on the instances the first model predicted to be named entities and predicts the class of the entity (see the pipeline sketch after this list).

  2. One model with all the functionality combined, which predicts whether a sentence with a masked word contains a named entity of one of the 6 classes or no named entity at all (7 labels in total). The suspicion here is that two models, each good at its own task, will beat one jack of all trades, but that intuition might be wrong, so it requires a performance comparison.

  3. Training two models with shared weights up to a point where we fork them into separate heads that make separate predictions (each compared with its own loss function). The hope here is that adding the losses together will prevent overfitting, since the model has to do well at more than one task. This might be beneficial for the WNUT dataset, whose training and test sets are very different. However, it might also make the model worse at either task than a model trained separately for each. This has been implemented by @Malthehave in _simlpe_nn.ipynb. What should be investigated is the relative scale of the two loss functions: one is binary and the other multiclass, so they might produce values on different scales, like punishing the same embedding with -2 or -200 depending on the loss function. If that's the case, simply adding them might not make sense, and a different combination, such as weighting the terms, might increase performance (see the loss-weighting sketch after this list). Also, the code still needs performance metrics at the end, which are not implemented right now.
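For approach 1, here is a minimal sketch of how the two models could be chained at inference time, assuming a PyTorch setup. `binary_model`, `class_model`, the 0.5 threshold, and the label encoding (0 = not an entity, 1..6 = entity classes) are hypothetical stand-ins, not the project's actual code:

```python
# Sketch of approach 1 at inference time: the binary model runs first,
# and only instances it flags as entities ever reach the classifier.
# All names and the label encoding below are hypothetical.
import torch

def pipeline_predict(binary_model, class_model, x, threshold=0.5):
    """x: a batch of encoded masked-word sentences, shape (batch, dim)."""
    with torch.no_grad():
        probs = torch.sigmoid(binary_model(x)).squeeze(-1)  # (batch,)
        is_entity = probs >= threshold
        labels = torch.zeros(x.shape[0], dtype=torch.long)  # default: not an entity
        if is_entity.any():
            # The second model only ever sees predicted entities.
            class_logits = class_model(x[is_entity])
            # Shift by 1 so entity classes occupy labels 1..6 (illustrative).
            labels[is_entity] = class_logits.argmax(dim=-1) + 1
    return labels
```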
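For approach 3, here is a minimal sketch of a shared encoder forked into two heads, with the two losses weighted before summing, again assuming PyTorch. `SharedEncoderTwoHeads`, the layer sizes, and the weights `w_binary`/`w_multi` are illustrative, not taken from `_simlpe_nn.ipynb`:

```python
# Sketch of approach 3: a shared trunk forked into a binary head and a
# multiclass head, with a weighted sum of the two losses. All names,
# sizes, and weights are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedEncoderTwoHeads(nn.Module):
    def __init__(self, input_dim=768, hidden_dim=128, num_entity_classes=6):
        super().__init__()
        # Shared trunk: both tasks backpropagate through these weights.
        self.encoder = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        # Fork point: one binary head, one multiclass head.
        self.is_entity_head = nn.Linear(hidden_dim, 1)
        self.entity_class_head = nn.Linear(hidden_dim, num_entity_classes)

    def forward(self, x):
        h = self.encoder(x)
        return self.is_entity_head(h).squeeze(-1), self.entity_class_head(h)

def combined_loss(is_entity_logit, class_logits, is_entity_target, class_target,
                  w_binary=1.0, w_multi=1.0):
    # The binary and multiclass losses can live on different scales, so
    # each term gets its own weight; tuning these is one way to keep one
    # task from dominating the gradient.
    bce = F.binary_cross_entropy_with_logits(is_entity_logit, is_entity_target)
    ce = F.cross_entropy(class_logits, class_target)
    return w_binary * bce + w_multi * ce

# Hypothetical usage with random data:
model = SharedEncoderTwoHeads()
x = torch.randn(4, 768)
is_ent_logit, cls_logits = model(x)
loss = combined_loss(is_ent_logit, cls_logits,
                     is_entity_target=torch.tensor([1.0, 0.0, 1.0, 1.0]),
                     class_target=torch.tensor([2, 0, 5, 1]))
loss.backward()
```

Tuning `w_binary` and `w_multi` (or normalizing each loss by a running average of its magnitude) is one way to address the scale mismatch raised above; in a real training loop the multiclass term would presumably also be masked out for non-entity instances.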

kparocki commented 1 year ago

Three models are done - closing the issue