dwadden / dygiepp

Span-based system for named entity, relation, and event extraction.
MIT License

Documentation for multi-dataset training #87

Closed stepgazaille closed 2 years ago

stepgazaille commented 2 years ago

Hello!

First of all, thank you so much for publishing this code base. It's a great contribution!

I'm curious about the multi-dataset training feature of DyGIE++.

The model.md file explains how to read evaluation results for models trained on multiple datasets, but unfortunately the corresponding section in config.md is still a TODO.

I've been poking around the code base, but so far I couldn't find any clues about how to write a configuration file to train a model on multiple datasets (I must admit that although I'm getting more and more familiar with allennlp, I'm not really what you'd call a "power user" yet). I also looked through the list of merged PRs and commits and couldn't find anything related.

Is multi-dataset training really integrated into DyGIE++? If it is, could anyone suggest a good starting point for learning about this feature? An example configuration file would be amazing, but if no one has that, a reference to a commit or PR would also be a great help.

Thank you for your help!

dwadden commented 2 years ago

Hi,

Multi-dataset training should definitely be possible, apologies if it's not well-documented. I'll look into this over the weekend.

Dave

stepgazaille commented 2 years ago

No worries! Thank you for the swift answer and for taking some time to help. Have a good one, Steph

dwadden commented 2 years ago

OK, I took a look.

The relevant section of the docs is here, which is pretty sparse, and you're right that there's no example.

Fortunately, I don't think you actually have to change the config at all. Just create your dataset in jsonl format, as described in data.md, making sure to specify a dataset field for each instance indicating which dataset that instance is part of.
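
For concreteness, here's a rough sketch of what a couple of merged instances might look like. The field names follow data.md, but the doc keys, sentences, and labels below are invented, so double-check them against the format spec:

  {"doc_key": "a-doc-01", "dataset": "dataset-a", "sentences": [["Barack", "Obama", "visited", "Paris", "."]], "ner": [[[0, 1, "PER"], [3, 3, "LOC"]]]}
  {"doc_key": "b-doc-01", "dataset": "dataset-b", "sentences": [["The", "company", "was", "acquired", "."]], "ner": [[[1, 1, "ORG"]]]}

Each line is one document, so merging the per-dataset files amounts to concatenating them.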

The model will take care of the rest. Let me know if this makes sense. I'll try to clarify the docs at some point - or, if you're willing, feel free to submit a PR with an update to the docs and I'll merge.

stepgazaille commented 2 years ago

Hello David,

Thank you so much for taking time over the weekend to help me with this, it's really appreciated. After reading your last message and re-reading the data.md doc, what I understand is the following: if I want to train a model on datasets A and B, I have to merge both datasets' training sets into a single jsonl file (and the same goes for the datasets' validation and test sets). So the model's config file stays the same; I just need to update it to use the merged dataset's jsonl files. Is this correct?

dwadden commented 2 years ago

Yep, that's all that should be necessary. The model should do the right thing, including computing different metrics for the different datasets. If that doesn't happen, post here and we can debug.

stepgazaille commented 2 years ago

Ok, so I tested with 2 datasets today. Let's call them datasets A and B. Dataset A has labels for events, ner, relation, and coref. Dataset B has labels for events, ner, and coref. The target task is events. I previously trained models on those datasets independently without issues. Instances from dataset A use the dataset label dataset-a, and instances from dataset B use the dataset label dataset-b. I merged the datasets into a single set of train, valid, and test jsonl files. I loaded the merged dataset into instances of dygie.data.dataset_readers.document using the provided notebook and didn't notice anything wrong. Nothing special on the configuration side. Here it is, actually:

  data_paths: {
    train: 'data/merged/train.jsonl',
    validation: 'data/merged/valid.jsonl',
    test: 'data/merged/test.jsonl',
  },
  loss_weights: {
    ner: 0.5,
    relation: 0.5,
    coref: 0.5,
    events: 1.0
  },
  model +: {
    modules +: {
      coref +: {
        coref_prop: 0
      }
    },
  },
  target_task: "events",
  max_span_width: 12

Do you see anything wrong with the information above? When I launch training, I get the following exception at the first processed instance (which might come from dataset A or B, depending on the run):

'dataset-a__trigger_labels'
  File "dygiepp/dygie/models/events.py", line 273, in _compute_trigger_scores
    trigger_scorer = self._trigger_scorers[self._active_namespaces["trigger"]]
  File "dygiepp/dygie/models/events.py", line 137, in forward
    trigger_embeddings, trigger_mask)
  File "dygiepp/dygie/models/dygie.py", line 282, in forward
    ner_labels, metadata)
  File "dygiepp/scripts/train.py", line 28, in <module>
dwadden commented 2 years ago

Sorry for the slow response, I had a deadline last week. What you're doing looks reasonable. Can you attach your training config and data here, and provide the command used to kick off model training? I'll attempt to reproduce the error.

stepgazaille commented 2 years ago

Hello David, Sorry for the delay; I had to discuss with my supervisors before I could move forward with this. I was allowed to send you samples of the datasets by email. Would it work for you if I sent those to the uni email I can see on your youtube profile page? Thanks again, and apologies for the inconvenience, but my hands are tied on this :/

dwadden commented 2 years ago

Sure, you can send them to dwadden@cs.washington.edu, I won't re-share. I think you meant my GitHub profile rather than my YouTube profile (I don't think I have a YouTube profile)?

No worries, I'm fairly confident this is a bug on my end and it will be good to get it fixed.

stepgazaille commented 2 years ago

Hahaha, yes, I did mean GitHub and not YouTube. Ok, I sent the bug replication data to your @cs.washington.edu email address. Thanks again for taking time to look into this!

dwadden commented 2 years ago

This should be fixed now. Give it a try and let me know what happens.

stepgazaille commented 2 years ago

Ok, so I pulled the latest version and it looks like the problem is solved! Thank you.

One last question: is there a way to perform model selection using one particular dataset/task combination, instead of using the average over all datasets for a particular task? For example, I'm training a model for event extraction over 2 datasets, but I'd like to select the model that performed best on one of those 2 datasets, because I think the model might start overfitting on one dataset before the other. Is this model selection feature already implemented in DyGIE++ by any chance?

dwadden commented 2 years ago

OK, glad it worked!

There is a way to do this by modifying the training config. The line you'd need to update from the config is here. I think you can update just a single field using +: syntax, like here, but I'm rusty so you should probably double-check the DyGIE docs or the jsonnet documentation to make sure.
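
Something along these lines might work. This is an untested sketch that assumes your config extends template.libsonnet the way the example configs do, and dataset-a__arg_class_f1 is only a guess at the metric name, so copy the exact name from the per-dataset metrics printed during training:

  local template = import "template.libsonnet";

  template.DyGIE {
    // ... the rest of your existing config (data_paths, loss_weights, target_task, etc.) ...

    // Select checkpoints on dataset A's event metric only, instead of the
    // cross-dataset mean. The "+" prefix tells allennlp that higher is better.
    // "dataset-a__arg_class_f1" is a guess; use whichever per-dataset metric
    // appears in your training logs.
    trainer +: {
      validation_metric: "+dataset-a__arg_class_f1"
    }
  }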

If you want to display a different set of metrics during model training, here's the line you'd need to change. But this is just aesthetic; it doesn't influence model validation behavior.

stepgazaille commented 2 years ago

Great! Thank you so much for everything, David. Have a good one.

dwadden commented 2 years ago

Happy to help!