allenai / allennlp

An open-source NLP research library, built on PyTorch.
http://www.allennlp.org
Apache License 2.0

How are batches created for coreference training #1940

Closed · ivoliv closed 6 years ago

ivoliv commented 6 years ago

Hi,

I've been struggling to get my AllenNLP coref model to adapt to smaller batch sizes.

I have been able to fine-tune the original model by further training on WikiCoref, but training always seems to produce 2 large batches with about half the new data in each, which isn't ideal for memory consumption and rules out using a GPU. I've set the iterator's 'batch_size' option from 1 to larger values, but that doesn't seem to make a difference (I'm not sure why 1 would be the default). I've also tried indexing documents at a finer level, with no effect.

How do I control the batch size for this training task?

Thanks.

schmmd commented 6 years ago

@ivoliv would you be able to give some exact examples of the commands you ran and highlight the relevant output you got? I think that would help us better understand your issue.

ivoliv commented 6 years ago

Hi @schmmd , thanks for the reply.

The config file is almost identical to the version on the AllenNLP GitHub, with the following differences:

  "train_data_path": './key-OntoNotesScheme_conll_small.txt',
  "validation_data_path": './key-OntoNotesScheme_conll_small.txt',

  "iterator": {
    "type": "bucket",
    "sorting_keys": [["text", "num_tokens"]],
    "padding_noise": 0.0,
    "batch_size": 10
  },

and the train/validation data I'm using is a reformatted WikiCoref data file:

https://drive.google.com/file/d/15LNbpVPMh9op_ZQgevz3tgILKwuy_Lcu/view?usp=sharing

I reformatted WikiCoref to include the other fields needed to be compliant with the OntoNotes format. I also made sure there are multiple document indices, hoping that each document would represent one instance (more on that below).

Notice that I changed 'batch_size' to 10 hoping to affect the outcome, but it still doesn't seem to change anything:

In the code, it seems that this line always returns num_training_batches=1, presumably because len(instances) is always 1? How do I change that?
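
For reference, here is the arithmetic I would expect, as a minimal sketch assuming batches are just slices of the instance list (names here are mine and illustrative, not the actual AllenNLP code):

import math

# Illustrative only: if every document becomes a single instance, then a
# one-document dataset yields exactly one batch no matter what batch_size is.
def expected_num_batches(num_instances: int, batch_size: int) -> int:
    return math.ceil(num_instances / batch_size)

print(expected_num_batches(1, 10))   # 1 -> what I'm seeing
print(expected_num_batches(50, 10))  # 5 -> what I'd expect with 50 documents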

I guess I'm not sure how a batch or instance is defined. Is 1 instance = 1 document? 1 batch = collection of instances?

Thanks!

ivoliv commented 6 years ago

Sorry, forgot about the output side. Basically, when debugging I see that num_training_batches = 1, and in the output:

1it [00:00,  3.57it/s]
1it [00:00,  3.65it/s]
  0%|          | 0/1 [00:00<?, ?it/s]
matt-gardner commented 6 years ago

That sure looks like there's only one instance in your data. Something is up with how you are reading your data, if you think there really is more than one instance in there.

ivoliv commented 6 years ago

How is an instance defined? Is an instance equal to a document, as per the OntoNotes format (second column)?

ivoliv commented 6 years ago

Note how the data specifies multiple documents:

Los_Angeles_Pierce_College_0 1 9 and XX * - - - - * * * -
Los_Angeles_Pierce_College_0 1 10 CSU XX * - - - - * * * -
Los_Angeles_Pierce_College_0 1 11 schools XX * - - - - * * * -
Los_Angeles_Pierce_College_0 1 12 . XX * - - - - * * * -

Los_Angeles_Pierce_College_1 2 1 Students XX * - - - - * * * -
Los_Angeles_Pierce_College_1 2 2 can XX * - - - - * * * -
Los_Angeles_Pierce_College_1 2 3 pursue XX * - - - - * * * -
Los_Angeles_Pierce_College_1 2 4 any XX * - - - - * * * -
Los_Angeles_Pierce_College_1 2 5 of XX * - - - - * * * -

Is this how one would define multiple instances?

matt-gardner commented 6 years ago

I don't know enough about how our coref dataset reader expects its data to be formatted; @DeNeutoy knows more about that than I do. But I'd suggest going through the code and seeing how it's processing your data, to figure out what's going on. I'd probably start here: https://github.com/allenai/allennlp/blob/5512a8fbef6a95e84712a82959783e81b970e145/allennlp/data/dataset_readers/coreference_resolution/conll.py#L87-L103

This docstring might also be helpful: https://github.com/allenai/allennlp/blob/5512a8fbef6a95e84712a82959783e81b970e145/allennlp/data/dataset_readers/coreference_resolution/conll.py#L53-L57
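
If I remember the gist of that docstring correctly, the reader yields one Instance per document. Roughly, as a simplified sketch (names are mine, not the actual AllenNLP source, so verify against the linked code):

from typing import Iterator, List

def read_documents(file_path: str) -> Iterator[List[str]]:
    # Sketch only: collect token rows until a "#end document" marker, then
    # emit the document. One document becomes one Instance, so
    # len(instances) equals the number of "#end document" markers found.
    document: List[str] = []
    with open(file_path) as conll_file:
        for line in conll_file:
            line = line.strip()
            if line.startswith("#end document"):
                yield document
                document = []
            elif line and not line.startswith("#"):
                document.append(line)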

After going through that and seeing what's going on in your case, if you have more questions, feel free to come back and ask them.

ivoliv commented 6 years ago

Hi, thanks! Starting from conll.py and stepping through the code, I found this:

https://github.com/allenai/allennlp/blob/5512a8fbef6a95e84712a82959783e81b970e145/allennlp/data/dataset_readers/dataset_utils/ontonotes.py#L219

I needed to make sure to add "#end document" markers within the files. That did it!
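
For future readers, the layout that ended up working looks roughly like the sketch below (the "#begin document" header follows the CoNLL-2012 convention; as far as I can tell it's the "#end document" line that actually triggers the document split, so treat the header syntax as illustrative):

#begin document (Los_Angeles_Pierce_College_0); part 000
Los_Angeles_Pierce_College_0 1 9 and XX * - - - - * * * -
Los_Angeles_Pierce_College_0 1 10 CSU XX * - - - - * * * -
#end document
#begin document (Los_Angeles_Pierce_College_1); part 000
Los_Angeles_Pierce_College_1 2 1 Students XX * - - - - * * * -
Los_Angeles_Pierce_College_1 2 2 can XX * - - - - * * * -
#end document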

Thanks again.