@ivoliv would you be able to give an exact example of the commands you ran and highlight the relevant output you got? I think that would help us better understand your issue.
Hi @schmmd , thanks for the reply.
The config file is almost identical to the version in the allennlp GitHub repo, with the following differences:
"train_data_path": './key-OntoNotesScheme_conll_small.txt',
"validation_data_path": './key-OntoNotesScheme_conll_small.txt',
"iterator": {
"type": "bucket",
"sorting_keys": [["text", "num_tokens"]],
"padding_noise": 0.0,
"batch_size": 10
},
and the train/validation data I'm using is a reformatted WikiCoref data file:
https://drive.google.com/file/d/15LNbpVPMh9op_ZQgevz3tgILKwuy_Lcu/view?usp=sharing
I reformatted WikiCoref to include the other fields needed to be compliant with the OntoNotes format. I also made sure there are multiple document indices, hoping that each document would represent one instance (more on that below).
Notice that I changed 'batch_size' to 10 hoping to affect the outcome, but it doesn't seem to change anything.
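For reference, my understanding is that this config block is equivalent to constructing the iterator in Python roughly like this (a sketch based on the AllenNLP API as I understand it, so double-check the names):

from allennlp.data.iterators import BucketIterator

# Mirrors the "iterator" section of the config above: bucket by the
# number of tokens in the "text" field, no padding noise, batches of 10.
iterator = BucketIterator(
    sorting_keys=[('text', 'num_tokens')],
    padding_noise=0.0,
    batch_size=10,
)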
Stepping through the code, the call that computes num_training_batches always returns 1, presumably because len(instances) is always 1. How do I change that?
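If I understand the iterator correctly, the batch count is just the instance count divided by the batch size, rounded up. A rough sketch of the arithmetic (not the actual AllenNLP code):

import math

def num_batches(num_instances, batch_size):
    # Roughly what the iterator computes: ceil(#instances / batch_size).
    return math.ceil(num_instances / batch_size)

print(num_batches(1, 10))   # 1 -- a single instance always yields one batch
print(num_batches(25, 10))  # 3

So if len(instances) really is 1, no 'batch_size' setting will ever produce more than one batch.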
I guess I'm not sure how a batch or an instance is defined. Is 1 instance = 1 document, and 1 batch = a collection of instances?
Thanks!
Sorry, forgot about the output side. Basically, when debugging I see that num_training_batches = 1, and in the output:
1it [00:00, 3.57it/s]
1it [00:00, 3.65it/s]
0%| | 0/1 [00:00<?, ?it/s]
That sure looks like there's only one instance in your data. Something is up with how you are reading your data, if you think there really is more than one instance in there.
How is an instance defined? Is an instance equal to a document, as per the OntoNotes format (the document ID in the first column)?
Note how the data specifies multiple documents:
Los_Angeles_Pierce_College_0 1 9 and XX * - - - - * * * -
Los_Angeles_Pierce_College_0 1 10 CSU XX * - - - - * * * -
Los_Angeles_Pierce_College_0 1 11 schools XX * - - - - * * * -
Los_Angeles_Pierce_College_0 1 12 . XX * - - - - * * * -
Los_Angeles_Pierce_College_1 2 1 Students XX * - - - - * * * -
Los_Angeles_Pierce_College_1 2 2 can XX * - - - - * * * -
Los_Angeles_Pierce_College_1 2 3 pursue XX * - - - - * * * -
Los_Angeles_Pierce_College_1 2 4 any XX * - - - - * * * -
Los_Angeles_Pierce_College_1 2 5 of XX * - - - - * * * -
Is this how one would define multiple instances?
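In case it helps, here's the throwaway script (nothing AllenNLP-specific, just my file name) I'm using to check that the first column really does contain more than one document ID:

doc_ids = set()
with open('key-OntoNotesScheme_conll_small.txt') as f:  # my reformatted file
    for line in f:
        parts = line.split()
        if parts and not line.startswith('#'):  # skip comments and blank lines
            doc_ids.add(parts[0])               # first column: document ID
print(len(doc_ids), 'distinct document IDs')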
I don't know enough about the format our coref dataset reader expects; @DeNeutoy knows more than I do about that. But I'd suggest going through the code and seeing how it's processing your data, to figure out what's going on. I'd probably start here: https://github.com/allenai/allennlp/blob/5512a8fbef6a95e84712a82959783e81b970e145/allennlp/data/dataset_readers/coreference_resolution/conll.py#L87-L103
This docstring might also be helpful: https://github.com/allenai/allennlp/blob/5512a8fbef6a95e84712a82959783e81b970e145/allennlp/data/dataset_readers/coreference_resolution/conll.py#L53-L57
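If memory serves, readers for this format split the file on "#end document" markers and yield one instance per document. Here's a minimal sketch of that shape (illustrative only, not the actual AllenNLP code, so verify against the links above):

from typing import Iterator, List

def read_documents(file_path: str) -> Iterator[List[List[str]]]:
    # Accumulate token rows until an "#end document" marker, then yield
    # the accumulated rows as one document (i.e. one training instance).
    document: List[List[str]] = []
    with open(file_path) as f:
        for line in f:
            line = line.strip()
            if line.startswith('#end document'):
                yield document
                document = []
            elif line and not line.startswith('#'):
                document.append(line.split())

If the reader never hits that marker, everything would end up lumped together, but again, step through the linked code to see what actually happens with your file.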
After going through that and seeing what's going on in your case, if you have more questions, feel free to come back and ask them.
Hi, thanks! Starting from conll.py and stepping through the code, I found the problem: I needed to add "#end document" markers within the files. That did it!
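For anyone hitting the same problem: each document in the file needs to be closed with its own "#end document" line, roughly like this (rows abbreviated):

Los_Angeles_Pierce_College_0 1 9 and XX * - - - - * * * -
...
Los_Angeles_Pierce_College_0 1 12 . XX * - - - - * * * -
#end document
Los_Angeles_Pierce_College_1 2 1 Students XX * - - - - * * * -
...
#end document

With the markers in place, each document becomes its own instance, and 'batch_size' behaves as expected.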
Thanks again.
Hi,
I've been struggling to get my AllenNLP coref model to adapt to smaller batch sizes.
I have been able to fine-tune the original model by further training on WikiCoref, but it always seems to produce 2 large batches with about half the new data in each, which isn't ideal for memory consumption and rules out the use of GPUs. I've set the iterator 'batch_size' option from 1 to larger values, but that doesn't seem to make a difference (I'm not sure why 1 would be the default). I've also tried indexing documents at a finer level, with no effect.
How do I control the batch size for this training task?
Thanks.