Describe the bug
Input tokens are truncated to fewer than 512 tokens to fit into BERT. When this happens, the mask associated with that sample is not also truncated. This causes an error to be thrown downstream when the mask is used; in my case, this happens with a BiDAF model running on SQuAD over BERT tokenization/embeddings.
The specific line that I think causes the issue is in wordpiece_indexer.py, here:
mask = [1 for _ in tokens]
To fix the mask size (and the downstream BiDAF matrix operations), I can replace that line with the following:
mask = [1 for _ in offsets]
However, I am not sure whether this fix is correct with respect to other parts of AllenNLP or in general.
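To illustrate the mismatch, here is a minimal standalone sketch of the suspected logic (this is not AllenNLP's actual indexer code; `index_tokens`, the two-pieces-per-word expansion, and the offset layout are simplifying assumptions that only mimic the pattern in wordpiece_indexer.py):

```python
# Sketch of the suspected bug: wordpieces are truncated to the model's
# limit, but the mask is built from the *original* token list.
# All names and numbers here are illustrative, not AllenNLP internals.
MAX_PIECES = 512

def index_tokens(tokens):
    # Assume each word expands to exactly two wordpieces for simplicity.
    wordpieces = [piece for tok in tokens for piece in (tok, tok + "##")]
    wordpieces = wordpieces[:MAX_PIECES]  # truncation to fit BERT

    # One offset per original token that *survives* truncation
    # (pointing at the last wordpiece of each word in this toy scheme).
    offsets = list(range(1, len(wordpieces), 2))

    buggy_mask = [1 for _ in tokens]   # length = pre-truncation token count
    fixed_mask = [1 for _ in offsets]  # length matches the offsets
    return wordpieces, offsets, buggy_mask, fixed_mask

tokens = [f"w{i}" for i in range(400)]  # long passage: 400 words -> 800 pieces
pieces, offsets, buggy, fixed = index_tokens(tokens)
print(len(pieces), len(offsets), len(buggy), len(fixed))
# The buggy mask keeps the pre-truncation length (400), while the
# offsets (and the recovered token embeddings) have only 256 entries.
```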
To Reproduce
Add the following test fixture to allennlp/tests/fixtures/data/squad.long_passage.jsonnet:
{
"data": [
{
"title": "Hunting",
"paragraphs": [
{
"context": "There is a very active tradition of hunting of small to medium-sized wild game in Trinidad and Tobago. Hunting is carried out with firearms, and aided by the use of hounds, with the illegal use of trap guns, trap cages and snare nets. With approximately 12,000 sport hunters applying for hunting licences in recent years (in a very small country of about the size of the state of Delaware at about 5128 square kilometers and 1.3 million inhabitants), there is some concern that the practice might not be sustainable. In addition there are at present no bag limits and the open season is comparatively very long (5 months - October to February inclusive). As such hunting pressure from legal hunters is very high. Added to that, there is a thriving and very lucrative black market for poached wild game (sold and enthusiastically purchased as expensive luxury delicacies) and the numbers of commercial poachers in operation is unknown but presumed to be fairly high. As a result, the populations of the five major mammalian game species (red-rumped agouti, lowland paca, nine-banded armadillo, collared peccary, and red brocket deer) are thought to be quite low (although scientifically conducted population studies are only just recently being conducted as of 2013). It appears that the red brocket deer population has been extirpated on Tobago as a result of over-hunting. Various herons, ducks, doves, the green iguana, the gold tegu, the spectacled caiman and the common opossum are also commonly hunted and poached. There is also some poaching of 'fully protected species', including red howler monkeys and capuchin monkeys, southern tamanduas, Brazilian porcupines, yellow-footed tortoises, Trinidad piping guans and even one of the national birds, the scarlet ibis. Legal hunters pay very small fees to obtain hunting licences and undergo no official basic conservation biology or hunting-ethics training. 
There is presumed to be relatively very little subsistence hunting in the country (with most hunting for either sport or commercial profit). The local wildlife management authority is under-staffed and under-funded, and as such very little in the way of enforcement is done to uphold existing wildlife management laws, with hunting occurring both in and out of season, and even in wildlife sanctuaries. There is some indication that the government is beginning to take the issue of wildlife management more seriously, with well drafted legislation being brought before Parliament in 2015. It remains to be seen if the drafted legislation will be fully adopted and financially supported by the current and future governments, and if the general populace will move towards a greater awareness of the importance of wildlife conservation and change the culture of wanton consumption to one of sustainable management.",
"qas": [
{
"answers": [
{
"answer_start": 254,
"text": "12,000"
}
],
"question": "Approximately how many sport hunters applied for hunting licences in recent years?",
"id": "57345f9c879d6814001ca57c"
},
{
"answers": [
{
"answer_start": 82,
"text": "Trinidad and Tobago"
}
],
"question": "Where is there a very active tradition of hunting of small to medium-sized wild game?",
"id": "57345f9c879d6814001ca57b"
},
{
"answers": [
{
"answer_start": 784,
"text": "poached wild game"
}
],
"question": "What is there a very lucrative and thriving black market for?",
"id": "57345f9c879d6814001ca57d"
},
{
"answers": [
{
"answer_start": 707,
"text": "high"
}
],
"question": "What is hunting pressure from?",
"id": "57345f9c879d6814001ca57e"
},
{
"answers": [
{
"answer_start": 36,
"text": "hunting of small to medium-sized wild game"
}
],
"question": "What very active tradition Trinidad and Tabago have?",
"id": "57363183012e2f140011a1fb"
},
{
"answers": [
{
"answer_start": 165,
"text": "hounds"
}
],
"question": "What animal aids in the hunting?",
"id": "57363183012e2f140011a1fc"
},
{
"answers": [
{
"answer_start": 1287,
"text": "red brocket deer"
}
],
"question": "What population has extirpated?",
"id": "57363183012e2f140011a1fd"
},
{
"answers": [
{
"answer_start": 1790,
"text": "very small fees"
}
],
"question": "What do hunters pay to obtain hunting license?",
"id": "57363183012e2f140011a1fe"
}
]
}
]
}
]
}
Then, run the following training config (e.g. from training_config/bidaf_bert.debug.long_passage.jsonnet):
Expected behavior
I expect no error to be thrown at this line in bidaf.py when encoding the passage using the BERT embeddings and the associated masks.
For this to happen, I believe each batch's passage mask size should match the size of the passage's token offsets.
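A toy illustration of why the two sizes must agree, with plain Python standing in for the tensor operations (the `apply_mask` helper is hypothetical; the real failure happens inside BiDAF's masked operations over the passage):

```python
# Illustrative only: apply a mask element-wise to a sequence, roughly
# what BiDAF does when encoding the masked passage.
def apply_mask(embeddings, mask):
    if len(embeddings) != len(mask):
        raise ValueError(
            f"mask length {len(mask)} != sequence length {len(embeddings)}"
        )
    return [e if m else 0.0 for e, m in zip(embeddings, mask)]

embeddings = [0.5] * 256   # one value per recovered token (i.e. per offset)
good_mask = [1] * 256      # mask built from the offsets
stale_mask = [1] * 400     # mask built from the pre-truncation tokens

apply_mask(embeddings, good_mask)          # works: sizes agree
try:
    apply_mask(embeddings, stale_mask)     # fails: size mismatch
except ValueError as err:
    print(err)
```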
System: