dmlc / gluon-nlp

NLP made easy
https://nlp.gluon.ai/
Apache License 2.0

Add assert for doc_stride, max_seq_length and max_query_length #1587

Open bartekkuncer opened 2 years ago

bartekkuncer commented 2 years ago

Description

This change adds an assert on the relation between doc_stride, max_seq_length and max_query_length (args.doc_stride <= args.max_seq_length - args.max_query_length - 3). Careless settings of these arguments can cause data loss when chunking input features and, ultimately, significantly lower accuracy.

Example

Without the assert, when one sets max_seq_length to e.g. 128 and keeps the default value of 128 for doc_stride, this happens for the input feature with qas_id == "572fe53104bcaa1900d76e6b" when running bash ~/gluon-nlp/scripts/question_answering/commands/run_squad2_uncased_bert_base.sh:

image

As you can see, we lose some of the context_tokens_ids (in the red rectangle) because they are not included in any of the ChunkFeatures: doc_stride is too high relative to max_seq_length, and the user is not notified, not even with a simple warning. This can lead to a significant accuracy drop, as this kind of data loss happens for every input feature that does not fit entirely into a single chunk.
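To make the failure mode concrete, here is a minimal sketch of a sliding-window chunker. It is a simplified model, not the actual gluon-nlp chunking code: the function name, the 300-token context and the query length of 10 are hypothetical, and the 3 reserved positions are assumed to be the special tokens around the query and the context.

```python
def uncovered_token_positions(num_context_tokens, query_len, max_seq_length, doc_stride):
    """Return context token positions that no chunk covers (simplified model)."""
    if doc_stride <= 0:
        raise ValueError("doc_stride must be positive")
    # Room left for context tokens once the query and 3 special tokens
    # (assumed here to be [CLS], [SEP], [SEP]) are accounted for.
    chunk_len = max_seq_length - query_len - 3
    if chunk_len <= 0:
        raise ValueError("max_seq_length leaves no room for context tokens")

    covered = set()
    start = 0
    while start < num_context_tokens:
        end = min(start + chunk_len, num_context_tokens)
        covered.update(range(start, end))
        # The next chunk starts doc_stride tokens later. If doc_stride is
        # larger than chunk_len, the tokens in between are never covered.
        start += doc_stride
    return [i for i in range(num_context_tokens) if i not in covered]


# Hypothetical input: 300 context tokens, 10 query tokens.
lost = uncovered_token_positions(300, 10, max_seq_length=128, doc_stride=128)
safe = uncovered_token_positions(300, 10, max_seq_length=128, doc_stride=32)
print(len(lost), lost[:3])  # 26 [115, 116, 117]
print(safe)                 # []
```

With these values each chunk holds 115 context tokens but the window advances by 128, so 13 tokens are skipped at every chunk boundary. With doc_stride set to 32 consecutive chunks overlap and nothing is dropped.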

This change introduces an assert that fires when data loss is possible, forcing the user to set proper, safe values for doc_stride, max_seq_length and max_query_length.
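The check can be sketched roughly as follows. The condition is the one given in the description. The helper name check_chunking_args, the message wording and the example argument values are hypothetical and are not taken from the actual diff.

```python
import argparse


def check_chunking_args(args):
    """Fail early if doc_stride could skip context tokens during chunking."""
    # Largest stride that still lets consecutive chunks touch or overlap,
    # using the relation from the description (3 positions are reserved).
    max_safe_stride = args.max_seq_length - args.max_query_length - 3
    # Hypothetical message text; the real one is shown in the screenshot.
    assert args.doc_stride <= max_safe_stride, (
        "Possible data loss: doc_stride ({}) must be <= max_seq_length ({}) "
        "- max_query_length ({}) - 3 = {}".format(
            args.doc_stride, args.max_seq_length,
            args.max_query_length, max_safe_stride))


# Hypothetical values: passes, since 32 <= 128 - 64 - 3 = 61.
check_chunking_args(argparse.Namespace(
    doc_stride=32, max_seq_length=128, max_query_length=64))

# Fails, since 128 > 61.
try:
    check_chunking_args(argparse.Namespace(
        doc_stride=128, max_seq_length=128, max_query_length=64))
except AssertionError as err:
    print(err)
```

Note that Python skips assert statements when run with the -O flag, so a script that must always enforce this could raise ValueError instead.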

Error message

image

Chunk from the example above with doc_stride reduced to 32

image

As you can see, when the values of doc_stride, max_seq_length and max_query_length satisfy the inequality above, no data is lost during chunking and we avoid the accuracy loss.

cc @dmlc/gluon-nlp-team