facebookresearch / GENRE

Autoregressive Entity Retrieval

HuggingFace Model Differences #7

Closed antonyscerri closed 3 years ago

antonyscerri commented 3 years ago

Hi

After spending some time comparing the outputs of the fairseq (FS) and HuggingFace (HF) models, a couple of things have come to light. Probably the most significant is that the HF model config has a parameter whose default value significantly impacts the output: "no_repeat_ngram_size" is set to 3, which makes the model try to avoid repetitions, and it should ideally be set to zero (which is comparable to the FS setup). Without this, HF can produce bad/invalid output (some of which can cause RuntimeExceptions to be thrown in the decoder). The parameter can be supplied in the model.generate call (no_repeat_ngram_size=0); the utils methods would need to be modified to pass it through as well.
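For reference, a minimal sketch of overriding the default (the checkpoint path and example sentence are placeholders, not the actual GENRE download names):

```python
from transformers import BartTokenizer, BartForConditionalGeneration

# Placeholder path: substitute the directory of the downloaded HF GENRE model.
model = BartForConditionalGeneration.from_pretrained("path/to/hf_model").eval()
tokenizer = BartTokenizer.from_pretrained("path/to/hf_model")

inputs = tokenizer(
    ["Einstein was a [START_ENT] German [END_ENT] physicist."],
    return_tensors="pt",
)
outputs = model.generate(
    **inputs,
    num_beams=5,
    no_repeat_ngram_size=0,  # override the HF config default of 3 to match FS
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```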

Another difference is the minimum (less of an issue) and maximum generation length limits. By default, HF uses 20 in the generate method but overrides it to 62 in the model config, rather than the 200 used in FS. With respect to the examples given, you can override this when calling sample, but for things like annotation span generation it needs setting in the utils functions.
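Concretely, that just means passing the length cap explicitly in the same generate call (see my follow-up comment below for a small correction to this value):

```python
outputs = model.generate(
    **inputs,
    num_beams=5,
    max_length=200,          # match the fairseq limit instead of HF's 20/62
    no_repeat_ngram_size=0,
)
```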

One difference that remains seems to be the handling of whitespace before commas by the encoder/decoder pairs of the two models, which results in FS producing sequences with a space before commas while HF produces no space.

On a side note, anyone using HF may need to implement their own batching to decode multiple sentences (if using the get_entity_spans_hf type methods, this could be implemented within them too), as well as truncate input sentence lengths to avoid generating too long an input sequence; a rough sketch follows below.
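Something along these lines works for me; the helper name, batch size, and character limit here are all illustrative, not anything from the repo:

```python
def generate_batched(model, tokenizer, sentences, batch_size=8, max_chars=255):
    """Batched HF generation with naive character-level input truncation."""
    results = []
    for i in range(0, len(sentences), batch_size):
        # Truncate each input before tokenization to bound the sequence length.
        batch = [s[:max_chars] for s in sentences[i : i + batch_size]]
        inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True)
        outputs = model.generate(
            **inputs,
            num_beams=5,
            max_length=200,
            no_repeat_ngram_size=0,
        )
        results.extend(tokenizer.batch_decode(outputs, skip_special_tokens=True))
    return results
```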

Hopefully that will let people use either setup with nearly comparable results.

Thanks

Tony

antonyscerri commented 3 years ago

Hi

I've now found the cause of a number of the other remaining differences (whitespace and the subsequent offset differences). This is due to the tokenizer in HuggingFace having a default behaviour that is not wanted if you need output identical to FS. When calling it to decode the token ids to a string, you can set the flag "clean_up_tokenization_spaces" to False; again, this can be passed in the examples directly but is also needed in the appropriate utils methods for HF.
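In other words, where output_ids is a sequence produced by generate:

```python
# Disable HF's tokenization clean-up so spacing matches the fairseq output
# (e.g. keeping the space before a comma rather than stripping it).
text = tokenizer.decode(
    output_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)
```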

Also, to ensure identical results for longer text, max_length should be set to 202, not 200 as I mentioned above: FS treats the eos as extra, and there seems to be something else as well, since 201 was not sufficient. With all this, I've run it over several thousand texts (truncating inputs to 255 characters) and got identical outputs from FS and HF.
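For completeness, the full set of overrides that gave me matching outputs looks roughly like this (names are illustrative as before):

```python
outputs = model.generate(
    **inputs,
    num_beams=5,
    max_length=202,          # 200 as in fairseq, plus headroom for eos handling
    no_repeat_ngram_size=0,  # disable the HF config default of 3
)
texts = tokenizer.batch_decode(
    outputs,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,  # match fairseq spacing
)
```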

Tony

nicola-decao commented 3 years ago

Fixed! 🙂