A preliminary result:
According to Muller, Benjamin et al. "Enhancing BERT for Lexical Normalization." EMNLP (2019):
[BERT's] ability to handle noisy inputs is still an open question
According to Kumar, Ankit et al. "Noisy Text Data: Achilles' Heel of BERT." WNUT (2020):
In this work we find that BERT fails to properly handle OOV words (due to noise). We show that this negatively impacts the performance of BERT on fundamental tasks in NLP when fine-tuned over noisy text data.
➡️ If the chosen placeholders are OOV (Out-Of-Vocabulary) words, this will negatively impact the performance of BERT.
According to the "Previous work" section of Bagla, Kartikay et al. "Noisy Text Data: Achilles' Heel of popular transformer based NLP models." ArXiv (2021), no study exists on the effect of removing, replacing, or keeping elements like URLs, email addresses, or equations/formulas.
➡️ However, all studies show that Transformer-based models have a performance drop on noisy texts.
So, maybe the question is then whether bare URLs, LaTeX formulas, etc. mean more or less noise than the corresponding placeholders. I'm guessing both can be assumed to be OOV.
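To make the OOV point concrete, here is a minimal sketch (assuming the transformers package and the bert-base-cased tokenizer; the example strings are hypothetical) showing how a placeholder and a bare URL get split into word pieces:

```python
# Minimal sketch: see how a placeholder and a bare URL are split by the
# bert-base-cased WordPiece tokenizer (requires the `transformers` package).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# A string split into many "##" pieces is effectively out-of-vocabulary.
for text in ["URL", "<URL>", "https://www.example.com/paper.pdf"]:
    print(text, "->", tokenizer.tokenize(text))
```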
Besides here and here, the creator of sentence-transformers said:
You can remove [URLs, emails, or equations]. Some of the training data also had URLs and emails, so I think the models are not too sensitive.
to the question:
For best STS / NLI performance, should URLs, emails, or equations be removed, replaced by a placeholder or kept?
Details are here.
Actually, as pointed out by the creator of sentence-transformers, BERT has been pre-trained on data containing URLs, emails, etc. So, such elements are less noisy than placeholders.
One could verify this by checking that, for example, http, www, and @ are part of the BERT vocabulary here: https://huggingface.co/bert-base-cased/blob/main/vocab.txt.
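For convenience, the same check can be done programmatically instead of searching vocab.txt by hand; a small sketch, assuming the transformers package is installed:

```python
# Sketch: check whether URL-related strings are part of the bert-base-cased
# vocabulary (the tokenizer loads the same vocab.txt linked above).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
vocab = tokenizer.get_vocab()  # maps token string -> integer id

for token in ["http", "www", "@", "URL", "EMAIL"]:
    print(f"{token!r}: {'in vocabulary' if token in vocab else 'NOT in vocabulary'}")
```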
Instead of using placeholders (URL, EMAIL, ...), just remove the elements handled in:
Please confirm you have read this proposal and made your comments below if needed:
Thanks for the investigation, @pafonta! I am not sure I totally understand why the maintainer of the sentence-transformers repository replied:
You can remove [URLs, emails, or equations]. Some of the training data also had URLs and emails, so I think the models are not too sensitive.
as for me, if the models are not too sensitive, it means we should maybe keep the sentence as is instead of removing the [URLs, emails, or equations] completely, but maybe I am wrong. What are your thoughts about this?
I'm fine with removing. Sadly, the answers on the huggingface issue (thanks PA for submitting it) are a bit vague and not very well substantiated. But we can try and see what we get.
Hello @EmilieDel,
as for me, if the models are not too sensitive, it means we should maybe keep the sentence as is instead of removing the [URLs, emails, or equations] completely, but maybe I am wrong. What are your thoughts about this?
Basically, one could either remove them or leave them as-is (but not replace them with placeholders).
These references conclude that OOV tokens are harmful. Removing these elements prevents such OOV tokens from appearing.
We can definitely open a ticket to benchmark performance when these elements are removed or not. However, this implies substantial work and might be premature considering the current status of article parsing.
That's why I have proposed going with his first suggestion: removing them.
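For illustration, a minimal sketch of what removing these elements (rather than replacing them with placeholders) could look like; the regexes are hypothetical and not the patterns actually used in the project's parsing code:

```python
# Sketch: strip URLs and e-mail addresses from a sentence instead of replacing
# them with placeholders. Illustrative regexes only, not the project's patterns.
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def remove_urls_and_emails(text: str) -> str:
    text = URL_RE.sub("", text)
    text = EMAIL_RE.sub("", text)
    # Collapse the whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()

print(remove_urls_and_emails(
    "Contact jane.doe@example.org or see https://example.org/docs for details."
))
```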
Thanks for the reply @pafonta! I also think this is the way to go! :)
Hello @EmilieDel, @Stannislav, and @jankrepl,
Thank you for the discussion. I have then created the follow-up issue: https://github.com/BlueBrain/Search/issues/502.
Let's wait for the planning to close the current issue. Indeed, @FrancescoCasalegno has not checked his box yet.
The PR #437 introduced the replacement of certain XML tags like <email> etc. by placeholders. Discuss:
Details
This issue was spawned from the following thread in #437:
I think both of these approaches sound reasonable. It would be interesting to see whether/how other people treated emails, urls etc. in the context of NLP problems using transformer-based models. We can have a quick look and then decide what to do accordingly.
➡️ What about opening a dedicated issue, briefly discussing there a couple of examples that we can find online, and then implementing a chosen strategy?
Here are my two cents on the subject. I know very little so I would be glad to change my mind if presented with info going in the opposite direction.
_Originally posted by @FrancescoCasalegno in https://github.com/BlueBrain/Search/pull/437#discussion_r710090707_