BlueBrain / Search

Blue Brain text mining toolbox for semantic search and structured information extraction
https://blue-brain-search.readthedocs.io
GNU Lesser General Public License v3.0

[Discussion] What to do with URLs, e-mails, formulas etc. in articles? Is replacing by placeholders OK? #447

Closed · Stannislav closed this 2 years ago

Stannislav commented 3 years ago

PR #437 introduced the replacement of certain XML tags, like `<email>`, by placeholders. Discuss:

Details

This issue was spawned from the following thread in #437:

Isn't it common to do this kind of replacement in NLP?

Not for language modeling (see next comment below).

I think both of these approaches sound reasonable. It would be interesting to see whether/how other people treated emails, urls etc. in the context of NLP problems using transformer-based models. We can have a quick look and then decide what to do accordingly.

➡️ What about opening a dedicated issue and briefly discussing a couple of examples that we can find online there, and then implement a chosen strategy?


Here are my two cents on the subject. I know very little so I would be glad to change my mind if presented with info going in the opposite direction.

  1. I think this process of replacing sensitive/meaningless info with something else is sometimes referred to as "pseudonymization".
  2. I think I have seen this done also with BERT models. See e.g. this project, where they used "a transformer-based model pretrained on a large corpus of Twitter messages on the topic of COVID-19".

_Originally posted by @FrancescoCasalegno in https://github.com/BlueBrain/Search/pull/437#discussion_r710090707_
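For context, the kind of placeholder replacement introduced in #437 could look roughly like the following minimal sketch. It assumes JATS-like XML where `<email>` and `<ext-link>` elements carry the raw values, and a hypothetical tag-to-placeholder mapping; it is not the actual PR code.

```python
# Minimal sketch of replacing selected XML elements with placeholders.
# Hypothetical mapping and helper, not the bluesearch implementation.
from lxml import etree

PLACEHOLDERS = {"email": "EMAIL", "ext-link": "URL"}  # assumed mapping


def replace_with_placeholders(xml_string: str) -> str:
    """Replace the text of selected XML elements with fixed placeholders."""
    root = etree.fromstring(xml_string)
    for tag, placeholder in PLACEHOLDERS.items():
        for element in root.iter(tag):
            element.text = placeholder
    return etree.tostring(root, encoding="unicode")


print(replace_with_placeholders(
    "<p>Contact <email>jane.doe@example.org</email> for details.</p>"
))
# <p>Contact <email>EMAIL</email> for details.</p>
```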

pafonta commented 3 years ago

A preliminary result:

According to Muller, Benjamin et al. “Enhancing BERT for Lexical Normalization.” EMNLP (2019):

[BERT's] ability to handle noisy inputs is still an open question

pafonta commented 3 years ago

According to Kumar, Ankit et al. "Noisy Text Data: Achilles' Heel of BERT." WNUT (2020):

In this work we find that BERT fails to properly handle OOV words (due to noise). We show that this negatively impacts the performance of BERT on fundamental tasks in NLP when fine-tuned over noisy text data.

➡️ If the chosen placeholders are OOV (Out-Of-Vocabulary) words, they will negatively impact the performance of BERT.

According to the section "previous work" of Bagla, Kartikay et al. "Noisy Text Data: Achilles' Heel of popular transformer based NLP models." ArXiv (2021), no study exists on the effect of removing, replacing, or keeping elements like URLs, email addresses, or equations / formulas.

➡️ However, all studies show that Transformer-based models have a performance drop on noisy texts.

Stannislav commented 3 years ago

So, maybe the question is then whether bare URLs, LaTeX formulas, etc. mean more or less noise than the corresponding placeholders. I'm guessing both can be assumed to be OOV.

pafonta commented 3 years ago

Besides here and here, the creator of sentence-transformers said:

You can remove [URLs, emails, or equations]. Some of the training data also had URLs and emails, so I think the models are not too sensitive.

to the question:

For best STS / NLI performance, should URLs, emails, or equations be removed, replaced by a placeholder or kept?

Details are here.

pafonta commented 3 years ago

So, maybe the question is then whether bare URLs, LaTeX formulas, etc. mean more or less noise than the corresponding placeholders. I'm guessing both can be assumed to be OOV.

Actually, as pointed out by the creator of sentence-transformers, BERT has been pre-trained on data containing URLs, emails, etc. So, such elements are less noisy than placeholders.

One could verify this by checking that, for example, http, www, and @ are part of the BERT vocabulary here: https://huggingface.co/bert-base-cased/blob/main/vocab.txt.
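A minimal sketch of that check with the Hugging Face `transformers` library might look like this (the exact vocabulary entries and tokenizations depend on the model):

```python
# Sketch: inspect the bert-base-cased WordPiece vocabulary and compare how a
# raw URL tokenizes versus an all-caps placeholder. Illustrative only.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
vocab = tokenizer.get_vocab()

# Are typical URL/email fragments and candidate placeholders in the vocabulary?
for token in ["http", "www", "@", "URL", "EMAIL"]:
    print(token, token in vocab)

# How does a raw URL tokenize compared to a placeholder?
print(tokenizer.tokenize("See https://example.org for details."))
print(tokenizer.tokenize("See URL for details."))
```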

pafonta commented 3 years ago

Proposed action

Instead of using placeholders (URL, EMAIL, ...), just remove the elements handled in:

https://github.com/BlueBrain/Search/blob/37c45571831cbec4aa7aca40950129fa0b199af5/src/bluesearch/database/article.py#L392-L399
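If removal is the chosen strategy, a minimal sketch of what it could look like on the XML side is below. The tag set and helper are hypothetical and this is not the actual bluesearch code linked above.

```python
# Sketch: drop selected XML elements entirely instead of inserting placeholders,
# while keeping the surrounding sentence readable. Hypothetical tag set.
from lxml import etree

REMOVABLE_TAGS = {"email", "ext-link", "inline-formula"}  # assumed tags


def remove_elements(xml_string: str) -> str:
    """Drop selected XML elements (and their text) from the article markup."""
    root = etree.fromstring(xml_string)
    for tag in REMOVABLE_TAGS:
        for element in list(root.iter(tag)):
            parent = element.getparent()
            if parent is None:
                continue
            # Preserve the tail text so the rest of the sentence stays intact.
            if element.tail:
                previous = element.getprevious()
                if previous is not None:
                    previous.tail = (previous.tail or "") + element.tail
                else:
                    parent.text = (parent.text or "") + element.tail
            parent.remove(element)
    return etree.tostring(root, encoding="unicode")


print(remove_elements(
    "<p>Contact <email>jane.doe@example.org</email> for details.</p>"
))
# <p>Contact  for details.</p>
```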

Please confirm you have read this proposal and make your comments below if needed:

EmilieDel commented 3 years ago

Thanks for the investigation, @pafonta! I am not sure I totally understand why the maintainer of the sentence-transformers repository replied:

You can remove [URLs, emails, or equations]. Some of the training data also had URLs and emails, so I think the models are not too sensitive.

As for me, if the models are not too sensitive, it means we should maybe keep the sentence as-is instead of completely removing the [URLs, emails, or equations], but maybe I am wrong. What is your thought about this?

Stannislav commented 3 years ago

I'm fine with removing. Sadly, the answers on the huggingface issue (thanks PA for submitting it) are a bit vague and not well substantiated. But we can try and see what we get.

pafonta commented 3 years ago

Hello @EmilieDel,

As for me, if the models are not too sensitive, it means we should maybe keep the sentence as-is instead of completely removing the [URLs, emails, or equations], but maybe I am wrong. What is your thought about this?

Basically, one could either remove them or leave them as-is (but not replace them with placeholders).

The references above conclude that OOV tokens are harmful. Removing these elements prevents having such OOV tokens.

We can definitely open a ticket to benchmark performance with and without these elements. However, this implies substantial work and might be premature considering the current status of article parsing.

That's why I have proposed going with his first suggestion: removing them.

EmilieDel commented 3 years ago

Thanks for the reply @pafonta! I also think this is the way to go! :)

pafonta commented 3 years ago

Hello @EmilieDel, @Stannislav, and @jankrepl,

Thank you for the discussion. I have then created the follow-up issue: https://github.com/BlueBrain/Search/issues/502.

Let's wait for the planning before closing the current issue. Indeed, @FrancescoCasalegno has not checked his box yet.