artefactory / NLPretext

All the goto functions you need to handle NLP use-cases, integrated in NLPretext
https://artefactory.github.io/NLPretext/
Apache License 2.0

Inconsistent exception is raised when a series containing NaNs is passed to `nlpretext.basic.preprocess.remove_stopwords` #205

Closed julesbertrand closed 1 year ago

julesbertrand commented 2 years ago

🐛 Bug Report

When using the remove_stopwords function on a text column that has empty values, nlpretext raises inconsistent exceptions (about the language choice rather than about the empty input).

🔬 How To Reproduce

Steps to reproduce the behavior:

  1. Load the data, convert it to a DataFrame, and concatenate the two text columns without a space between them. Some rows will be empty.

  2. Try using remove_stopwords

Code sample

import pandas as pd
from nlpretext.basic.preprocess import remove_stopwords

data = {'overview': {
  0: 'Comme les Mousquetaires dont elles possèdent le cran',
  1: 'New York, été 1977. Alors que la ville connait une canicule historique, un tueur en série, The Son of Sam, frappe dans le quartier italo-américain de South Bronx.',
  2: '',
  3: "Félicia, dix-sept ans, traverse la mer d'Irlande, avec pour tout renseignement le nom de la ville où habite son amant pour lui annoncer sa grossesse.",
  4: "Arthur Bishop pensait qu'il avait mis son passé de tueur à gages derrière lui. Il coule maintenant des jours heureux avec sa compagne dans l'anonymat."},
 'tagline': {0: '', 1: '', 2: '', 3: '', 4: 'Il reprend du service.'}
}

data = pd.DataFrame(data)

data["text"] = data["tagline"] +  data["overview"]

data["text"].map(lambda x: remove_stopwords(x, lang='fr'))

Environment

Screenshots

First exception: (screenshot, 2022-03-22 15:52:41)

Then, when replacing 'fr' by 'fr_scpacy': (screenshot, 2022-03-22 15:53:00)

📈 Expected behavior

Either remove the stopwords without errors (convert NaNs to strings?), or raise an exception saying "your text column contains NaNs, please fix it".
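To make the two options concrete, here is a minimal sketch of a guard around the stopword function. It is not NLPretext code: `is_missing`, `guarded` and `toy_remove_stopwords` are hypothetical names, and `toy_remove_stopwords` is only a stand-in so the snippet runs without the library installed. The idea is that empty or NaN values are either skipped or rejected with an explicit message before the language handling is ever reached.

```python
import math
from typing import Callable


def is_missing(value: object) -> bool:
    """Return True for None, a float NaN, or a blank string."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and value.strip() == ""


def guarded(remove_fn: Callable[[str], str], on_missing: str = "skip") -> Callable[[object], str]:
    """Wrap a stopword remover so missing values never reach it.

    on_missing="skip" returns an empty string for missing values,
    on_missing="raise" raises a ValueError with an explicit message.
    """
    def wrapper(text: object) -> str:
        if is_missing(text):
            if on_missing == "raise":
                raise ValueError(
                    "your text column contains empty or NaN values, please fix it"
                )
            return ""
        return remove_fn(text)

    return wrapper


def toy_remove_stopwords(text: str) -> str:
    """Stand-in for the real function (assumption: tiny hard-coded stopword list)."""
    stopwords = {"le", "la", "les", "de", "du"}
    return " ".join(w for w in text.split() if w.lower() not in stopwords)


safe = guarded(toy_remove_stopwords)                      # option 1: skip missing values
strict = guarded(toy_remove_stopwords, on_missing="raise")  # option 2: fail loudly
```

With the real library, the same wrapper could presumably be applied as `guarded(lambda x: remove_stopwords(x, lang='fr'))` and passed to `.map`, though that has not been checked here.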

📎 Additional context

Workaround: data["text"] = data["tagline"] + " " + data["overview"] solves it as all rows will be non-empty strings.
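The effect of the workaround can be checked without pandas or nlpretext. The sketch below uses shortened copies of three rows from the code sample (plain Python lists, an assumption made for brevity) and lists which rows end up as empty strings under each concatenation.

```python
# Shortened copies of rows 0, 2 and 4 from the code sample above.
overview = ["Comme les Mousquetaires", "", "Arthur Bishop"]
tagline = ["", "", "Il reprend du service."]

# Concatenation as in the bug report: no separator.
joined_plain = [t + o for t, o in zip(tagline, overview)]
# Concatenation as in the workaround: a single space as separator.
joined_space = [t + " " + o for t, o in zip(tagline, overview)]

# Indices of rows that are exactly the empty string.
empty_plain = [i for i, s in enumerate(joined_plain) if s == ""]
empty_space = [i for i, s in enumerate(joined_space) if s == ""]

print(empty_plain)  # rows that would hit the error
print(empty_space)  # rows still empty after the workaround
```

Note that with the workaround the formerly empty row becomes a single space, so it is non-empty but still carries no text.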

github-actions[bot] commented 2 years ago

Hello @julesbertrand, thank you for your interest in our work!

If this is a bug report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we can not help you.