🐛 Bug Report
When using the remove_stopwords function, if your text column has empty values, nlpretext will raise inconsistent exceptions (about language choice).
🔬 How To Reproduce
Steps to reproduce the behavior:
Load the data, convert it to a DataFrame, and concatenate the two text columns without a space between them. Some rows will be empty.
Call remove_stopwords on the resulting column.
Code sample
import pandas as pd
from nlpretext.basic.preprocess import remove_stopwords
data = {'overview': {
0: 'Comme les Mousquetaires dont elles possèdent le cran',
1: 'New York, été 1977. Alors que la ville connait une canicule historique, un tueur en série, The Son of Sam, frappe dans le quartier italo-américain de South Bronx.',
2: '',
3: "Félicia, dix-sept ans, traverse la mer d'Irlande, avec pour tout renseignement le nom de la ville où habite son amant pour lui annoncer sa grossesse.",
4: "Arthur Bishop pensait qu'il avait mis son passé de tueur à gages derrière lui. Il coule maintenant des jours heureux avec sa compagne dans l'anonymat."},
'tagline': {0: '', 1: '', 2: '', 3: '', 4: 'Il reprend du service.'}
}
data = pd.DataFrame(data)
data["text"] = data["tagline"] + data["overview"]
data["text"].map(lambda x: remove_stopwords(x, lang='fr'))
Environment
OS: Google Colab
Python version: 3.7
Screenshots
First exception:
Then when replacing 'fr' by 'fr_scpacy':
📈 Expected behavior
Remove the stopwords without errors (convert NaNs to strings?), or raise an exception saying "your text column contains NaNs, please fix it".
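The explicit check asked for here could look roughly like the sketch below. The helper names `find_blank_rows` and `check_texts` are hypothetical and not part of nlpretext. The code uses only plain Python, so it could be run on something like `data["text"].tolist()` before calling remove_stopwords.

```python
import math
from typing import Any, List, Sequence


def find_blank_rows(texts: Sequence[Any]) -> List[int]:
    """Return the positions whose value is None, NaN, or an empty string."""
    blank = []
    for position, value in enumerate(texts):
        if value is None:
            blank.append(position)
        elif isinstance(value, float) and math.isnan(value):
            blank.append(position)
        elif isinstance(value, str) and value == "":
            blank.append(position)
    return blank


def check_texts(texts: Sequence[Any]) -> None:
    """Fail early with an explicit message instead of a misleading language error."""
    blank = find_blank_rows(texts)
    if blank:
        raise ValueError(
            f"Text column contains empty or NaN values at positions {blank}; "
            "please fill or drop them before removing stopwords."
        )
```

With the data from the code sample, the concatenated column has empty strings at positions 0, 2 and 3, so `check_texts` would raise before any stopword removal is attempted.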
📎 Additional context
Workaround: data["text"] = data["tagline"] + " " + data["overview"] solves it, as all rows will then be non-empty strings.
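Instead of padding with a space, another option is to leave empty rows untouched and call the stopword remover only on non-empty strings. This is a minimal sketch, assuming a hypothetical wrapper `apply_if_non_empty` that is not part of nlpretext. The `remover` argument stands in for a callable such as `lambda t: remove_stopwords(t, lang='fr')`.

```python
from typing import Any, Callable


def apply_if_non_empty(value: Any, remover: Callable[[str], str]) -> Any:
    """Call `remover` only on non-empty strings; return any other value unchanged."""
    if isinstance(value, str) and value:
        return remover(value)
    # Empty strings, None and NaN are passed through as-is.
    return value
```

It could then be used as data["text"].map(lambda x: apply_if_non_empty(x, lambda t: remove_stopwords(t, lang='fr'))), which keeps empty rows empty rather than turning them into a single space.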
Screenshot of the exception raised after replacing 'fr' by 'fr_scpacy':
![Capture d’écran 2022-03-22 à 15 53 00](https://user-images.githubusercontent.com/33326907/159511388-7524cc17-3b66-46fc-8c74-41592bb258d3.png)