biolab / orange3-text

🍊 :page_facing_up: Text Mining add-on for Orange3

Preprocess text: first word in custom stopwords list is ignored #1028

Closed wvdvegte closed 2 months ago

wvdvegte commented 11 months ago

Describe the bug: In custom .txt (UTF-8) stopwords files, the first word is ignored as a stopword by Preprocess Text, i.e., it is not filtered out.

To Reproduce: Create a custom stopwords .txt file in UTF-8 encoding (in my case, I used MS Word), consisting of words separated by returns, and load it in Preprocess Text. The first word will not be filtered out, but the rest will. Leaving the first line empty solves the problem, but it's not the obvious thing to do.

Expected behavior: All custom stopwords should be filtered out.

Orange version: 3.36.2 (I don't know if it's the native Silicon version or the Intel version)

Text add-on version: 1.15.0

Operating system: Mac OS 14.1.2 (23B92)

ajdapretnar commented 4 months ago

This is an editor issue. When I use Sublime text, the file contains word1\nword2. When I use TextEdit (OSX), the file contains '{\\rtf1\\ansi\\ansicpg1252\\cocoartf2636\n\\cocoatextscaling0\\cocoaplatform0{\\fonttbl\\f0\\fswiss\\fcharset0 Helvetica;}\n{\\colortbl;\\red255\\green255\\blue255;}\n{\\*\\expandedcolortbl;;}\n\\paperw11900\\paperh16840\\margl1440\\margr1440\\vieww11520\\viewh8400\\viewkind0\n\\pard\\tx566\\tx1133\\tx1700\\tx2267\\tx2834\\tx3401\\tx3968\\tx4535\\tx5102\\tx5669\\tx6236\\tx6803\\pardirnatural\\partightenfactor0\n\n\\f0\\fs24 \\cf0 of\\\nsystem}'. I think MS Word does the same. You could test with:

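# Read with an explicit encoding so the result does not depend on the platform default.
with open("path/to/file.txt", encoding="utf-8") as f:
    content = f.read()
# repr() makes invisible characters (e.g. \ufeff) and escapes visible.
print(repr(content))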

See what you get.
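If the decoded text looks fine, the raw bytes can also be checked directly. The following is only a sketch: `sniff_prefix` and `sniff_file` are hypothetical helper names, not part of Orange or the Text add-on, and they look for just two prefixes that come up in this thread (a UTF-8 byte order mark and an RTF header).

```python
import codecs


def sniff_prefix(raw: bytes) -> str:
    """Classify the first bytes of a stopword file (hypothetical helper)."""
    if raw.startswith(codecs.BOM_UTF8):  # the three bytes EF BB BF
        return "utf-8 with BOM"
    if raw.startswith(b"{\\rtf"):  # rich text header, as in the TextEdit output
        return "rtf"
    return "no known prefix"


def sniff_file(path: str) -> str:
    # Binary mode, so no decoding step hides or alters the leading bytes.
    with open(path, "rb") as f:
        return sniff_prefix(f.read(8))
```

Calling `sniff_file("path/to/file.txt")` on the suspect file should say which of the two cases, if either, applies.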

ajdapretnar commented 4 months ago

@markotoplak Is there a way we could sanitize this internally?

janezd commented 4 months ago

@ajdapretnar, I guess you are saving text as rich text format (rtf), not plain text.

@wvdvegte probably has a different problem.

ajdapretnar commented 4 months ago

I thought the first row was not being filtered because, in RTF, the formatting parameters get treated as text. So instead of a plain "orange" one would get "{fancyparam:15}orange", which no longer matches the word, and thus the word would not be filtered.

wvdvegte commented 4 months ago

I was indeed referring to the use of plain text (TXT), not RTF.

ajdapretnar commented 4 months ago

@wvdvegte Could you perhaps send the stopword list? I cannot replicate the issue, so perhaps there's something about the file that is the problem. Thanks!

wvdvegte commented 4 months ago

I didn't manage to dig up what I was working on when I reported this in December 2023, but when I try to reproduce the problem now, none of the custom stopwords are filtered out: stopword filtering.zip

ajdapretnar commented 4 months ago

Thank you! Now I've finally managed to reproduce the issue. As I suspected, it's the editor. The string reads: '\ufeffpig\ncow\nchicken\nhorse\n'. The first character is a BOM (byte order mark), typical for Windows apparently, so the first stopword is read as '\ufeffpig' instead of 'pig' and never matches. We can solve this by reading the file with encoding='utf-8-sig', which drops a leading BOM. Will prepare and test the fix.
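To illustrate the difference between the two encodings, here is a minimal sketch. It recreates a file with the same contents as the string quoted above; `load_stopwords` is a hypothetical stand-in for the add-on's loader (one word per line, blank lines skipped), not its actual code.

```python
import os
import tempfile

# Recreate a file like the one in the report: UTF-8 with a leading BOM.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write("\ufeffpig\ncow\nchicken\nhorse\n".encode("utf-8"))
    path = tmp.name


def load_stopwords(path, encoding):
    """Hypothetical loader: one stopword per line, blank lines skipped."""
    with open(path, encoding=encoding) as f:
        return {line.strip() for line in f if line.strip()}


plain = load_stopwords(path, "utf-8")      # BOM stays attached to the first word
fixed = load_stopwords(path, "utf-8-sig")  # BOM is dropped while decoding
os.remove(path)

print("pig" in plain, "pig" in fixed)  # prints: False True
```

Note that `str.strip()` does not remove `\ufeff`, since it is not a whitespace character, so trimming lines alone would not help. The `utf-8-sig` codec also reads files without a BOM unchanged, so it should be safe for stopword lists saved by other editors. This would also be consistent with the reported workaround: with an empty first line, the BOM ends up on that line rather than on the first word.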

wvdvegte commented 4 months ago

Typical for Microsoft, perhaps? I created the text file using Word for Mac ...