wvdvegte closed this 2 months ago
This is an editor issue. When I use Sublime Text, the file contains 'word1\nword2'. When I use TextEdit (OSX), the file contains '{\\rtf1\\ansi\\ansicpg1252\\cocoartf2636\n\\cocoatextscaling0\\cocoaplatform0{\\fonttbl\\f0\\fswiss\\fcharset0 Helvetica;}\n{\\colortbl;\\red255\\green255\\blue255;}\n{\\*\\expandedcolortbl;;}\n\\paperw11900\\paperh16840\\margl1440\\margr1440\\vieww11520\\viewh8400\\viewkind0\n\\pard\\tx566\\tx1133\\tx1700\\tx2267\\tx2834\\tx3401\\tx3968\\tx4535\\tx5102\\tx5669\\tx6236\\tx6803\\pardirnatural\\partightenfactor0\n\n\\f0\\fs24 \\cf0 of\\\nsystem}'.
I think MS Word does the same. You could test with:
    with open("path/to/file.txt") as f:
        file = f.read()
    file
See what you get.
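If reading the file as text is inconclusive, looking at its first raw bytes can help. The following is only a sketch: `describe_file_start` is a hypothetical helper name, not part of Orange or the Text add-on, and it checks for just two cases, a UTF-8 byte order mark and an RTF header.

```python
import codecs


def describe_file_start(path):
    """Return a rough guess at what the first bytes of a file indicate.

    Hypothetical helper for illustration; it only recognises a UTF-8 BOM
    and an RTF header, nothing else.
    """
    with open(path, "rb") as f:
        head = f.read(5)
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8 with BOM"
    if head.startswith(b"{\\rtf"):
        return "rtf"
    return "no BOM or RTF header found"
```

Calling `describe_file_start("path/to/file.txt")` on the stopword file should show whether the editor saved something other than plain text.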
@markotoplak Is there a way we could sanitize this internally?
@ajdapretnar, I guess you are saving the text as Rich Text Format (RTF), not plain text.
@wvdvegte probably has a different problem.
I thought the reason the first row is not considered for filtering was that, in RTF, additional parameters get treated as text. So instead of a plain "orange" one would get "{fancyparam:15}orange", and thus the word would not be filtered.
I was indeed referring to the use of plain text (TXT), not RTF.
@wvdvegte Could you perhaps send the stopword list? I cannot replicate the issue, so perhaps there's something about the file that is the problem. Thanks!
I didn't manage to dig up what I was working on when I reported this in December 2023, but when I try to reproduce the problem now, none of the custom stopwords are filtered out: stopword filtering.zip
Thank you! Now I've finally managed to reproduce the issue.
As I suspected, it's the editor. The string reads: '\ufeffpig\ncow\nchicken\nhorse\n'. The first character is a BOM (byte order mark), apparently typical for Windows. We can solve this by reading the file with encoding='utf-8-sig'.
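A minimal sketch of the difference between the two encodings. The `read_stopwords` function is a hypothetical stand-in, not the add-on's actual loader; the temporary file imitates what a BOM-writing editor would save.

```python
import os
import tempfile


def read_stopwords(path, encoding="utf-8-sig"):
    """Read one stopword per line (hypothetical helper, not Orange's code).

    The 'utf-8-sig' codec drops a leading BOM if one is present and reads
    BOM-less UTF-8 files unchanged.
    """
    with open(path, encoding=encoding) as f:
        return [line.strip() for line in f if line.strip()]


# Imitate an editor that writes a BOM at the start of the file.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", encoding="utf-8-sig", delete=False
) as tmp:
    tmp.write("pig\ncow\nchicken\nhorse\n")
    path = tmp.name

with_bom = read_stopwords(path, encoding="utf-8")  # BOM stays on first word
without_bom = read_stopwords(path)                 # BOM is stripped
os.remove(path)

print(with_bom[0] == "pig")     # False
print(without_bom[0] == "pig")  # True
```

Note that `str.strip()` does not remove `\ufeff`, so trimming whitespace alone would not have hidden the problem.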
Will prepare and test the fix.
Typical for Microsoft, perhaps? I created the text file using Word for Mac ...
Describe the bug: In custom .txt (UTF-8) stopwords files, the first word is ignored as a stopword by Preprocess Text, i.e., it is not filtered out.
To Reproduce: Create a custom stopwords .txt file in UTF-8 encoding (in my case, I used MS Word), consisting of words separated by returns, and load it in Preprocess Text. The first word will not be filtered out but the rest will. Leaving the first line empty solves the problem, but it's not the obvious thing to do.
Expected behavior: All custom stopwords should be filtered out.
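The reported symptom, and why an empty first line works around it, can be sketched as follows. This assumes the file starts with a byte order mark (`\ufeff`) that survives decoding; `parse_stopwords` and `remove_stopwords` are hypothetical helpers for illustration and are not how Preprocess Text is actually implemented.

```python
def parse_stopwords(raw):
    """Split decoded file contents into a set of stopwords, one per line."""
    return {line.strip() for line in raw.splitlines() if line.strip()}


def remove_stopwords(tokens, stopwords):
    """Keep only the tokens that are not in the stopword set."""
    return [t for t in tokens if t not in stopwords]


tokens = ["pig", "cow", "barn"]

# BOM glued to the first word: "pig" is stored as "\ufeffpig" and never matches.
bom_first = parse_stopwords("\ufeffpig\ncow\n")
# Empty first line: the BOM ends up on its own line, so "pig" stays clean.
blank_first = parse_stopwords("\ufeff\npig\ncow\n")

print(remove_stopwords(tokens, bom_first))    # ['pig', 'barn']
print(remove_stopwords(tokens, blank_first))  # ['barn']
```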
Orange version: 3.36.2 (I don't know if it's the native Silicon version or the Intel version)
Text add-on version: 1.15.0
Operating system: macOS 14.1.2 (23B92)