WheatonCS / Lexos

Python/Flask-based website for text analysis workflow. Previous (stable) release is live at:
http://lexos.wheatoncollege.edu
MIT License

Inappropriate treatment of tags when scrubbing #638

Closed EmmaGrace closed 5 years ago

EmmaGrace commented 7 years ago

So right now we have a "scrub tags" option, which allows us to (mostly) treat all HTML tags on a case-by-case basis when scrubbing. This is good. What we don't have is an "I don't care about tags in the slightest" or a "my text has no tags" button that will allow lowercase, whitespace, digits, and punctuation scrubbing to act on the entire text with impunity. We should implement this.

Right now, even if scrub tags is turned off, these scrubbing actions will not be applied to anything that "looks like a tag" -- where the requirements for looking like a tag are as loose as an opening angle bracket and a closing angle bracket appearing on the same line. There is nothing the user can do to turn off this functionality, because regardless of the options they check, their text will always be fed into general_functions.apply_function_exclude_tags(), which excludes tags when applying our punctuation removal function, etc. We simply do not have a similar general function written that does not exclude tags.
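To make the behaviour concrete, here is a minimal sketch of tag-excluding application. It is not the actual Lexos code: the regex, the helper names `apply_excluding_tags` and `remove_punctuation`, and the single-function signature are all assumptions chosen to mirror the description above (an opening and a closing angle bracket on the same line).

```python
import re
import string

# Assumption: "looks like a tag" means '<' ... '>' on one line.
# '.' does not match newlines by default, so the match stays on one line.
LOOSE_TAG = re.compile(r"(<.*?>)")


def remove_punctuation(text):
    """Strip ASCII punctuation (hypothetical stand-in for a scrubbing function)."""
    return text.translate(str.maketrans("", "", string.punctuation))


def apply_excluding_tags(text, func):
    """Apply func to everything except spans that look like tags."""
    # Splitting on a capturing group keeps the matched "tags" in the list,
    # at the odd indices; the even indices hold the ordinary text.
    parts = LOOSE_TAG.split(text)
    return "".join(
        part if i % 2 else func(part) for i, part in enumerate(parts)
    )


sample = "<[|:-P 00(> Happy birthday, friend!"
result = apply_excluding_tags(sample, remove_punctuation)
print(result)
```

In this sketch the emoticon survives untouched while the rest of the line loses its punctuation, which is the behaviour the user currently cannot switch off.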

Here are a couple of examples which should illustrate why this is a problem. First, let's say I am a comedically-gifted computer programmer, so I write something like: "<Emma's hilarious joke> Why did the chicken cross the road? Because seven ate nine! </Emma's hilarious joke>" You want to use Lexos to quantify how funny I think I am, so you want to scrub all the punctuation, including my fake HTML formatting, and count how many times I call my joke "hilarious." Well, currently you can't.

Or, let's say I am an artistically-talented programmer and it is your birthday, so I send you a beautifully-written email with an emoticon of you in a party hat eating a double scoop of ice cream: "<[|:-P 00(>" You want to apply Lexos to my email because you don't think I could actually have written such a brilliant missive, but to do so you must scrub away all those ice cream-related digits and punctuation. Right now, you are powerless before the angle brackets, and you cannot destroy my art.

Naturally I feel very good that you have been thwarted, but Lexos should let you do it, right?

scottkleinman commented 7 years ago

Would this be as simple as skipping the tag handling function if the Scrub Tags check box is left unchecked? Would that cause any unexpected consequences?

EmmaGrace commented 7 years ago

No, we already skip the tag handling function if scrub tags is unchecked. The problem is that regardless of whether the user wants to scrub tags, all text that goes through scrub() is passed to general_functions.apply_function_exclude_tags(). We would need to write a second general function that does not exclude tags, and then use the tags boolean to decide which of the two to call.
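A rough sketch of that proposal might look like the following. Everything here is hypothetical: the names `apply_excluding_tags`, `apply_to_whole_text`, and `scrub_sketch`, the list-of-functions signature, and the loose tag regex are illustrations, not the Lexos API. It also assumes that an unchecked Scrub Tags box means nothing should be protected.

```python
import re
import string

# Assumed loose definition of a tag: '<' ... '>' on the same line.
LOOSE_TAG = re.compile(r"(<.*?>)")


def remove_punctuation(text):
    return text.translate(str.maketrans("", "", string.punctuation))


def remove_digits(text):
    return re.sub(r"\d", "", text)


def apply_excluding_tags(text, funcs):
    """Existing behaviour (sketched): leave tag-like spans untouched."""
    parts = LOOSE_TAG.split(text)
    for i in range(0, len(parts), 2):  # even indices are non-tag text
        for func in funcs:
            parts[i] = func(parts[i])
    return "".join(parts)


def apply_to_whole_text(text, funcs):
    """Proposed second function: apply every function to the entire text."""
    for func in funcs:
        text = func(text)
    return text


def scrub_sketch(text, funcs, tags):
    """Choose the application strategy from the tags boolean."""
    apply = apply_excluding_tags if tags else apply_to_whole_text
    return apply(text, funcs)


art = "<[|:-P 00(> Happy birthday!"
funcs = [remove_punctuation, remove_digits]
protected = scrub_sketch(art, funcs, tags=True)   # emoticon is left alone
scrubbed = scrub_sketch(art, funcs, tags=False)   # emoticon is scrubbed too
```

With `tags=False` the angle brackets get no special treatment, so the punctuation and digits inside them are removed along with everything else.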

mlimoges commented 5 years ago

Resolved by pull request #980.