Watts-College / paf-514-template

https://watts-college.github.io/paf-514-template/

Lab 3 - Hair Spaces Not Removing #75

Closed ClaudiaHebert closed 1 month ago

ClaudiaHebert commented 2 months ago

@castower In the text processing steps in Lab 3, we're instructed to: remove strange spaces, remove HTML tags, and remove “hair space” dash marks.

Removing HTML tags seems to work, but I cannot figure out what's going on with the hair spaces (<U+200A>—<U+200A>).

The code I have is: d$title <- gsub("<U+200A>—<U+200A>", " ", d$title).

This code runs fine but when I actually view the data these hair spaces are still present. However when I test with

any(grepl("<U+200A>—<U+200A>", d$title))

it returns FALSE, telling me that there are none in the data. But again, I can visibly see them in the data. Any ideas what's going on here?

I'm also not sure if removing other strange spaces worked or not since that's harder to visibly check.

castower commented 2 months ago

Hello @ClaudiaHebert,

The plus symbols are a bit tricky in regex because "+" is one of the quantifiers: "U+" means "one or more U", so your pattern never matches the literal text <U+200A>. To match a plus sign literally, you have to escape it (see the cheatsheet for details on the quantifiers and escaping: https://github.com/Watts-College/cpp-528-fall-2021/blob/main/lectures/RegExCheatsheetInR.pdf).

For example: gsub("<U\\+200A>", "", df$title)
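
To make the escaping point concrete, here is a minimal sketch. The `titles` vector is made-up sample data standing in for `d$title`, and it assumes the titles contain the literal text `<U+200A>`. The last step covers a different case that may or may not apply to the lab data: some consoles print a real hair space character as `<U+200A>`, and then the pattern has to be the character itself.

```r
# Hypothetical sample titles (not the lab data); "\u2014" is the long dash.
titles <- c("Data Science<U+200A>\u2014<U+200A>A Primer", "Plain Title")

# Unescaped: "U+" is read as "one or more U", so nothing matches.
unescaped <- grepl("<U+200A>", titles)

# Escaped plus sign: matches the literal text.
escaped <- grepl("<U\\+200A>", titles)

# Alternative: fixed = TRUE treats the whole pattern as plain text.
fixed_match <- grepl("<U+200A>", titles, fixed = TRUE)

# Replace the whole "hair space, dash, hair space" run with one space.
cleaned <- gsub("<U\\+200A>\u2014<U\\+200A>", " ", titles)

# Assumption: if the data holds the actual U+200A character instead of
# the literal text, match the character itself.
real_space <- "Data\u200AScience"
cleaned_real <- gsub("\u200A", " ", real_space)

print(cleaned)
```

Running `any(grepl(...))` with the escaped pattern before and after the `gsub()` call is a quick way to confirm the replacement actually happened.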

pmorrizonaz commented 2 months ago

Hi @castower ,

Thank you for the help with the previous question, I was wondering the same thing. I had an additional question regarding the HTML tags. I found what I think looks like an additional HTML tag in some of the titles, such as row 11. The title there starts with:

<em class="markup--em markup--h3-em">

However, when I try to remove this with the same code as the other HTML tag (the same one Claudia used), I get an "unexpected symbol" error.

d$title <- gsub("<em class="markup--em markup--h3-em">", "", d$title)
Error: unexpected symbol in "d$title <- gsub("<em class\\="markup"

It looks like the error is stemming from the = sign after the word class. I tried to use the escaping method like you suggested for the plus symbol, but that didn't seem to work. Any help would be appreciated!

castower commented 2 months ago

Hello @pmorrizonaz ,

In R, quotes come in pairs, so if you place a set of double quotation marks inside another set of double quotation marks, part of your code ends up outside the quotes.

In your code above, R interprets this as the following:

quote1: "<em class="
unquoted: markup--em markup--h3-em
quote2: ">"

That middle unquoted part breaks the code.

Therefore, to resolve it, you can combine single and double quotes so that everything goes with the expected pair:

d$title <- gsub('<em class="markup--em markup--h3-em">', "", d$title)
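
As a small sketch of the quoting fix, the example below uses a made-up title (not taken from the lab data) and shows two equivalent ways to write the pattern. The final line is a different, more general technique that was not part of the answer above: a regular expression that strips anything that looks like an HTML tag. It assumes the titles contain no legitimate "<...>" text.

```r
# Hypothetical title for illustration only.
title <- '<em class="markup--em markup--h3-em">Why Write?</em>'

# Option 1: single quotes outside, double quotes inside.
a <- gsub('<em class="markup--em markup--h3-em">', "", title)

# Option 2: keep double quotes outside and escape the inner ones.
b <- gsub("<em class=\"markup--em markup--h3-em\">", "", title)

# Both give the same result; the closing </em> tag is still there.
same <- identical(a, b)

# General approach (an assumption, not the lab's required method):
# remove every "<...>" sequence in one pass.
no_tags <- gsub("<[^>]+>", "", title)

print(a)
print(no_tags)
```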

pmorrizonaz commented 2 months ago

Thank you @castower , that worked perfectly!

CTNovoa commented 2 months ago

Hi, everyone! I seemed to have a fair amount of HTML code (etc.) that I needed to clean up. I ended up using the grep() function to locate any titles that had ">" or "<" in them, and that helped pinpoint the characters/phrases that I still needed to clean. I hope that was an alright direction to go in.
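
For anyone who wants to try the same check, here is a rough sketch of that grep() approach. The `titles` vector is invented sample data, and the `regmatches()` step is an extra idea beyond what is described above: it lists the distinct tags that remain so they can be cleaned one by one.

```r
# Hypothetical sample titles (not the lab data).
titles <- c("<strong>Bold</strong> Title", "Plain Title", "Math: 3 > 2")

# Positions of titles that still contain "<" or ">".
idx <- grep("[<>]", titles)

# The matching titles themselves.
flagged <- grep("[<>]", titles, value = TRUE)

# Extra step: pull out the distinct tag-like pieces that remain.
tags <- unique(unlist(regmatches(titles, gregexpr("<[^>]+>", titles))))

print(flagged)
print(tags)
```

Note that the third sample title is flagged even though it holds no tag, so the flagged rows are worth reading before deleting anything.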

However, I am having trouble cleaning the following title:

image

I have run the following code, and it seems to run fine, but the titles don't seem to change:

image

CTNovoa commented 2 months ago

Never mind, I realized that it was the same issue of needing to escape certain characters. This has been resolved, thanks!