Closed by ClaudiaHebert 1 month ago
Hello @ClaudiaHebert,
The plus symbol is tricky in regex because it is a quantifier: it means "one or more of the preceding character." To match a literal plus sign, you have to escape it with a double backslash (see the cheatsheet for details on quantifiers and escaping: https://github.com/Watts-College/cpp-528-fall-2021/blob/main/lectures/RegExCheatsheetInR.pdf).
For example:
gsub("<U\\+200A>", "", df$title)
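To see the difference escaping makes, here is a minimal sketch on a made-up vector (the `titles` values are hypothetical, not taken from the lab data). It also shows `fixed = TRUE`, a base R argument that treats the pattern as literal text, as one possible alternative to escaping.

```r
# Hypothetical sample titles; the first contains the literal text "<U+200A>"
titles <- c("Data Science<U+200A>for Beginners", "No markers here")

# Unescaped: "U+" means "one or more U", so the literal "+" never matches
unescaped <- gsub("<U+200A>", "", titles)

# Escaped: "\\+" matches a literal plus sign
escaped <- gsub("<U\\+200A>", "", titles)

# Alternative: fixed = TRUE treats the whole pattern as literal text
literal <- gsub("<U+200A>", "", titles, fixed = TRUE)

print(unescaped)  # unchanged, nothing matched
print(escaped)    # marker removed
print(literal)    # same result as the escaped version
```

Note that `gsub()` returns a new vector, so the result has to be assigned back (for example to `df$title`) for the change to stick.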
Hi @castower ,
Thank you for the help with the previous question, I was wondering the same thing. I had an additional question regarding the HTML tags. I found what I think looks like an additional HTML tag in some of the titles, such as row 11. The title there starts with:
<em class="markup--em markup--h3-em">
However, when I try to remove this with the same code as the other HTML tag (the same one Claudia used), I get an "unexpected symbol" error.
d$title <- gsub("<em class="markup--em markup--h3-em">", "", d$title)
Error: unexpected symbol in "d$title <- gsub("<em class\\="markup"
It looks like the error is stemming from the = sign after the word class. I tried to use the escaping method like you suggested for the plus symbol, but that didn't seem to work. Any help would be appreciated!
Hello @pmorrizonaz ,
In R, quotes come in pairs, so if you place a set of double quotation marks inside another set of double quotation marks, the inner quote closes the string early and part of your code ends up outside the quotes.
In your code above, R interprets this as the following:
quote1: "<em class="
unquoted: markup--em markup--h3-em
quote2: ">"
That middle unquoted part breaks the code.
Therefore, to resolve it, you can combine single and double quotes so that everything goes with the expected pair:
d$title <- gsub('<em class="markup--em markup--h3-em">', "", d$title)
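As a small sketch of the quoting fix above, the example below runs both the single-quote version and a second option, escaping the inner double quotes with a backslash. The sample `title` string is hypothetical and only stands in for a row like the one described.

```r
# Hypothetical sample title carrying the tag from the question
title <- '<em class="markup--em markup--h3-em">Why R?</em>'

# Option 1: single quotes outside, double quotes inside
a <- gsub('<em class="markup--em markup--h3-em">', "", title)

# Option 2: double quotes outside, inner double quotes escaped with \"
b <- gsub("<em class=\"markup--em markup--h3-em\">", "", title)

# Both options produce the same string
print(identical(a, b))

# The closing tag is a separate pattern and needs its own call
print(gsub("</em>", "", a))
```

Either option works; the single-quote version is usually easier to read when the pattern contains several double quotes.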
Thank you @castower , that worked perfectly!
Hi, everyone! I had a fair amount of HTML code (etc.) that I needed to clean out. I ended up using the grep() function to locate any titles that had ">" or "<" in them, and that helped pinpoint the characters/phrases that I still needed to clean. I hope that was an alright direction to go in.
However, I am having trouble cleaning the following title:
I have run the following code, and it seems to run fine, but the titles don't seem to change:
Never mind, I realized that it was the same issue of needing to escape certain characters. This has been resolved, thanks!
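The grep() approach described above for locating leftover markup could look roughly like this. The data frame `d` here is a hypothetical stand-in for the lab data, with made-up titles.

```r
# Hypothetical data frame standing in for the lab data
d <- data.frame(
  title = c("Clean title", "<strong>Bold</strong> title", "A > B"),
  stringsAsFactors = FALSE
)

# Row indices of titles that contain "<" or ">"
idx <- grep("[<>]", d$title)

# The matching titles themselves, for visual inspection
leftover <- grep("[<>]", d$title, value = TRUE)

print(idx)
print(leftover)
```

Inside square brackets, `<` and `>` are matched literally, so no escaping is needed for this check. A title like "A > B" is also flagged, so the results still need a manual look before anything is removed.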
@castower In the text processing steps in Lab 3, we're instructed to: remove strange spaces, remove HTML tags, and remove “hair space” dash marks.
Removing HTML tags seems to work but I cannot figure out what's going on with the hair spaces. (<U+200A>—<U+200A>).
The code I have is: d$title <- gsub("<U+200A>—<U+200A>", " ", d$title).
This code runs fine, but when I actually view the data these hair spaces are still present. However, when I test with
any(grepl("<U+200A>—<U+200A>", d$title))
it returns FALSE, telling me that there are none in the data. But again, I can visibly see them in the data. Any ideas what's going on here?
I'm also not sure if removing other strange spaces worked or not since that's harder to visibly check.
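Two things may be going on here, and the sketch below covers both under stated assumptions. First, the unescaped `+` in the pattern is read as a quantifier, so the pattern never matches the literal text `<U+200A>`. Second, and this is only a guess since the data are not shown, the titles may hold the real hair-space character (U+200A), which some R consoles print as `<U+200A>`; in that case the pattern has to target the character itself via `"\u200a"`. The `titles` vector is hypothetical, and the final check for leftover odd characters is one possible approach, not the lab's prescribed method.

```r
# Hypothetical titles: one with real hair-space characters around an em dash
# (written as \u2014), one with the literal text "<U+200A>"
titles <- c("Part one\u200a\u2014\u200apart two",
            "Part one<U+200A>\u2014<U+200A>part two")

# Case 1: real Unicode hair spaces, matched with the \u escape
step1 <- gsub("\u200a\u2014\u200a", " ", titles)

# Case 2: literal "<U+200A>" text; fixed = TRUE keeps "+" literal
step2 <- gsub("<U+200A>\u2014<U+200A>", " ", step1, fixed = TRUE)

# Check what is left: flag titles that still contain any character
# outside the printable ASCII range (space through tilde)
still_odd <- grepl("[^\\x20-\\x7E]", step2, perl = TRUE)

print(step2)
print(still_odd)
```

Running the same `grepl()` check on the real column gives a way to verify whether other strange spaces are gone, since they are hard to spot by eye. It will also flag legitimate non-ASCII letters, so a flagged title is a prompt to look closer, not proof of a problem.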