Watts-College / cpp-527-fall-2021

A course shell for CPP 527 Foundations of Data Science II
https://watts-college.github.io/cpp-527-fall-2021/

Special character Â causing problems when splitting words of title--Lab 03 #22

Open mtwelker opened 3 years ago

mtwelker commented 3 years ago

When I split the titles into words, in many cases the last two words of the title didn't split apart.

Here's my code:

# Split each title into a group of distinct words.
title.words.list <- strsplit(d$title, " ")
# Check results
head(title.words.list, 50)

And here's a screenshot of the results, with the problematic words highlighted: [screenshot]

I went back to those titles as they were before text pre-processing and found that the problematic spaces all occurred where I had removed the special character Â (which I did after replacing the "hair spaces"), as follows:

# replace all versions of space, including special styles like the 'hair space', with regular spaces
d$title <- gsub( "\\s", " ", d$title )

d$title <- gsub("Â", "", d$title)

Here's an example of what they looked like before pre-processing: [screenshot]

When I realized that, I also tried replacing the special character and the space with just a space, in each of these two ways:

d$title <- gsub("Â ", " ", d$title)
d$title <- gsub("Â\\s ", " ", d$title)

But each of those appeared to have no effect. The Â was still there.

I also tried d$title <- gsub("\\Â", "", d$title), but it had exactly the same effect as d$title <- gsub("Â", "", d$title). That is to say, it removed the Â, but then wouldn't split those two words at that space.

Is there a special way to remove this special character and allow the space next to it to be recognized as a regular space? Thank you!

lecy commented 3 years ago

Yeah, it's an issue as word processors get fancier and incorporate nuanced typesetting. There are now dozens of different stylized white spaces!

https://en.wikipedia.org/wiki/Whitespace_character

Don't worry about perfection for this lab. If this were a real project for a client we would spend the time finding all of these weird cases and cleaning them up.

At this stage I want you to grasp the basics of text manipulation. As long as you have a sense of how you would approach data cleaning if you were interested in fixing all of these issues you are fine.

The spaces are the worst!

lecy commented 3 years ago

Try copying the special space in R and using it as the find-and-replace character, with a regular space as the substitute?

I feel like that won't work but I can't remember why...

mtwelker commented 3 years ago

Thanks for that idea -- I tried previewing the titles in the R console, then copied the offending characters from there, but it didn't make any difference. It was worth a try! I'll just go with the last word or two-word phrase, as the case may be.
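Copying a character out of the console may not preserve it, so one way to see what is actually in a string is to print the Unicode code point of every character. The snippet below is only a sketch: the `title` value is a made-up example (not taken from the lab data) that assumes the stubborn character is a non-breaking space, U+00A0.

```r
# Hypothetical title with a non-breaking space (U+00A0) before the last word
title <- "Cleaning Text in\u00a0R"

# Convert each character to its Unicode code point; a regular space is U+0020
code_points <- sprintf("U+%04X", utf8ToInt(title))
print(code_points)

# Keep only the characters outside plain ASCII, i.e. the suspicious ones
non_ascii <- unique(code_points[utf8ToInt(title) > 127])
print(non_ascii)
```

Anything listed in `non_ascii` can then be targeted in `gsub()` with a `"\uXXXX"` escape instead of a pasted character.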

lecy commented 3 years ago

You've reached the boundary of my knowledge.

If you search for it, this is what comes up. They explain what that character is at least:

https://stackoverflow.com/questions/1461907/html-encoding-issues-%C3%82-character-showing-up-instead-of-nbsp
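As that link describes, Â typically appears when a UTF-8 non-breaking space (bytes C2 A0) is read as Latin-1: C2 displays as Â, and A0 remains a non-breaking space. That would explain why deleting Â leaves a "space" that `strsplit(x, " ")` does not split on, and `"\\s"` may not match U+00A0 depending on the regex engine and locale. Here is a minimal sketch of one possible fix, assuming that is what happened; the `titles` vector is invented for illustration.

```r
# Hypothetical titles: one with a bare non-breaking space (U+00A0),
# one with the mis-decoded pair "Â" (U+00C2) + non-breaking space
titles <- c("Data Science\u00a0Tools", "Cleaning Text in\u00c2\u00a0R")

# Drop any stray Â, then turn non-breaking spaces into regular spaces.
# fixed = TRUE matches the literal characters, with no regex interpretation.
clean <- gsub("\u00c2", "", titles, fixed = TRUE)
clean <- gsub("\u00a0", " ", clean, fixed = TRUE)

# Splitting on a regular space now separates the last two words
words <- strsplit(clean, " ", fixed = TRUE)
print(words)
```

If the Â characters come from the file being read with the wrong encoding, declaring the encoding when importing the data may avoid the problem at the source, but that depends on how the data were loaded.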