Open mtwelker opened 3 years ago
Yeah, it's an issue as word processors get fancier and incorporate nuanced typesetting. There are now dozens of different stylized white spaces!
https://en.wikipedia.org/wiki/Whitespace_character
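To give a sense of what handling a few of these might look like in R, here is a minimal sketch. The helper name `normalize_spaces` and the sample titles are made up for illustration, and the four code points covered are only a small subset of the list on that page.

```r
# Hypothetical helper: map a few common Unicode spaces to a plain ASCII space.
# Covered here (an illustrative subset, not the full list):
#   U+00A0 no-break space, U+2009 thin space,
#   U+200A hair space,     U+202F narrow no-break space
normalize_spaces <- function(x) {
  gsub("[\u00a0\u2009\u200a\u202f]", " ", x)
}

# Made-up titles containing a hair space and a no-break space.
titles <- c("hair\u200aspace", "no-break\u00a0space")

# After normalizing, splitting on a regular space separates the words.
split_titles <- strsplit(normalize_spaces(titles), " ", fixed = TRUE)
print(split_titles)
```

Writing the characters as `\u` escapes avoids having to paste invisible characters into the script, which is easy to get wrong.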
Don't worry about perfection for this lab. If this were a real project for a client we would spend the time finding all of these weird cases and cleaning them up.
At this stage I want you to grasp the basics of text manipulation. As long as you have a sense of how you would approach data cleaning if you wanted to fix all of these issues, you are fine.
The spaces are the worst!
Try copying the special space in R and using it as the find-and-replace character, with a regular space as the substitute?
I feel like that won't work but I can't remember why...
Thanks for that idea -- I tried previewing the titles in the R console, then copied the offending characters from there, but it didn't make any difference. It was worth a try! I'll just go with the last word or two-word phrase, as the case may be.
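An alternative to copying the character by hand is to look at its code point directly, which may help when the console displays it as an ordinary space. This is only a sketch: the string `x` is a made-up example that assumes the odd character is a no-break space (U+00A0), which may not match the actual data.

```r
# Made-up example: a title whose last "space" is a no-break space.
x <- "last\u00a0word"

# utf8ToInt() returns the Unicode code point of every character.
code_points <- utf8ToInt(x)

# Keep only the non-ASCII code points and format them as U+XXXX.
odd <- sprintf("U+%04X", code_points[code_points > 127])
print(odd)
```

Once the code point is known, it can be written in a pattern as a `\u` escape instead of being pasted in.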
You've reached the boundary of my knowledge.
If you search for it, this is what comes up. They explain what that character is at least:
When I split the titles into words, in many cases the last two words of the title didn't split apart.
Here's my code:
And here's a screenshot of the results, with the problematic words highlighted:
I went back to those titles before I did the text pre-processing and found that those problematic spaces all occurred where I had removed this special character, Â (after removing the "hair spaces"), as follows:
Here's an example of what they looked like before pre-processing:
When I realized that, I also tried replacing the special character and the space with just a space, in each of these two ways:
But each of those appeared to have no effect. The Â was still there.
I also tried
d$title <- gsub("\\Â", "", d$title)
but it had exactly the same effect as
d$title <- gsub("Â", "", d$title)
That is to say, it removed the Â, but the two words still wouldn't split at that space. Is there a special way to remove this special character and allow the space next to it to be recognized as a regular space? Thank you!
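One possible explanation, offered as an assumption rather than a diagnosis: a UTF-8 no-break space (bytes C2 A0) that gets read as Latin-1 shows up as Â followed by a no-break space. Removing the Â alone would then leave a space that looks normal but does not match " ". The sketch below uses a made-up title to show that behavior and one way to replace both characters at once.

```r
# Made-up title: "\u00c2" is the letter Â and "\u00a0" is a no-break space.
title <- "Intro to Data\u00c2\u00a0Science"

# Removing only the Â leaves the no-break space behind,
# so the last two words still do not split apart.
only_a_removed <- gsub("\u00c2", "", title, fixed = TRUE)
unsplit <- strsplit(only_a_removed, " ", fixed = TRUE)[[1]]

# Replacing an optional Â plus the no-break space with a regular space
# lets the split work as expected.
fixed_title <- gsub("\u00c2?\u00a0", " ", title)
words <- strsplit(fixed_title, " ", fixed = TRUE)[[1]]

print(unsplit)
print(words)
```

If this is what is happening, reading the source file with the correct encoding in the first place (for example, the `fileEncoding` argument of `read.csv`) might avoid the stray Â altogether.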