DS4PS / cpp-527-spr-2021

http://ds4ps.org/cpp-527-spr-2021/

Lab 03 - removing "hair spaces" so strsplit( d$title, " " ) always works #12

Open AprilPeck opened 3 years ago

AprilPeck commented 3 years ago

In about half the titles, my strsplit() function is keeping the last two words together. It's not doing it to every title, but the majority of them.

word.list <- strsplit( tolower( d$title )," " )
word.list

returns: 
[[1]]
[1] "a"              "beginner’s"     "guide"          "to"             "word"          
[6] "embedding"      "with"           "gensim"         "word2vec model"

[[2]]
[1] "hands-on"  "graph"     "neural"    "networks"  "with"      "pytorch"   "&"         "pytorch"  
[9] "geometric"

[[3]]
[1] "how"       "to"        "use"       "ggplot2"   "in python"

[[4]]
 [1] "databricks:"    "how"            "to"             "save"           "files"         
 [6] "in"             "csv"            "on"             "your"           "local computer"

lecy commented 3 years ago

It looks like the hair space was converted back to a regular space in the HTML document.
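
One way to confirm which character is actually sitting in a title is to look at its code points rather than at how it renders. This is a minimal sketch: the string `x` is a made-up example, and it assumes the odd character is the Unicode hair space (U+200A), written here as an escape so the browser cannot alter it.

```r
# Hypothetical title with a hair space (U+200A) before the last word
x <- "how donald trump markets\u200Ahimself"

# Integer code point of every character in the string
cp <- utf8ToInt( x )

# Keep only the non-ASCII code points and print them in hex
non.ascii <- cp[ cp > 127 ]
sprintf( "U+%04X", non.ascii )
```

If a different code point shows up in the real data, that is the character to replace.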

Try this one - it should be a hair space here:

gsub( " ", "_", "markets himself" )

If that doesn't work, you can just grab it directly from the words where it appears:

> tail( word.list )
 [9] "on"                      "your website?"          
[1] "how"             "donald"          "trump"           "markets himself"
[7] "blog post"

Copy any of those spaces, put them back into the gsub statement, and try it again:

d$title <- gsub( " ", " ", d$title )
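
If copying the character keeps failing, a possible alternative is to write the hair space as a Unicode escape, which survives copy and paste because it is plain ASCII in the source. This sketch assumes the character is U+200A; `title` is a hypothetical stand-in for d$title.

```r
# Hypothetical stand-in for d$title, with hair spaces before the last word
title <- c( "how donald trump markets\u200Ahimself",
            "how to use ggplot2 in\u200Apython" )

# Replace the literal hair space with a regular space;
# fixed = TRUE treats the pattern as plain text, not a regular expression
title <- gsub( "\u200A", " ", title, fixed = TRUE )

# Every word should now be split out on its own
word.list <- strsplit( tolower( title ), " " )
word.list
```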

I'll see if I can find a more robust solution for the instructions, one that the browser does not break.

lecy commented 3 years ago

Ah, here we go. This one is a little better. It replaces all whitespace characters (including hair spaces) with regular spaces, and the browser won't break it, so you can copy and paste it directly from the instructions.

d$title <- gsub( "\\s", " ", d$title )
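
A quick way to check that the replacement worked is to test whether any hair spaces are left after the gsub() call. The sketch below uses a hypothetical `title` vector in place of d$title and assumes the character is U+200A. Whether "\\s" matches that character may depend on the platform and regex engine, so the example falls back to an explicit replacement if anything remains.

```r
# Hypothetical stand-in for d$title
title <- c( "markets\u200Ahimself", "blog\u200Apost", "plain title" )

# Replace whitespace characters with a regular space
clean <- gsub( "\\s", " ", title )

# Check whether any hair spaces survived the replacement
left.over <- grepl( "\u200A", clean, fixed = TRUE )

# Fallback: replace the hair space explicitly if "\\s" did not catch it
if( any( left.over ) ) {
  clean <- gsub( "\u200A", " ", clean, fixed = TRUE )
}

# Number of words per title after splitting on a regular space
lengths( strsplit( clean, " " ) )
```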