Watts-College / cpp-527-fall-2021

A course shell for CPP 527 Foundations of Data Science II
https://watts-college.github.io/cpp-527-fall-2021/

Week3- String Processing #13

Open AhmedRashwanASU opened 2 years ago

AhmedRashwanASU commented 2 years ago
x <- gsub( ",", "", x )
x <- gsub( "we’re", "we are", x )
x <- gsub( "\\:", "", x )
x <- gsub( "\\?", "", x )
x

1- When I apply the chunk above, line 2 is supposed to transform the selected text: x <- gsub( "we’re", "we are", x )

However, the text for the selected word came out as below.

"when weâ€™re apart"

grep( "^$", x )

x <- x[ - grep( "^$", x ) ]

2- The chunk above is supposed to remove the blank lines. However, after running it, x shows as an empty character vector, and grep() returns integer(0).

Any idea what's going on here?

lecy commented 2 years ago

What does X represent here?

AhmedRashwanASU commented 2 years ago

x <- Dear John: I want a man who knows what love is all about. You are generous, kind, thoughtful. People who are not like you admit to being useless and inferior. You have ruined me for other men. I yearn for you. I have no feelings whatsoever when we’re apart. I can be forever happy. Will you let me be yours? Gloria

lecy commented 2 years ago

Ok, it's helpful to share your current data with the example. An easy way is to use dput():

dput( x )
c("Dear John:", "", "I want a man who knows what love is all about. You are generous, kind, thoughtful. People who are not like you admit to being useless and inferior. You have ruined me for other men. I yearn for you. I have no feelings whatsoever when we’re apart. I can be forever happy.", 
"", "Will you let me be yours?", "", "Gloria")

This prints the vector in a format someone can copy and paste directly into R to reproduce the same data.

Here is the data in a reproducible format so you can see how it is read into R from a text file.

# x <- readLines( "https://raw.githubusercontent.com/Nonprofit-Open-Data-Collective/machine_learning_mission_codes/master/docs/tutorials/assets/dear_john_letter_1.txt", warn=FALSE )

x <- 
c("Dear John:", "", "I want a man who knows what love is all about. You are generous, kind, thoughtful. People who are not like you admit to being useless and inferior. You have ruined me for other men. I yearn for you. I have no feelings whatsoever when we’re apart. I can be forever happy.", 
"", "Will you let me be yours?", "", "Gloria")
x
[1] "Dear John:"                                                                                                                                                                                                                                                                    
[2] ""                                                                                                                                                                                                                                                                              
[3] "I want a man who knows what love is all about. You are generous, kind, thoughtful. People who are not like you admit to being useless and inferior. You have ruined me for other men. I yearn for you. I have no feelings whatsoever when we’re apart. I can be forever happy."
[4] ""                                                                                                                                                                                                                                                                              
[5] "Will you let me be yours?"                                                                                                                                                                                                                                                     
[6] ""                                                                                                                                                                                                                                                                              
[7] "Gloria"  

You can see that converting from a text file to a character vector resulted in every other element being empty. The regular expression "^$" matches the start of a string (^) followed immediately by its end ($), so it identifies only empty values (two quote marks "", no spaces, no other characters).

grep( "^$", x )
[1] 2 4 6
x[ - grep( "^$", x ) ]
[1] "Dear John:"                                                                                                                                                                                                                                                                    
[2] "I want a man who knows what love is all about. You are generous, kind, thoughtful. People who are not like you admit to being useless and inferior. You have ruined me for other men. I yearn for you. I have no feelings whatsoever when we’re apart. I can be forever happy."
[3] "Will you let me be yours?"                                                                                                                                                                                                                                                     
[4] "Gloria" 

Since grep() returns vector positions (unless you add the argument value=TRUE), the minus sign drops the elements at those positions from the vector:

x <- x[ - grep( "^$", x ) ]
# equivalent to
x <- x[ - c(2,4,6) ] # drop lines 2, 4, 6
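
One caveat that may explain the empty result in the question: if grep() finds no matches it returns integer(0), and a negated empty index drops every element instead of none. The sketch below assumes x was stored as one long string (as the pasted text above suggests), and the variable names are only for illustration. Logical indexing with grepl() avoids the problem.

```r
# One long string with no empty elements (assumed shape of x in the question)
x <- "Dear John: I want a man who knows what love is all about."

idx <- grep( "^$", x )   # integer(0): no empty elements found
dropped <- x[ -idx ]     # character(0): a negated empty index removes everything

# Logical indexing leaves x unchanged when nothing matches
kept <- x[ !grepl( "^$", x ) ]

# The same approach on a vector that does contain empty elements
letter <- c( "Dear John:", "", "Will you let me be yours?", "", "Gloria" )
no_blanks <- letter[ !grepl( "^$", letter ) ]
no_blanks
```

For this particular pattern, letter[ nzchar( letter ) ] gives the same result, since nzchar() is TRUE for non-empty strings.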

lecy commented 2 years ago

The first issue is a little more subtle. Modern word processors and browsers have started replacing regular quote marks with stylized open and closed quote marks for visual design purposes.


MS Word and some browsers will do this in the background automatically without asking. When the text is copied and pasted or exported, then, it will carry these new characters forward.

For example:

x <- c( "we’re", "we're" )
gsub( "we’re", "we are", x )
[1] "we are" "we're" 
gsub( "we're", "we are", x )
[1] "we’re"  "we are"
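
If you are not sure which apostrophe your text contains, one option is to match both with a character class. This is a minimal sketch; the \u2019 escape is just a way of typing the curly apostrophe so the script does not depend on how the file itself was saved.

```r
# Curly apostrophe (U+2019) and straight apostrophe look alike but have different codes
x <- c( "we\u2019re", "we're" )

curly_code    <- utf8ToInt( "\u2019" )   # 8217
straight_code <- utf8ToInt( "'" )        # 39

# A character class matches either apostrophe in one pass
both_fixed <- gsub( "we[\u2019']re", "we are", x )
both_fixed
```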

Spaces are actually one of the biggest issues - word processors and browsers have started using narrow spaces, wide spaces, and non-breaking spaces in addition to the regular space. They are different characters but look almost identical to the naked eye, so a pattern written with a regular space will silently miss them.

As a consequence, when you are loading text data you need to be conscious of whether these special characters have been introduced into your data.
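
As a rough illustration, the strings below look the same when printed but differ in the space character. The list of space characters in the pattern is an assumption about what might show up (non-breaking, en, em, thin, and narrow non-breaking spaces), not a complete inventory.

```r
# Three strings that print alike but use different space characters
x <- c( "forever\u00a0happy",   # non-breaking space
        "forever\u2009happy",   # thin space
        "forever happy" )       # regular space

is_regular <- x == "forever happy"

# Code of the 8th character (the space) in each string
space_codes <- vapply( x, function( s ) utf8ToInt( s )[ 8 ], integer( 1 ), USE.NAMES = FALSE )

# Replace the special spaces with a regular space
clean <- gsub( "[\u00a0\u2002\u2003\u2009\u202f]", " ", x )
clean
```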

Each character is referenced by a separate numeric code in the ASCII table, a standard lookup table that computers use to map numbers to text. ASCII stands for the American Standard Code for Information Interchange.

Each character appears in the table under several codes because different systems use different representations of the same number. The computer uses the BIN column (the 8-bit, or one-byte, binary version) to store a character in memory as 1's and 0's. HTML browsers use the "HTML number":

image
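
You can look up these representations from R itself. The sketch below uses base R functions only; the variable names are just for illustration.

```r
ch <- "A"

code <- utf8ToInt( ch )                  # decimal code
hex  <- sprintf( "%02X", code )          # hexadecimal version
html <- sprintf( "&#%d;", code )         # HTML number

# intToBits() returns 32 bits, least significant first, so keep 8 and reverse
bits <- paste( rev( as.integer( intToBits( code ) )[ 1:8 ] ), collapse = "" )

back <- intToUtf8( code )                # from the code back to the character

c( decimal = code, hex = hex, binary = bits, html = html )
```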

One byte can store 256 unique values, so an 8-bit table has room for 256 characters. Strictly speaking, ASCII itself defines only the first 128, and those are universal - all computers use the same symbols.

The second half of the table is called the Extended ASCII set. There are multiple versions for the second half of the table, depending upon which language you are trying to encode.

You need to know which version of the extended ASCII table you are working with; otherwise you won't know which character a specific code represents. Similar to how R loads packages at the beginning of scripts, raw HTML files list a bunch of libraries or settings so that the browser can load the proper assets to display the pages:

<!DOCTYPE html>
<html lang="en">
<meta charset="utf-8">

Whenever you see text like "â€™", it means that there was an extended character in your text file and the program you are currently working with either doesn't support it or did not know which version of the table you were using, so it shows the raw encoding of the character instead of the character itself.

Here the stylized quote mark was replaced with the code for that character, "â€™".

"when weâ€™re apart"
â€™  =  ’
"when we’re apart"

I'm not entirely sure when this happens, but if you search for the code "â€™" you will find a page like this:

https://www.i18nqa.com/debug/utf8-debug.html

image

Which can help you work backwards to figure out what the character was supposed to be.
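
The garbling can be reproduced, and sometimes reversed, in R. This is a sketch under the assumption that the text was UTF-8 that got read as Windows-1252 (CP1252), which is the case the debug page above covers. Encoding names accepted by iconv() vary by platform, and some byte values have no CP1252 character, so the reversal will not work for every garbled sequence.

```r
curly <- "\u2019"                # the stylized apostrophe
bytes <- charToRaw( curly )      # its three UTF-8 bytes: e2 80 99

# Reading those bytes as CP1252 gives one character per byte
garbled <- iconv( rawToChar( bytes ), from = "CP1252", to = "UTF-8" )
garbled

# Working backwards: recover the CP1252 bytes, then declare them as UTF-8
restored_bytes <- iconv( garbled, from = "UTF-8", to = "CP1252", toRaw = TRUE )[[ 1 ]]
restored <- rawToChar( restored_bytes )
Encoding( restored ) <- "UTF-8"
identical( restored, curly )
```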

Typically these issues are reconciled at the data load and data cleaning step in text analysis. Sometimes you can simply delete anything from the extended ASCII table if it is not pertinent to your analysis. Other times you would basically search for all of these weird codes and replace them with the intended text.
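
Both approaches can be sketched in base R. The example strings and the from/to lookup vectors below are made up for illustration; in practice you would build the lookup from whatever characters actually appear in your data.

```r
x <- c( "we\u2019re apart", "caf\u00e9", "plain text" )

# Option 1: drop anything outside the basic ASCII range
ascii_only <- iconv( x, from = "UTF-8", to = "ASCII", sub = "" )
ascii_only

# Option 2: replace known special characters with the intended plain ones
from <- c( "\u2019", "\u2018", "\u201c", "\u201d", "\u00a0" )
to   <- c( "'",      "'",      "\"",     "\"",     " " )

clean <- x
for ( i in seq_along( from ) ) {
  # fixed = TRUE treats each entry as literal text, not a regular expression
  clean <- gsub( from[ i ], to[ i ], clean, fixed = TRUE )
}
clean
```

Option 1 is simpler but loses information (the accent in the second string disappears along with its letter), so Option 2 is usually the safer choice when the characters matter.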

There are, of course, R packages for all of this.

Probably more than you wanted to know! But does it make sense?

AhmedRashwanASU commented 2 years ago

Thanks, Prof, for this. Yes, it makes more sense now.