Watts-College / cpp-527-fall-2021

A course shell for CPP 527 Foundations of Data Science II
https://watts-college.github.io/cpp-527-fall-2021/

I'm confused about "Regex in the wild" from the week 3 slides #19

Open mtwelker opened 3 years ago

mtwelker commented 3 years ago

In the slides for this week, there is a section titled "Regex in the wild: advanced examples for inspiration." I'm having a hard time making sense of these. I tried going to the link provided, but it says "Page Not Found." For example, it has this slide: [image: slide] But I'm not sure how to interpret it or how we would use it in combination with grep(). Here are my questions:

lecy commented 3 years ago

The delimiters are described in the lecture notes as regular expression operators. I use the term "anchors" for ^ and $ because they anchor text to the beginning or end of a string.

The term string is more precise than line (or sentence or word), since one string can contain multiple words or sentences.

There are other regular expression anchors that can match letter combinations at the end of words (before a space, period, or line break). See the cheat sheet.

grep( "why", title )  # any title with why in it
grep( "^why", title ) # only titles that start with why 
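
A small sketch of the other anchors, in case it helps. The titles below are made up for illustration, and `\\b` is the word-boundary operator described in R's regex help page (the backslash is doubled inside an R string):

```r
# hypothetical titles, invented for this example
title <- c( "why regex is hard", "tell me why", "whyever not" )

starts.why <- grep( "^why", title, value=TRUE )       # starts with why
ends.why   <- grep( "why$", title, value=TRUE )       # ends with why
word.why   <- grep( "\\bwhy\\b", title, value=TRUE )  # why as a whole word

starts.why
# [1] "why regex is hard" "whyever not"
ends.why
# [1] "tell me why"
word.why
# [1] "why regex is hard" "tell me why"
```

Note that "whyever not" drops out of the last result because the why there is followed by another letter, so there is no word boundary after it.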

When you "slugify" text you convert it to a stable HTML filename - no spaces or special characters:

<title> The Hangover Part 3 </title>
<content> A silly comedy movie </content>
<slug> the-hangover-part-3 </slug>
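
If you want to see how a slug could be produced in R, here is a minimal sketch. The function name `slugify` is my own (hypothetical, not from a package), and it assumes plain ASCII text; real slug generators handle more cases, such as accented characters:

```r
# hypothetical helper, written only for illustration
slugify <- function( x ) {
  x <- tolower( x )                   # lowercase everything
  x <- gsub( "[^a-z0-9]+", "-", x )   # each run of other characters becomes one dash
  x <- gsub( "^-|-$", "", x )         # drop a leading or trailing dash
  x
}

slug <- slugify( " The Hangover Part 3 " )
slug
# [1] "the-hangover-part-3"
```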

R does not use the delimiters around the pattern (I'm not sure what environment this example came from, but regular expression syntax varies slightly between environments).

Here is the expression identifying the slug in the vector:

x <- c("The Hangover Part 3","the-hangover-part-3") 
grep( "^[a-z0-9-]+$", x, value=TRUE )
[1] "the-hangover-part-3"

mtwelker commented 3 years ago

Thank you, that's very helpful!

Can you explain to me how grep( "^[a-z0-9-]+$", x, value=TRUE ) is different from grep( "[a-z0-9-]+", x, value=TRUE ). Do the "anchors" (^ and $) in the first expression ask the expression to return strings that contain only those elements (letters, numbers, and dashes)? And the second expression returns strings that contain any of those elements? Based on the code below, I think that's the case, but let me know if I'm missing something.

> x <- c("The Hangover Part 3","the-hangover-part-3", "The Hangover Part #", "The-hangover-part-&", "!@# $%^")
> grep( "^[a-z0-9-]+$", x, value=TRUE )
[1] "the-hangover-part-3"
> grep( "[a-z0-9-]+", x, value=TRUE )
[1] "The Hangover Part 3" "the-hangover-part-3" "The Hangover Part #" "The-hangover-part-&"

Also, what is the purpose of the + in these expressions? I know it means "one or more", but it seems to be necessary only if you use the anchors. Why is that?

> grep( "^[a-z0-9-]$", x, value=TRUE )
character(0)
> grep( "[a-z0-9-]", x, value=TRUE )
[1] "The Hangover Part 3" "the-hangover-part-3" "The Hangover Part #" "The-hangover-part-&"

Maybe I'm overthinking this, but it seems like there must be a strict underlying logic, and I'm trying to figure out what that is. Thank you!

lecy commented 3 years ago

I think the use case here would be extracting slugs from URLs.

So it is more like:

x <- c( "https://www.imdb.com/title/the-hangover-part-3", 
        "the-hangover-part-3", 
        "https://www.imdb.com/title/the-hangover-part-3/cast" )

grep( "[a-z0-9-]+", x, value=TRUE )
[1] "https://www.imdb.com/title/the-hangover-part-3"     
[2] "the-hangover-part-3"                                
[3] "https://www.imdb.com/title/the-hangover-part-3/cast"

grep( "/[a-z0-9-]+/", x, value=TRUE )
[1] "https://www.imdb.com/title/the-hangover-part-3"     
[2] "https://www.imdb.com/title/the-hangover-part-3/cast"

grep( "^[a-z0-9-]+$", x, value=TRUE )
[1] "the-hangover-part-3"
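
It can help to look at which piece of each string the pattern actually matched, since grep() only reports whether a match exists somewhere. A rough sketch using regmatches() and regexpr() from base R, which return the first match in each string (strings with no match are dropped):

```r
x <- c( "https://www.imdb.com/title/the-hangover-part-3",
        "the-hangover-part-3",
        "https://www.imdb.com/title/the-hangover-part-3/cast" )

# unanchored: the match is just the first run of allowed characters
first.run <- regmatches( x, regexpr( "[a-z0-9-]+", x ) )
first.run
# [1] "https"               "the-hangover-part-3" "https"

# with slashes: the match is "/title/", not the slug itself
between.slashes <- regmatches( x, regexpr( "/[a-z0-9-]+/", x ) )
between.slashes
# [1] "/title/" "/title/"

# anchored without the +: exactly one allowed character, start to end
one.char <- grepl( "^[a-z0-9-]$", c( "a", "the-hangover-part-3" ) )
one.char
# [1]  TRUE FALSE
```

So without anchors a single matching character anywhere is enough, which is why the + makes no visible difference to grep() there. With anchors the whole string has to be accounted for, and the + is what lets the pattern cover more than one character.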

Part of what your pattern is doing is excluding case variations, since [a-z] only covers lowercase letters:

x <- c( "The Hangover Part 3",
        "the-hangover-part-3",  # slugified
        "The Hangover Part #", 
        "The-Hangover-Part-3",  # contains capitals
        "the-hangover-part-&" ) # contains special chars

# regex pseudocode: [a-z0-9-] matches one character that is:
# a lowercase letter a-z
# a digit 0-9
# a dash -

grep( "^[a-z0-9-]+$", x, value=TRUE )
[1] "the-hangover-part-3"

# THESE TWO ARE EQUIVALENT: 

grep( "^[a-z0-9-]+$", x, value=TRUE, ignore.case=TRUE )
[1] "the-hangover-part-3" "The-Hangover-Part-3"

# regex pseudocode: [a-zA-Z0-9-]
# add uppercase A-Z

grep( "^[a-zA-Z0-9-]+$", x, value=TRUE )
[1] "the-hangover-part-3" "The-Hangover-Part-3"

You would need a different expression to search for URLs that contain slugs, but I'm not exactly sure how to construct it. I would have to experiment a bit.
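
One possible starting point, not a tested general solution: this sketch assumes the slug is always the path segment right after /title/ (true for the made-up URLs above, but an assumption about real sites). grepl() finds the URLs that have such a segment, and sub() with a capture group pulls the slug out:

```r
x <- c( "https://www.imdb.com/title/the-hangover-part-3",
        "the-hangover-part-3",
        "https://www.imdb.com/title/the-hangover-part-3/cast" )

# TRUE when "/title/" is followed by a slug that ends at "/" or at the end of the string
has.slug <- grepl( "/title/[a-z0-9-]+(/|$)", x )
has.slug
# [1]  TRUE FALSE  TRUE

# keep only the captured slug, i.e. the part inside the first parentheses
slug <- sub( ".*/title/([a-z0-9-]+)(/.*)?$", "\\1", x[ has.slug ] )
slug
# [1] "the-hangover-part-3" "the-hangover-part-3"
```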

mtwelker commented 3 years ago

Thanks -- today's review session helped clear this up.

lecy commented 3 years ago

Regular expressions are not easy!

I used to have a paper exam for students and I would ask them questions like, "Which strings would be returned here?"

x <- c( "The Hangover Part 3",
        "the-hangover-part-3", 
        "The Hangover Part #", 
        "The-Hangover-Part-3",  
        "the-hangover-part-&" ) 

grep( "^[a-z0-9-]+$", x, value=TRUE, ignore.case=FALSE )

And without running the code they had to determine the correct answer. That's the real test of whether you understand regular expressions.
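
Once you have worked out an answer by hand, one way to check it is to use grepl(), which returns TRUE or FALSE for every string rather than only the matches. A small sketch with the same vector:

```r
x <- c( "The Hangover Part 3",
        "the-hangover-part-3",
        "The Hangover Part #",
        "The-Hangover-Part-3",
        "the-hangover-part-&" )

# one TRUE/FALSE per string: is the whole string lowercase letters, digits, and dashes?
is.slug <- grepl( "^[a-z0-9-]+$", x, ignore.case=FALSE )
data.frame( string=x, is.slug=is.slug )
```

Only the second string comes back TRUE. The others contain capitals, spaces, or a special character, and with the anchors in place a single disallowed character is enough to fail the whole string.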

So be happy we don't have an exam in this iteration :-)