STAT545-UBC / Discussion

Question about regular expression and loop #535

Open chenchenguo opened 5 years ago

chenchenguo commented 5 years ago

Hi, I ran into a problem when trying to use a loop with regular expressions. What if I want to search for each letter in a specified list, such as "a", "b", "c", "d", ..., "z", one after another? Right now I am doing this by writing 26 separate regular expressions, which I know is clumsy. How can I do it with a single loop or some other approach? Thanks in advance.

ChadFibke commented 5 years ago

Hey @chenchenguo,

Can you provide us with a bit more context (what is the input, output, and what did you want to accomplish)?

chenchenguo commented 5 years ago

Thanks @ChadFibke

The data is all the words filtered from words.txt that start and end with the same letter, like "bob" and "kick". Now I want to count how many such words there are for each letter (from "a" to "z"). For example, the number of words starting and ending with "a" might be 20, and with "b" maybe 50. Right now my implementation writes this for each letter: a <- str_subset(data, "^a"); a_number <- length(a). And I repeat it 26 times. Is there a loop method to figure this out? Thanks a lot.

zeeva85 commented 5 years ago

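library(stringr) # str_count()
library(tibble)  # tibble()

words <- readLines("words.txt")

# Build one regex per letter: the word starts and ends with that letter
output <- vector("character", length(letters))
for (i in letters) {
  output[match(i, letters)] <- paste0("^", i, ".*", i, "$")
}

# Make a tibble; start_letter holds placeholder values that the loop overwrites
df <- tibble(letters, start_letter = seq_along(letters))

# Frequency table: count the words matching each regex
for (i in output) {
  df[match(i, output), 2] <- sum(str_count(words, pattern = i))
}

I think this should work.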
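
For comparison, here is a minimal sketch of the same counting idea without explicit for loops. It assumes a small made-up word list (sample_words is hypothetical and stands in for words.txt) and uses only base R's paste0(), grepl() and vapply(), so it runs without stringr or tibble.

```r
# Hypothetical sample data standing in for words.txt
sample_words <- c("bob", "kick", "area", "alpha", "dad", "dog", "bulb")

# One regex per letter: the word starts and ends with that letter
patterns <- paste0("^", letters, ".*", letters, "$")

# For each pattern, count how many words match it
counts <- vapply(patterns, function(p) sum(grepl(p, sample_words)), integer(1))
names(counts) <- letters

# Show only the letters that have at least one matching word
counts[counts > 0]
```

paste0() is vectorised over letters, so the 26 patterns are built in one call, and vapply() plays the role of the second loop.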

ChadFibke commented 5 years ago

Ah I found something as well:

count_all_hits <- function(a_charater_vector, pattern_list) {

  require(purrr)

  # Let's make a list for our results
  results <- list()

  for (match in pattern_list) {
    # Keep the words that start and end with the current letter
    results[[sprintf("Matches for %s", match)]] <-
      a_charater_vector[grepl(sprintf("^%s.*%s$", match, match), a_charater_vector)]
  }

  # Return the number of matching words for each letter
  return(map(results, length))
}

count_all_hits(a_charater_vector = wordss, pattern_list = letters)

ChadFibke commented 5 years ago

sprintf() is definitely a function to look into. sprintf() lets you insert the values of variables into a character string. The call sprintf("Matches for %s", match) places the character value of the match object into the string. The %s is a placeholder that is replaced by the character value found in match.
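
A small sketch of how sprintf() fills in %s, using a hypothetical variable named letter rather than the loop variable from the function above:

```r
letter <- "b"

# %s is replaced by the string value of the argument that follows
label <- sprintf("Matches for %s", letter)

# Arguments fill the %s placeholders in order, so the letter can be used twice
pattern <- sprintf("^%s.*%s$", letter, letter)

label
pattern

# The resulting pattern matches words that start and end with "b"
hits <- grepl(pattern, c("bob", "kick", "bulb", "bat"))
hits
```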

ChadFibke commented 5 years ago

Also, I converted all the strings to lowercase using:

wordss <- str_to_lower(readLines("./words.txt"))

If you do not want the counts and actually want to see the words, replace:

return(map(results, length))

# with
return(results)
# which will give you a list with all the found words.

bassamjaved commented 5 years ago

Here's another possibility...

There's an exercise from Hadley's R for Data Science in the strings chapter that can be adapted for this.

You could create a string to the effect "^a|^b|^c" and continue all the way to the letter 'z'. Let's call that string letter_match, which we'll use to match up with regex. Then,

# find and extract matches
matches <- str_extract(words, letter_match)

# create a frequency table
Letters <- table(matches)
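
The letter_match string does not have to be typed out by hand. Below is a minimal sketch, assuming a made-up word list (sample_words is hypothetical) and using base R analogues of the stringr calls (paste0() for str_c(), regmatches() with regexpr() for str_extract()). One difference to keep in mind: regmatches() drops words with no match, whereas str_extract() returns NA for them.

```r
# Hypothetical sample data standing in for the words vector
sample_words <- c("apple", "bob", "banana", "cat", "zebra", "area")

# Builds "^a|^b|...|^z": paste0() prefixes each letter, collapse joins with "|"
letter_match <- paste0("^", letters, collapse = "|")

# Extract the matched first letter of each word
matches <- regmatches(sample_words, regexpr(letter_match, sample_words))

# Frequency table of first letters
first_letter_counts <- table(matches)
first_letter_counts
```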

ChadFibke commented 5 years ago

@bassamjaved,

Are you able to use that to find words that start with and end with a, b, c....z?

bassamjaved commented 5 years ago

@ChadFibke

I just tried it by replacing letter_match with "a$|b$|c$" all the way to z. I checked the first few entries of words and it seems to work.

Of course, as you said, you should use str_to_lower() to make all the words lowercase (more so if you're trying to find words beginning with each letter).

bassamjaved commented 5 years ago

Ah, but I see you want words that start and end with the same letter. Okay, no, I haven't tried that with this particular method...

chenchenguo commented 5 years ago

@zeeva85 Thanks, your function is so concise and useful. In df[match(i, output), 2], what does the 2 mean? The start letter row?

chenchenguo commented 5 years ago

Thanks @ChadFibke I will try your suggestion

zeeva85 commented 5 years ago

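Correct, it refers to start_letter, though 2 is the column index rather than the row. The loop sums the values and then replaces the placeholder 1:26 in the 2nd column ("start_letter").

I think df[match(i, output), "start_letter"] should also work. It's more explicit and probably better, since it prevents errors.

df[row, column]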
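
A minimal sketch of the df[row, column] indexing, assuming a hypothetical three-row table. A base data.frame stands in for the tibble here so the example runs without extra packages; the same bracket indexing applies to the tibble in the loop.

```r
# Hypothetical small table; start_letter starts out as placeholder values 1:3
df <- data.frame(letters = c("a", "b", "c"), start_letter = 1:3)

# Row 2, column 2 (column selected by position)
df[2, 2] <- 50L

# Row 3, same column selected by name, which is more explicit
df[3, "start_letter"] <- 7L

df
```

Selecting the column by name keeps working if the column order changes later, which is the kind of error the positional index is exposed to.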

chenchenguo commented 5 years ago

Yeah, the part about starting and ending with the same letter is already done. I will try the str_extract function here, thank you.

chenchenguo commented 5 years ago

Nice, yeah, I forgot to switch them to lower case. Thanks for noticing.

bassamjaved commented 5 years ago

Here's a revision of the method I posted earlier:

# create a regular expression pattern that begins with a letter and ends with the same letter
(letters_for_regex <- str_c("(", "^", letters, ".+", letters, "$", ")"))

# collapse into one string
(letter_match <- str_c(letters_for_regex, collapse = "|"))

# find and subset matches
(words_with_matches <- str_subset(words_lowercase, letter_match))

# extract letters in matches
(letters_in_matches <- str_extract(words_with_matches, "^."))

# create a frequency table
(Letters <- table(letters_in_matches))
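
To see what these steps produce, here is a minimal sketch on a made-up word list (sample_words is hypothetical). It uses base R analogues so it runs without stringr: paste0() for str_c(), grep(value = TRUE) for str_subset(), and substr() for str_extract(x, "^.").

```r
# Hypothetical sample data standing in for words_lowercase
sample_words <- c("bob", "kick", "aa", "area", "dog", "a")

# One alternation covering all 26 letters: "(^a.+a$)|(^b.+b$)|..."
letter_match <- paste0("(^", letters, ".+", letters, "$)", collapse = "|")

# Keep only the words that start and end with the same letter
words_with_matches <- grep(letter_match, sample_words, value = TRUE)

# First letter of each matching word, then a frequency table
letters_in_matches <- substr(words_with_matches, 1, 1)
letter_counts <- table(letters_in_matches)
letter_counts
```

Note that ".+" requires at least one character between the two letters, so "aa" and the single-letter word "a" are not counted here. A pattern built with ".*", as in the earlier answers, would count "aa" as well.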

ChadFibke commented 5 years ago

Well @chenchenguo has multiple answers to choose from now!