Open chenchenguo opened 5 years ago
Hey @chenchenguo,
Can you provide us with a bit more context (what is the input, output, and what did you want to accomplish)?
Thanks @ChadFibke
The data is all the words filtered from words.txt that have the same starting and ending letter, like "bob" and "kick". Now I want to count how many such words there are for each letter (from "a" to "z"). For example, the number of words starting and ending with "a" might be 20, and with "b" maybe 50. Right now my implementation writes this out for each letter: a <- str_subset(data, "^a"); a_number <- length(a). And I repeated it 26 times. Is there a loop method to figure this out? Thanks a lot.
library(stringr) # str_count()
library(tibble)  # tibble()
words <- readLines("words.txt")
output <- vector("character", length(letters))
for (i in letters) {
  output[match(i, letters)] <- paste0("^", i, ".*", i, "$") # regex: starts and ends with letter i
}
This gives the regex patterns.
df <- tibble(letters, start_letter = seq_along(letters)) # make tibble; start_letter holds placeholders 1:26
for (i in output) {
  df[match(i, output), 2] <- sum(str_count(words, pattern = i)) # overwrite the placeholder with the count
}
This gives the frequency table.
I think this should work.
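For what it's worth, the same count can probably be done without the two explicit loops. Below is a minimal base R sketch of that idea; sample_words is a made-up vector standing in for readLines("words.txt"), and grepl() is swapped in for str_count() so no packages are needed.

```r
# Hypothetical sample standing in for readLines("words.txt")
sample_words <- c("bob", "kick", "area", "alpha", "apple", "dad", "level")

# One pattern per letter: starts and ends with that letter
patterns <- paste0("^", letters, ".*", letters, "$")

# Count the words matching each pattern
counts <- vapply(patterns, function(p) sum(grepl(p, sample_words)), integer(1))
names(counts) <- letters # label each count with its letter

# Show only the letters that had at least one match
print(counts[counts > 0])
```

paste0() is vectorised over letters, so all 26 patterns are built in one call, and vapply() returns a named integer vector instead of a tibble.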
Ah I found something as well:
count_all_hits<-function(a_charater_vector, pattern_list){
require(purrr)
# Lets make a list for our results
results <- list()
for ( match in pattern_list) {
results[[sprintf("Matches for %s",match)]] <- a_charater_vector[grepl(sprintf("^%s.*%s$", match, match), a_charater_vector)]
}
return(map(results, length))
}
count_all_hits(a_charater_vector = wordss, pattern_list = letters)
sprintf() is definitely a function to look into. It lets you insert the value of a variable into a character string. For example, sprintf("Matches for %s", match) places the character value of the match object into the string: the %s is a placeholder that is filled with the string found in match.
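A small sketch of how sprintf() fills in %s, using only the first three letters as a made-up input (first_three, labels and patterns are illustrative names, not part of the function above):

```r
# Hypothetical input: just the first three letters
first_three <- letters[1:3]

# sprintf() is vectorised, so each %s is filled once per element
labels   <- sprintf("Matches for %s", first_three)
patterns <- sprintf("^%s.*%s$", first_three, first_three)

print(labels)
print(patterns)
```

Each %s consumes one argument in order, which is why the letter is passed twice to build the start-and-end pattern.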
Also, I converted all the strings to lowercase using:
wordss<-str_to_lower(readLines("./words.txt"))
If you do not want the counts and actually want to see the matching words, replace:
return(map(results, length))
# with
return(results)
# which will give you a list with all the found words.
Here's another possibility...
There's an exercise from Hadley's R for Data Science in the strings chapter that can be adapted for this.
You could create a string to the effect "^a|^b|^c" and continue all the way to the letter 'z'. Let's call that string letter_match, which we'll use to match up with regex. Then,
matches <- str_extract(words, letter_match)
Letters <- table(matches)
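The "^a|^b|^c" string does not have to be typed out by hand. Here is a rough base R sketch, assuming a made-up sample_words vector; regmatches() with regexpr() is used as a stand-in for str_extract(), with the difference that it drops non-matching words instead of returning NA for them.

```r
# Hypothetical sample standing in for the lowercased words
sample_words <- c("apple", "banana", "avocado", "cherry", "blueberry")

# Builds "^a|^b|...|^z" without typing 26 alternatives
letter_match <- paste0("^", letters, collapse = "|")

# Pull out the matched first letter of each word
matches <- regmatches(sample_words, regexpr(letter_match, sample_words))

# Frequency of each starting letter
Letters <- table(matches)
print(Letters)
```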
@bassamjaved,
Are you able to use that to find words that start with and end with a, b, c....z?
@ChadFibke
I just tried it by replacing letter_match with "a$|b$|c$" all the way to z. I checked the first few entries of words and it seems to work.
Of course, like you said, you should use str_to_lower() to make words all lowercase (more so if you're trying to find words beginning with each letter).
Ah but I see you want start and end with the same letter. Okay, no I haven't tried that with this particular method...
@zeeva85 Thanks, your function is so concise and useful. In df[match(i, output), 2], what does the 2 mean? The start letter row?
Thanks @ChadFibke I will try your suggestion
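Yes, the 2 is the column index: the loop sums the matches for each pattern and overwrites the placeholder values 1:26 in the 2nd column ("start_letter").
I think df[match(i, output), "start_letter"] should also work. It's more explicit and probably better, since it helps prevent indexing errors.
The general form is df[row, column].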
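To make the df[row, column] form concrete, here is a tiny sketch with a made-up three-row data frame (a base data.frame rather than a tibble, to keep it package-free):

```r
# Hypothetical three-row table with placeholder values in start_letter
df <- data.frame(letters = c("a", "b", "c"), start_letter = 1:3)

df[2, 2] <- 50L             # row 2, column 2, selected by position
df[3, "start_letter"] <- 7L # row 3, same column, selected by name

print(df)
```

Both assignments touch the same column; the name-based form keeps working even if the column order changes.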
Yeah, the start-and-end-with-the-same-letter part is done. I will try the str_extract function here, thank you.
Nice, yeah I forgot to switch them to lowercase, thanks for pointing that out.
Here's a revision of the method I posted earlier:
(letters_for_regex <- str_c("(", "^", letters, ".+", letters, "$", ")"))
(letter_match <- str_c(letters_for_regex, collapse = "|"))
(words_with_matches <- str_subset(words_lowercase, letter_match))
(letters_in_matches <- str_extract(words_with_matches, "^."))
(Letters <- table(letters_in_matches))
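As a possible shortcut, a regex backreference might replace the 26-way alternation: capture the first character and require the same character at the end. This is a different technique from the str_c() approach above, sketched in base R with a made-up sample_words vector. It keeps the ".+" from the revision, so two-letter words like "aa" are not counted.

```r
# Hypothetical sample standing in for words_lowercase
sample_words <- c("bob", "kick", "area", "apple", "dad", "level", "aa")

# (.) captures the first character; \\1 requires that same character at the end
same_ends <- grepl("^(.).+\\1$", sample_words)

# Tabulate the first letter of every word that matched
first_letters <- substr(sample_words[same_ends], 1, 1)
print(table(first_letters))
```

Note that "." matches any character, not just a to z, so this assumes the words contain only lowercase letters.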
Well @chenchenguo has multiple answers to choose from now!
Hi, I ran into a problem when trying to implement a loop over regular expressions. What if I want to search for a specified list of letters, like "a", "b", "c", "d", ..., "z", sequentially? Right now I am doing this by writing 26 regular expressions, which I know is stupid, but how can I do it in just one loop or some other way? Thanks in advance.