JiaxiangBU / tutoring2

The collection of Python and R code scripts to tutor others.
https://jiaxiangbu.github.io/tutoring2/

Fuzzy text matching #46

Closed slsongge closed 4 years ago

slsongge commented 4 years ago

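Problem

I have a task in R that I can only write with two nested for loops, and they still do not fully do what I need. So far my loops only return the result shown in the image below: image

The requirement is roughly this: I have two data frames, df_1 and df_2. Each has a single column, named A and B respectively. I need to add a column B to df_1, filled as follows: for each value in df_2$B, check by fuzzy (substring) matching whether it occurs in a value of column A. If it does, fill the new B in df_1 with the matching value from df_2$B; otherwise fill it with NA.

Original data and desired result

image

Code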

library(tidyverse)

df_1 <- data.frame(A = c('asd','adf','afg','agh','ahj'))
df_2 <- data.frame(B = c('as','af','aj'))

df_1_A <- df_1$A
df_2_B <- df_2$B

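c_A <- c()
c_B <- c()
# Outer loop: each value of df_1$A
for (i in df_1_A) {

  c_A   <- append(c_A, i)
  c_tmp <- c()

  # Inner loop: test every candidate substring from df_2$B against i
  for (j in df_2_B) {
    is_match <- str_detect(i, j)
    c_tmp <- append(c_tmp, is_match)
    if (is_match) {
      print(str_c(i, ':  ', j))
    }
  }

  # Number of df_2$B values found in i (a count, not the matched text)
  c_B <- append(c_B, sum(c_tmp))

}

rs <- data.frame(c_A, c_B)
rs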
JiaxiangBU commented 4 years ago

@slsongge

library(tidyverse)
## -- Attaching packages --------------------------------------------------------------------- tidyverse 1.2.1 --

## √ ggplot2 3.2.1     √ purrr   0.3.3
## √ tibble  2.1.3     √ dplyr   0.8.3
## √ tidyr   1.0.2     √ stringr 1.4.0
## √ readr   1.3.1     √ forcats 0.4.0

## Warning: package 'ggplot2' was built under R version 3.6.2

## Warning: package 'tidyr' was built under R version 3.6.2

## Warning: package 'purrr' was built under R version 3.6.1

## Warning: package 'dplyr' was built under R version 3.6.1

## -- Conflicts ------------------------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
df_1 <- data.frame(A = c('asd', 'adf', 'afg', 'agh', 'ahj'), stringsAsFactors = FALSE)
df_2 <- data.frame(B = c('as', 'af', 'aj'), stringsAsFactors = FALSE)

I think the for loops are more complicated than they need to be here; this can be done with vectorized operations.

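# Cross join: pair every A with every B through a constant key
left_join(df_1 %>% mutate(on = 1),
          df_2 %>% mutate(on = 1), by = 'on') %>%
    select(-on) %>%
    group_by(A) %>%
    # Flag pairs where B occurs in A; keep the matched B, otherwise an empty string
    mutate(match = str_detect(A, B),
           text = ifelse(match, B, "")) %>%
    # One row per A: number of matches and the concatenated matched text
    summarise(
        match = sum(match),
        text = str_flatten(text, "")
    )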
## # A tibble: 5 x 3
##   A     match text 
##   <chr> <int> <chr>
## 1 adf       0 ""   
## 2 afg       1 af   
## 3 agh       0 ""   
## 4 ahj       0 ""   
## 5 asd       1 as
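
The summary above sorts the rows by A and returns an empty string where nothing matched, whereas the question asks for NA and a new column on df_1 itself. A minimal sketch of one way to get that shape is below. It is not the method used in the reply above; it swaps in a single regular expression built from df_2$B. It assumes the values in df_2$B contain no regex metacharacters and that at most one of them matches each value of A. If several match, only the leftmost match in the string is kept.

```r
library(stringr)

df_1 <- data.frame(A = c("asd", "adf", "afg", "agh", "ahj"), stringsAsFactors = FALSE)
df_2 <- data.frame(B = c("as", "af", "aj"), stringsAsFactors = FALSE)

# Combine the candidate substrings into one alternation pattern: "as|af|aj".
# Assumption: df_2$B holds plain text with no regex metacharacters.
pattern <- str_c(df_2$B, collapse = "|")

# str_extract() is vectorized over df_1$A: it returns the first matching
# substring for each element, or NA when nothing matches.
df_1$B <- str_extract(df_1$A, pattern)

df_1
```

With the sample data, the new column holds "as" for asd, "af" for afg, and NA for the other three rows, and the original row order of df_1 is preserved.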