bnosac / udpipe

R package for Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing Based on the UDPipe Natural Language Processing Toolkit
https://bnosac.github.io/udpipe/en
Mozilla Public License 2.0

add txt_contains #53

Closed jwijffels closed 5 years ago

jwijffels commented 5 years ago

Add the function below, but replace trimws with a custom implementation so as not to depend on a newer R version.

txt_contains <- function(x, terms, value = FALSE, ignore.case = TRUE){
  terms  <- paste(trimws(terms), collapse = "|")
  result <- grepl(pattern = terms, x = x, ignore.case = ignore.case)
  if(value == TRUE){
    result <- x[result]
  }
  result
}
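
A minimal sketch of what such a trimws replacement could look like, using only gsub from base R. The helper name trim_ws is hypothetical and this is not necessarily what ended up in the package:

```r
# Hypothetical stand-in for trimws(): strips leading and trailing
# spaces, tabs, carriage returns and newlines using base R's gsub.
trim_ws <- function(x){
  gsub(pattern = "^[ \t\r\n]+|[ \t\r\n]+$", replacement = "", x = x)
}

trim_ws(c("  learn ", "\tnet\n", "deep learning"))
```

Whitespace inside a term (as in "deep learning") is left untouched, only the ends are trimmed.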

manuelbickel commented 5 years ago

Just a side note: grepl with collapsed patterns is limited in the length/number of patterns it can handle, see here: https://stackoverflow.com/a/47221034/4907892. The pattern can be quite long, so the limit will probably not be reached in practice (I'm not sure where the limit actually is). However, depending on how failsafe the function should be, you might want to consider this.
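
As a rough way to see how large a collapsed pattern gets, its size in bytes can be checked with nchar from base R. The terms below are made up for illustration, and where the practical limit lies depends on the regex engine:

```r
# Hypothetical example terms, collapsed the same way txt_contains does it
terms   <- c("learn", "net")
pattern <- paste(terms, collapse = "|")

# Size of the collapsed pattern in bytes
pattern_bytes <- nchar(pattern, type = "bytes")
pattern_bytes
```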

jwijffels commented 5 years ago

thanks for the link. was aware of this.

Note: add something like this?

terms_similarity <- function(data, subset, terms, minfreq = 5, family, nfolds = 5){
  stopifnot(all(c("doc_id", "text", "target") %in% colnames(data)))
  if(missing(subset)){
    x <- data
  }else {
    e    <- substitute(subset)
    rows <- eval(e, data, parent.frame())
    rows <- rows & !is.na(rows)
    x <- data[rows, , drop = FALSE]
  }
  x <- x[, c("doc_id", "text")]

  dtm <- strsplit.data.frame(x, term = "text", group = "doc_id")
  dtm$term <- dtm$text
  dtm$freq <- 1L
  dtm <- document_term_matrix(dtm[, c("doc_id", "term", "freq")])

  terminology   <- txt_contains(colnames(dtm), terms = terms, value = TRUE)
  X <- dtm_remove_lowfreq(dtm, minfreq = minfreq)
  X <- dtm_remove_terms(X, terms = terminology)
  Y <- data$target[match(x = rownames(X), table = data$doc_id)]

  model <- glmnet::cv.glmnet(x = X, y = Y, family = family, nfolds = nfolds)
  plot(model)
  relevant_lambda1se <- predict(model, type = "coefficients", s = "lambda.1se")[, 1]
  relevant_lambda1se <- relevant_lambda1se[relevant_lambda1se != 0]
  relevant_lambda1se <- sort(relevant_lambda1se, decreasing = TRUE)
  relevant_lambda1se <- data.frame(term = names(relevant_lambda1se), similarity = as.numeric(relevant_lambda1se), stringsAsFactors = FALSE)

  relevant_lambdamin <- predict(model, type = "coefficients", s = "lambda.min")[, 1]
  relevant_lambdamin <- relevant_lambdamin[relevant_lambdamin != 0]
  relevant_lambdamin <- sort(relevant_lambdamin, decreasing = TRUE)
  relevant_lambdamin <- data.frame(term = names(relevant_lambdamin), similarity = as.numeric(relevant_lambdamin), stringsAsFactors = FALSE)
  list(terms = terms,
       terminology = terminology,
       similarity = list(lambda.1se = relevant_lambda1se, lambda.min = relevant_lambdamin))
}

jwijffels commented 5 years ago

The docs of grepl say: "Long regular expressions may or may not be accepted: the POSIX standard only requires up to 256 bytes." I went for the safer option and wrote a loop. The similarity functions will be put into another package to avoid dependency issues.
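
For illustration, a loop-based variant could look roughly like the sketch below. It is an assumption about the general approach, not the code that was actually committed, and the name txt_contains_loop is hypothetical. Each term is matched on its own, so no single long pattern is ever built:

```r
# Sketch of a loop-based txt_contains: one grepl call per term
# instead of one call on a collapsed "a|b|c" pattern.
txt_contains_loop <- function(x, terms, value = FALSE, ignore.case = TRUE){
  # trim whitespace around each term without relying on trimws
  terms  <- gsub(pattern = "^[ \t\r\n]+|[ \t\r\n]+$", replacement = "", x = terms)
  result <- rep(FALSE, length(x))
  for(term in terms){
    # an element matches if it matches any of the terms
    result <- result | grepl(pattern = term, x = x, ignore.case = ignore.case)
  }
  if(value){
    result <- x[result]
  }
  result
}

x <- c("machine learning", "deep nets", "cooking")
txt_contains_loop(x, terms = c(" learn", "net "))
txt_contains_loop(x, terms = c(" learn", "net "), value = TRUE)
```

As in the original proposal, each term is still interpreted as a regular expression. The trade-off is one pass over x per term rather than a single pass.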