IntegralEnvision / integral

Package for Integral functions
https://integralenvision.github.io/integral/
Other
0 stars 0 forks source link

Feature Request: Function to intelligently create line breaks in long chemical names #69

Open jzadra opened 1 year ago

jzadra commented 1 year ago

I have written some fairly complex code for the ACC project that looks for places to break long chemical names for string wrapping starting from optimal places and then checking for other places if the string length per line is still too long. Should functionalize this and add it to this package.

jzadra commented 1 year ago

Example code from ACC "company_report_cleaning.R":

mysep <- c("\\s", ",", "-") #This is a method of finding good locations in a long chemical string to put line breaks. This just happens to work for the priority because of the alphabetical order (try sort(mysep)).  If seps are added will need to re-work how priority works. #JZ 1-20-22

chem_name_pretty <- mysep %>% #finding line break locations
      map(function(mysep) { #we go through an iteration of this for each separating character
        chem_name %>% 
          separate_rows(chemical_name, sep = mysep) %>% 
          mutate(chars = nchar(chemical_name)) %>% 
          mutate(sumchar = cumsum(chars)) %>% 
          mutate(line = sumchar %/% 50) %>% #How many additional lines beyond 1 the name will take up for that separator
          add_column(mysep)
      }) %>% 
      bind_rows() %>% #Combine the results of each sep character break
      group_by(mysep) %>% 
      mutate(priority = cur_group_id()) %>% #Spaces are the best breakpoints, followed by commas, and then by dashes. This just happens to work for the priority because of the alphabetical order (try sort(mysep)).
      filter(!any(chars > 50)) %>%  #If any group has more than 50 chars in a single line (ie it doesn't get separated into <50), we get ride of the entire group.
      ungroup()

    if(nrow(chem_name_pretty) == 0) { #if the name can't be broken into less than 50 with any of our separators, we just break at 50 chars
      #print(glue("Forcing break of {chem_name} for {cc}"))
      chem_name_pretty <- chem_name %>% 
        str_wrap(width = 50) %>% 
        stringr::str_replace_all(fixed("\n"), fixed(" \\newline "))

      #This is so ugly, but I'm out of time on this, and there aren't too many examples of this issue. Example is cas 101162601
      is_badwrap <- chem_name_pretty %>% 
        enframe(name = NULL) %>% 
        separate_rows(value, sep = "\\\\newline") %>% 
        mutate(char = nchar(value)) %>% 
        pull(char) %>% 
        any(. > 52)

      if(is_badwrap) {
        n_splits <- nchar(chem_name) %/% 52

        split_name <- character()
        for(i in seq(n_splits + 1)) {
          split_name <- paste0(split_name, str_sub(chem_name, i*52 -51, i*51), " \\newline ")
        }
        chem_name_pretty <- split_name %>% str_remove(" \\\\newline $")
      } #/ugly

      # chem_name %>% 
      #   separate_rows(chemical_name, sep = "\\[") %>% 
      #   mutate(chemical_name = paste0(chemical_name, "[")) %>% 
      #   mutate(char = nchar(chemical_name)) %>% 
      #   mutate(cumchar = cumsum(char)) %>% 
      #   mutate(line = cumchar %/% 52)

    } else { #If we do have one that doesn't have more than 50 lines, we have to put it back together into a single string
      chem_name_pretty <- chem_name_pretty %>% 
        group_by(mysep) %>% 
        mutate(lines = max(line)) %>% #Total additional lines beyond one for the separator
        ungroup() %>% 
        filter(lines == min(lines)) %>% #Keep whichever seperater has the fewest total lines. #TODO: would be better to combine seps so that even if most of the time spaces work, but one time another sep has fewer lines, we don't discard the highest priority sep all together.  This should be a function in integral
        filter(priority == max(priority)) %>% #In cases where we have ties, we keep the highest priority sep.
        group_by(line, mysep) %>% 
        summarize(linecat = paste0(chemical_name, collapse = mysep), .groups = "drop") %>% #Combine separated rows for each line back into a single row using the sep
        ungroup() %>% 
        summarize(linecat = paste0(linecat, collapse = paste0(mysep, " \\newline "))) %>% #Combined all lines into a single row.
        mutate(linecat = str_replace_all(linecat, fixed("\\s"), " ")) %>% #If we were using a '\\s' sep, convert it into a space.
        pull(linecat)
    }