VincyaneBadouard / TreeData_broken

Harmonization and correction forest data tool.
https://vincyanebadouard.github.io/TreeData/

Find best way to deal with more than [Genus species] in species info column #58

Closed ValentineHerr closed 1 year ago

ValentineHerr commented 1 year ago

@LimaRAF said: In TreeCo, the subspecies and variety (as well as forms and hybrids) in 'Species information' are all under a single column with the infra-specific epithet. How should I declare the equivalencies of these columns?

I need to find a way to be more flexible with that.... It is really not easy as there are many ways the species info may be recorded...

Maybe it needs to be something similar to codes, where the user is asked to select all the columns related to species info and then to say something like "1st word is family, 2nd is genus, 3rd is species, 4th is subspecies", etc. The problem is that this is not going to work if not all records have the same number of pieces of information.
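To make the positional idea concrete, here is a minimal sketch in base R. The helper name `split_by_position` and the example names are hypothetical, not part of TreeData; it only shows the mechanics of assigning words to user-declared positions and padding shorter records with NA.

```r
# Hypothetical helper (not part of TreeData): split a free-text name on
# whitespace and assign the pieces to user-declared positions.
# Records with fewer words are padded with NA; extra words are dropped.
split_by_position <- function(x, positions = c("Genus", "Species", "Subspecies")) {
  pieces <- strsplit(trimws(x), "\\s+")
  out <- t(vapply(pieces, function(p) {
    length(p) <- length(positions)  # pads with NA or truncates
    p
  }, character(length(positions))))
  colnames(out) <- positions
  as.data.frame(out, stringsAsFactors = FALSE)
}

nm <- c("Pouteria guianensis", "Pouteria", "Inga alba alba")
res <- split_by_position(nm)
res
```

This also illustrates the weakness mentioned above: a record such as "Inga alba var. alba" would put "var." in the third position, so a purely positional mapping only works when every record follows the same layout.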

LimaRAF commented 1 year ago

@ValentineHerr What I have been doing is adding an extra (internal) step to process data in the genus + epithet + infra-epithet + authorship column format (or any other format/combination of columns that networks may have) and convert it to the format I want to have at the end.

ValentineHerr commented 1 year ago

@LimaRAF can you show me how that looks? (e.g. paste some lines of code here so I can get an idea of what it looks like)

LimaRAF commented 1 year ago

Some ideas @ValentineHerr :

1) Try to detect/obtain which columns contain the taxonomic info (here I am declaring them manually, but this can be user-declared or automated by grepping keywords (genus, epithet, species, etc.) on the df column names).

col_original <- c("Genus_original", "Species_original")

2) In my case, I want all name info (genus, epithet, modifiers, prepositions, infra-epithets and authorships) in a single column (function code below). It is easier to combine all the info into a single column than to break one column with all the info into the appropriate columns...

organismNameOriginal <- .build.name(species.data, col_original)
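For step 1, the keyword-based detection of taxonomic columns could look roughly like this. It is only a sketch: `find_tax_cols` and its keyword list are assumptions for illustration, and the keywords would need tuning for each network's column naming.

```r
# Hypothetical helper: return the column names that look taxonomic.
# The keyword list is an assumption, not an exhaustive or official one.
find_tax_cols <- function(df,
                          keywords = c("family", "genus", "species",
                                       "epithet", "author")) {
  pattern <- paste(keywords, collapse = "|")
  grep(pattern, names(df), ignore.case = TRUE, value = TRUE)
}

# Toy data standing in for a real inventory table
species.data <- data.frame(
  PlotID = 1:2,
  Genus_original = c("Pouteria", "Inga"),
  Species_original = c("guianensis", "alba"),
  stringsAsFactors = FALSE
)
col_original <- find_tax_cols(species.data)
col_original
```

Because matching is by substring, a column called "Subspecies" would also be picked up by the "species" keyword, so the result is best shown to the user for confirmation rather than used blindly.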

I generally do some processing before building the names to detect name modifiers (cf., aff.), the "prepositions" of names at the infra-specific level (var., subsp., f., etc.) and to standardize morpho-species notation. I use the function plantR::fixSpecies(), which is not perfect but helps flag possible cases in most situations.
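As a rough illustration of that kind of flagging (this is a base R sketch, not what plantR::fixSpecies() does internally), modifiers and infra-specific rank markers can be detected with regular expressions. The helper name `flag_name_parts` and the patterns are assumptions and cover only a few common abbreviations.

```r
# Hypothetical helper: flag open-nomenclature modifiers (cf., aff.) and
# infra-specific rank markers (var., subsp., ssp., f.) in a vector of names.
# The patterns are illustrative only and far from exhaustive.
flag_name_parts <- function(x) {
  data.frame(
    name = x,
    has_modifier = grepl("(^|\\s)(cf|aff)\\.?(\\s|$)", x, ignore.case = TRUE),
    has_infra_rank = grepl("\\s(var|subsp|ssp|f)\\.(\\s|$)", x, ignore.case = TRUE),
    stringsAsFactors = FALSE
  )
}

flags <- flag_name_parts(c("Inga cf. alba", "Inga alba var. alba", "Inga alba"))
flags
```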

#' 
#' @title Build Organism Name
#' 
#' @description Combine different table columns with species name information (i.e.
#'   genus, epithet, infra-epithet) into a single organism name
#' 
#' @param x the data frame with the taxonomic information
#' @param col.names the name of the columns containing the information to be
#'   combined in the desired order
#'
#' @return a vector with the combined information
#' 
#'  
#' @keywords internal
#' 
#' @noRd
#' 
.build.name <- function(x, col.names = c("Genus_original", "Species_original"))
  {

  if (any(!col.names %in% colnames(x)))
    stop("One or more names in 'col.names' were not found in 'x'")

  # cols <- names(x)[names(x) %in% col.names]
  cols <- names(x)[match(col.names, names(x), nomatch = 0)]

  if (length(cols) > 1) {

    #organismName <- apply(x[, cols], 1, paste, collapse = " ")
    organismName <- do.call(paste, x[, cols])
    organismName <- gsub(" NA$", "", organismName, perl = TRUE)
    organismName <- .squish(organismName)
    return(organismName)
  } else {
    warning("Fewer than two columns found; skipping...")
  }    
}

#' 
#' @title Remove Unwanted Spaces
#' 
#' @param x a character vector
#'
#' @return the character vector `x` without leading, trailing or double spaces
#'  
#' @keywords internal
#'
#' @noRd
#' 
.squish <- function (x) {
  x <- gsub("\\s\\s+", " ", as.character(x), perl = TRUE)
  x <- gsub("  ", " ", x, perl = TRUE)
  x <- gsub("^ | $", "", x, perl = TRUE)
  return(x)
}
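
One thing to be aware of: `.build.name()` only strips a trailing " NA", so a missing value in a middle column stays in the name as the text "NA". A possible variant, sketched below under the assumption that NA in any position should simply be dropped, blanks out missing values before pasting. `build_name_sketch`, the toy data and the placeholder authorship "Auth." are all hypothetical.

```r
# Hypothetical variant of the name builder: replace NA with "" before
# pasting, so a missing value in ANY column is dropped from the name.
build_name_sketch <- function(x, col.names) {
  stopifnot(all(col.names %in% names(x)))
  parts <- lapply(x[col.names], function(v) {
    v <- as.character(v)
    v[is.na(v)] <- ""
    v
  })
  out <- do.call(paste, parts)
  # collapse runs of whitespace left by the blanks, then trim both ends
  trimws(gsub("\\s+", " ", out))
}

# Toy data; "Auth." is a placeholder, not a real authorship
toy <- data.frame(
  Genus_original = c("Inga", "Inga"),
  Species_original = c("alba", NA),
  Infra_original = c(NA, NA),
  Author_original = c("Auth.", NA),
  stringsAsFactors = FALSE
)
built <- build_name_sketch(toy, names(toy))
built
```

With the original function, the first row would come out as "Inga alba NA Auth.", since only the trailing " NA" is removed.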
ValentineHerr commented 1 year ago

I see @LimaRAF, thanks for sharing. It is indeed easier to combine all the columns together, but here we want to be able to fill in separate columns since the output profile often needs it...
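
Going the other way (one combined name into separate columns) could start from something like the sketch below. It assumes names follow a "Genus epithet [rank infra-epithet]" layout and relies on rank markers (var., subsp., ssp., f.) to find the infra-specific part. The function `split_name` and the output column names are hypothetical; authorship, hybrids, family prefixes and modifiers such as cf. or aff. are not handled.

```r
# Hypothetical parser: split "Genus epithet [rank infra-epithet]" strings
# into separate columns. Authorship, hybrids and modifiers are NOT handled.
split_name <- function(x) {
  x <- trimws(gsub("\\s+", " ", x))
  rank_pat <- "\\s(var\\.|subsp\\.|ssp\\.|f\\.)\\s(\\S+)"
  m <- regmatches(x, regexec(rank_pat, x))
  # extract capture group i, or NA when no rank marker was found
  pick <- function(i) {
    vapply(m, function(h) if (length(h) == 3) h[i] else NA_character_,
           character(1))
  }
  # drop the infra-specific part (and anything after it) to get the binomial
  binomial <- trimws(sub(paste0(rank_pat, ".*$"), "", x))
  words <- strsplit(binomial, " ", fixed = TRUE)
  data.frame(
    Genus = vapply(words, function(w) w[1], character(1)),
    Species = vapply(words, function(w) w[2], character(1)),
    InfraRank = pick(2),
    InfraEpithet = pick(3),
    stringsAsFactors = FALSE
  )
}

parsed <- split_name(c("Inga alba var. alba", "Pouteria guianensis", "Pouteria"))
parsed
```

Records that do not fit the assumed layout end up with NA in the missing columns, which at least makes them easy to list and hand back to the user for manual mapping.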