drostlab / metablastr

Seamless Integration of BLAST Sequence Searches in R
https://drostlab.github.io/metablastr/
GNU General Public License v2.0

Feature request: add taxon ID for each BLAST hit #7

Open cparsania opened 4 years ago

cparsania commented 4 years ago

The default BLAST tabular output (outfmt 7) doesn't include a taxon ID for each hit. The taxon ID is very important for downstream phylogenetic analysis. An indirect way to add it is to run blastdbcmd with the %T option once the results are in, but this is very time-consuming: you have to retrieve the taxon IDs first and then map them back to the original BLAST results. Could metablastr provide a function that maps taxon IDs onto the BLAST output?
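To make the "map back" step concrete, here is a minimal sketch in base R. It assumes the BLAST hits and the accession/taxon ID pairs (e.g. parsed from blastdbcmd output using %T) have already been read into data frames. The accessions, values, and the column name staxid are made up for illustration and are not metablastr output.

```r
# Toy BLAST hits; column names mimic BLAST tabular fields (illustrative values only).
blast_hits <- data.frame(
  qseqid = c("q1", "q1", "q2"),
  sseqid = c("XP_000001.1", "XP_000002.1", "XP_000001.1"),
  pident = c(98.5, 87.2, 91.0),
  stringsAsFactors = FALSE
)

# Toy lookup table standing in for parsed accession/taxon ID pairs (hypothetical).
taxid_lookup <- data.frame(
  sseqid = c("XP_000001.1", "XP_000002.1"),
  staxid = c("9606", "10090"),
  stringsAsFactors = FALSE
)

# Left join: keep every BLAST hit and attach its taxon ID where one is known.
annotated <- merge(blast_hits, taxid_lookup, by = "sseqid", all.x = TRUE)
print(annotated)
```

Hits whose subject accession is missing from the lookup table would get NA in staxid rather than being dropped, which makes unmapped hits easy to spot.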

HajkD commented 4 years ago

Hi @cparsania

Many thanks for contacting me and I very much appreciate your feedback.

Could you be more specific about where you are missing the taxon ID information? Is it when BLASTing against, e.g., NCBI nr, or when using metablastr::blast_genomes()? In any other scenario, the scientific name of the species is given when BLASTing against a genome.

I will then see what I can do.

Many thanks, Hajk

cparsania commented 4 years ago

Yes, you are right. BLAST gives the subject's scientific name but not the taxon ID. The taxon ID is required if, for example, you want to assign a specific taxonomic rank (e.g. family, class, genus, kingdom, superkingdom) to a given species.
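As a rough sketch of what "assigning a rank" could look like once a taxon ID is known: the helper below (rank_name is a hypothetical name, not part of any package) pulls the name at one rank out of a lineage table with name and rank columns. That table shape is an assumption modelled on what taxize::classification() returns per taxon ID. The lineage here is hand-written so the example runs offline.

```r
# Return the name at a given rank from one lineage table (columns: name, rank).
# Returns NA if the lineage is missing or has no entry at that rank.
rank_name <- function(lineage, rank) {
  if (!is.data.frame(lineage)) return(NA_character_)
  hit <- lineage$name[lineage$rank == rank]
  if (length(hit) == 0) NA_character_ else hit[1]
}

# Hand-written lineage for illustration only (not fetched from NCBI).
toy_lineage <- data.frame(
  name = c("Eukaryota", "Mammalia", "Hominidae", "Homo", "Homo sapiens"),
  rank = c("superkingdom", "class", "family", "genus", "species"),
  stringsAsFactors = FALSE
)

family <- rank_name(toy_lineage, "family")
print(family)

# With network access, real lineages could be fetched along these lines (not run here):
# lineages <- taxize::classification("9606", db = "ncbi")
# sapply(lineages, rank_name, rank = "family")
```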

After raising this issue, I found the R package taxize, which actually solves the problem. It has a function, taxize::genbank2uid(), which returns the NCBI taxonomy ID for a given GenBank ID.

Below is a wrapper function I wrote, which just reformats the output of taxize::genbank2uid() and returns it as a tbl:

#' Wrapper function around taxize::genbank2uid.
#'
#' Given GenBank accession strings or GI number strings \code{(x)}, it returns a tibble of taxid, name and other columns.
#' @param x character vector of GenBank accession strings or GI number strings.
#' @param ... other parameters to be passed to \code{taxize::genbank2uid}
#'
#' @return a tbl with colnames x, taxid, class, match, multiple_matches, pattern_match, uri, name
#' @export
#' @importFrom taxize genbank2uid
#' @importFrom tibble tibble
#' @importFrom dplyr bind_cols
#' @importFrom purrr map_df
#' @examples
#' \dontrun{
#' x <- c("XP_022900619.1", "XP_022900618.1", "XP_018333511.1", "XP_018573075.1")
#' genbank2uid_tbl(x = x)
#' }
genbank2uid_tbl <- function(x, ...) {

        # Look up the taxid for each element of x; the metadata
        # (class, match, name, ...) is attached to each result as attributes.
        uid_list <- taxize::genbank2uid(x, ...)

        # Flatten the taxids into one column and bind each result's attributes
        # as additional columns. No pipe is used, so magrittr is not required.
        uid_tbl <- dplyr::bind_cols(
                tibble::tibble(x = x, taxid = unlist(uid_list)),
                purrr::map_df(uid_list, attributes)
        )
        return(uid_tbl)

}

HajkD commented 4 years ago

Hi @cparsania

Excellent. I will have a look at how to best integrate this taxonomy information into the BLAST output.

Many thanks, Hajk