Open cparsania opened 4 years ago
Hi @cparsania
Many thanks for contacting me and I very much appreciate your feedback.
Could you be more specific about where you are missing the taxon ID information? Is it when BLASTing against, e.g., NCBI nr, or when using metablastr::blast_genomes()?
I ask because in any other scenario the scientific name of the species is given when BLASTing against a genome.
I will then see what I can do.
Many thanks, Hajk
Yes, you are right. BLAST gives the subject's scientific name but not its taxon ID. The taxon ID is required, for example, if you want to assign a specific taxonomic rank (e.g. family, class, genus, kingdom, superkingdom) to a given species.
After I raised this issue, I found the R package taxize, which actually solves the problem. That package has a function called taxize::genbank2uid(), which returns the NCBI taxonomy ID for a given GenBank ID.
Below is the wrapper function I wrote, which just reformats the output of taxize::genbank2uid() and returns it as a tbl.
#' Wrapper function around taxize::genbank2uid.
#'
#' Given GenBank accession (alphanumeric) strings or GI (numeric) strings \code{x}, it returns a tibble of taxid, name and other columns.
#' @param x vector of GenBank accession (alphanumeric) strings or GI (numeric) strings.
#' @param ... other parameters to be passed to \code{taxize::genbank2uid}
#'
#' @return a tbl with colnames x, taxid, class, match, multiple_matches, pattern_match, uri, name
#' @export
#' @importFrom taxize genbank2uid
#' @importFrom tibble tibble
#' @importFrom dplyr bind_cols
#' @importFrom purrr map_df
#' @importFrom magrittr %>%
#' @examples
#' \dontrun{
#' x <- c("XP_022900619.1", "XP_022900618.1", "XP_018333511.1", "XP_018573075.1")
#' genbank2uid_tbl(x = x)
#' }
genbank2uid_tbl <- function(x, ...) {
  # Query NCBI: one result per element of x, with metadata stored as attributes
  uid_list <- taxize::genbank2uid(x, ...)
  # Combine the input IDs and taxids with the per-result attributes (name, uri, match, ...)
  uid_tbl <- tibble::tibble(x = x, taxid = unlist(uid_list)) %>%
    dplyr::bind_cols(purrr::map_df(uid_list, attributes))
  return(uid_tbl)
}
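To get from such a lookup table to annotated BLAST results, the taxon IDs still have to be mapped back onto the hits. The following is only a minimal sketch of that step, not how metablastr does it: the hit table and its column names (query_id, subject_id) are assumptions, and the taxid values are placeholders standing in for what genbank2uid_tbl() would return.

```r
# Hypothetical BLAST hit table; the column names are assumptions,
# not metablastr's actual output format.
blast_hits <- data.frame(
  query_id   = c("q1", "q1", "q2"),
  subject_id = c("XP_022900619.1", "XP_018333511.1", "XP_022900619.1"),
  stringsAsFactors = FALSE
)

# Stand-in for the tibble returned by genbank2uid_tbl();
# the taxid values are placeholders, not real NCBI taxonomy IDs.
taxid_lookup <- data.frame(
  x     = c("XP_022900619.1", "XP_018333511.1"),
  taxid = c("PLACEHOLDER_TAXID_1", "PLACEHOLDER_TAXID_2"),
  stringsAsFactors = FALSE
)

# match() keeps the original row order of the hits;
# subjects missing from the lookup table would get NA.
blast_hits$taxid <- taxid_lookup$taxid[
  match(blast_hits$subject_id, taxid_lookup$x)
]

print(blast_hits)
```

Because the same subject often appears in many hits, it is probably worth querying only unique(blast_hits$subject_id) and joining afterwards, so that each accession is looked up once.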
Hi @cparsania
Excellent. I will have a look at how to best integrate this taxonomy information into the BLAST output.
Many thanks, Hajk
The default BLAST tabular output format (outfmt 7) doesn't include a taxon ID for each BLAST hit. The taxon ID is very important for downstream phylogenetic analysis. An indirect way to add it is to run blastdbcmd with the %T option once the results are obtained. This is very time consuming, because you first have to retrieve the taxon IDs and then map them back to the original BLAST results. Could metablastr provide a function that maps taxon IDs to the BLAST output?