amcdavid / CellaRepertorium

Methods for clustering and analyzing high-throughput single cell immune cell repertoires (RepSeq)
15 stars 4 forks source link

Option -r 1 in cdhit-est does not work #13

Open SamaZYX opened 2 years ago

SamaZYX commented 2 years ago

I am trying to us the cdhit command to cluster nucleotide sequences in both strands but it does not seem to work.

The option "-r 1" (default parameter) tells cdhit to search in both strands but no matter what value I use, it is not reading it.

Could this be related to the missing part in cdhit-est.cpp from the original code?

if ( options.option_r ) {
    Comp_AAN_idx.resize( seq_db.NAAN );
    make_comp_short_word_index(options.NAA, NAAN_array, Comp_AAN_idx);
}
amcdavid commented 2 years ago

Hi @SamaZYX ,

I haven't had the need to cluster unstranded data. Actually, it looks like I over-rode the r=1 default because my use case was to cluster Vh segments, and treating these data as unstranded would return invalid results, https://github.com/amcdavid/CellaRepertorium/blob/27c07abfad7f44b31141057119a6c5c8860e33fd/R/cdhit-methods.R#L77

Do you have a reproducible example? As a next step, I would try calling cdhit directly, via cdhitC e.g. https://github.com/amcdavid/CellaRepertorium/blob/27c07abfad7f44b31141057119a6c5c8860e33fd/R/cdhit-methods.R#L89.

SamaZYX commented 2 years ago

Hi @amcdavid ,

My case is not related to VH nor Single Cell experiments. I was just trying to find if someone that had already made an implementation of CD-HIT in R for general use.

Unfortunately, I am no expert on R yet, and by looking at your scripts I just found out that you can add C code in R with Rcpp.

I tested your suggestion of running cdhitestC and cdhit by creating a new functions with the modified -r 1 parameter (see below) but I still get the same result as if I used r=0.

library("CellaRepertorium")
library("Biostrings")
library("dplyr")

cdhitestC2 <- function(opts, name, showProgress) {
  .Call('_CellaRepertorium_cdhitestC', PACKAGE = 'CellaRepertorium', opts, name, showProgress)
}

cdhit2 = function(seqs, identity = NULL, kmerSize = NULL, min_length = 6, s = 1, G = 1,
                 only_index = FALSE, showProgress = interactive(), ...) {
  if(any(width(seqs) < min_length)) stop("Some sequences shorter than `min_length`;
                                           remove these or decrease min_length")
  name = 'CD-Hit'
  uopts = list(...)
  options = list()
  options$i <- tempfile()
  options$s = s
  writeXStringSet(seqs, options$i)
  on.exit(unlink(options$i))
  type = switch(
    class(seqs),
    AAStringSet = 'cdhitC',
    DNAStringSet = 'cdhitestC',
    stop('seqs must be either AAStringSet or DNAStringSet')
  )
  if(type == 'cdhitestC'){ #DNA
    kmerSize = case_when(identity < .8 ~ 4, identity < .85 ~ 5,
                         identity < .88 ~ 6, identity < .9 ~ 7,
                         identity < .95 ~ 9, identity < 1 ~ 10, TRUE ~ 11)
    options = c(options, list(ap = 1, r = 1))
  } else{
    kmerSize = 5
  }
  options$n = kmerSize
  options$G = G
  options$c = identity
  options$l = min_length - 1
  options = c(uopts, options)
  options = options[!duplicated(names(options))]
  options = lapply(options, as.character)
  if(type == 'cdhitC'){
    seq_cluster_index = CellaRepertorium::cdhitC(options, name, showProgress) + 1
  } else{
    seq_cluster_index = cdhitestC2(options, name, showProgress) + 1
  }
  if(only_index) return(seq_cluster_index)
  tibble::tibble(query_name = names(seqs), seq = as.character(seqs),
                 cluster_idx = seq_cluster_index) %>%
    dplyr::group_by(cluster_idx) %>%
    dplyr::mutate(n_cluster = dplyr::n()) %>% ungroup()
}

tbl = cdhit2(nnseq, identity = 1, only_index = FALSE, g=1, l=10, r=1)

I tested the -r 0 and -r 1 in the original program on unix and in there it seems to work.

I can see that cdhitestC is calling the compiled C version of your cdhitest (cdhit-est.cpp) and I thought that the code in my first post might have been related to the issue.

I know that running cdhitest in both strands is not related to your line of work, however, it would be very nice to be able to use cdhit in your package in other lines of work where clustering of unstranded data is needed.

Thanks for your reply. Jose