More customisable barcode patterns

ChangqingW commented 10 months ago

Implement more customisable barcode patterns: std::vector<std::pair<std::string, std::string>> search_pattern is now a vector of pairs, pair.fist is the name of an pattern element (e.g. UMI, BC) and pair.second is the IUPAC codes for the element (e.g. 16 Ns for BC). This will handle custom protocols where the barcode and UMI are re-ordered, additional adaptors between BC and UMI (e.g. BGI's Stereo-seq) and patterned BC / UMIs with IUPAC ambiguous codes (e.g. NNNYRNNN).

For the Rcpp port I used a named string vector, which translate to vector of pairs fairly painlessly, but I am not sure how this should be done for the CLI. For now I hijacked your command line options -b and -u and replaced -l -r with -x, so the default could be specified as -x CTACACGACGCTCTTCCGATCT -b NNNNNNNNNNNNNNNN -u NNNNNNNNNNNN -x TTTTTTTTT (the order matters). I wonder if a JSON file would be more suitable, or should we stick to CLI options.

nadiadavidson commented 10 months ago

Thanks @ChangqingW, this looks very useful! So to confirm, this would allow a user to switch the search order of the UMI and BC with options like: -x CTACACGACGCTCTTCCGATCT -u NNNNNNNNNNNN -b NNNNNNNNNNNNNNNN -x TTTTTTTTT ? or add sequence between the barcode and UMI with something like: -x CTACACGACGCTCTTCCGATCT -b NNNNNNNNNNNNNNNN -x YR -u NNNNNNNNNNNN -x TTTTTTTTT ?

I'll do some testing to see how it goes. I think keeping these options as CLI is best as we want to keep it easy and intuative to run.

ChangqingW commented 10 months ago

this would allow a user to switch the search order of the UMI and BC with options like: -x CTACACGACGCTCTTCCGATCT -u NNNNNNNNNNNN -b NNNNNNNNNNNNNNNN -x TTTTTTTTT ? or add sequence between the barcode and UMI with something like: -x CTACACGACGCTCTTCCGATCT -b NNNNNNNNNNNNNNNN -x YR -u NNNNNNNNNNNN -x TTTTTTTTT ?

Yes, if there is not indels then this works fine, but with actual reads the indels will result in the aligned UMI shorter / longer than the pattern, but since we always output the same length as the pattern, the outputted UMI could violate the specified pattern. e.g. with -u NNNNNNYRNNNN a lot of the output UMI will not have YR in the specified location, as this is tolerated by the -f edit distance. Or with -u NNNNNNNNNNNN -x YRTTTTTT, a lot of UMI will inclulde the YR bit or even the polyT bit.

I recon to keep other downstream programs compatible, we should keep outputting UMIs in the exact length as specified, but we could include another column in the stats TSV (or even anther field in the fastq identifier) to output the UMI according to Edlib's alignment, which might slightly improve UMI collapsing?

Here's a snippet I used to test the -x CTACACGACGCTCTTCCGATCT -b NNNNNNNNNNNNNNNN -x YR -u NNNNNNNNNNNN -x TTTTTTTTT case

# functions to generate random bases
sample_N <- function(n) {
  if (length(n) > 1) {
    n <- sample(n, 1)
  }
  paste0(
    sample(c("A", "T", "C", "G"), n, replace = T),
    collapse = ""
  )
}

sample_Y <- function(n) {
  paste0(
    sample(c("C", "T"), n, replace = T),
    collapse = ""
  )
}

sample_R <- function(n) {
  paste0(
    sample(c("A", "G"), n, replace = T),
    collapse = ""
  )
}

# -x CTACACGACGCTCTTCCGATCT -b NNNNNNNNNNNNNNNN -x YR -u NNNNNNNNNNNN -x TTTTTTTTT
example_1 <- function(...) {
  BC <- sample_N(16)
  UMI <- sample_N(12)

  # add insertion / mutation to primer
  primer <- strsplit("CTACACGACGCTCTTCCGATCT", "")[[1]]
  primer[sample(seq_along(primer), 1)] <- sample_N(1:2)
  primer <- paste0(primer, collapse = "")

  read <- paste0(
    c(sample_N(5:10), primer, BC, sample_R(1), sample_Y(1), UMI, "TTTTTTTTT", sample_N(5:10)),
    collapse = ""
  )
  return(c(read, BC, UMI))
}

df <- sapply(1:100, example_1) |>
  t() |> 
  as.data.frame() |>
  setNames(c("read", "BC", "UMI"))

# write to test.fasta, with >BC_UMI as identifier field
fp <- file("test.fasta", open="wt")
for (i in seq_len(nrow(df))) {
  writeLines(
    c(
      paste0(c(">", df$BC[i], "_", df$UMI[i]), collapse = ""),
      df$read[i]
    ),
   fp 
  )
}
close(fp)

# also write a BC allow list for flexiplex
fp <- file("bc_test", open="wt")
writeLines(df$BC, fp)
close(fp)

nadiadavidson commented 10 months ago

Thanks @ChangqingW!

So if there is an indel in the left flank will this throw out the UMI location? Or is that adjusted like the current version? I suspect that UMIs may need to be corrected later on (ie. after the gene/transcript is known) to deal with collapsing properly. Something for the todo list perhaps!

I'm just testing for general sequence searching (no barcodes or demultiplexing) )see here. I'm getting a lot of "matches" to reads with high "N" content. These are the options I'm using: flexiplex -x

-k "?" -f 2 -b "" -u "" -i false <.fastq file> Options for the current version are: flexiplex -l -k "?" -r "" -f 2 -b 0 -u 0 -i false <.fastq file> Is there a better set of options for this scenario? Or should something other than "N" be used for wildcards? I had been passing "?" to the edlib comand to signify wildcard sequence.

Will let you know once I have results for the demultiplexing tests.

Cheers, Nadia.

ChangqingW commented 10 months ago

I'm just testing for general sequence searching (no barcodes or demultiplexing) )see here. I'm getting a lot of "matches" to reads with high "N" content. These are the options I'm using: flexiplex -x -k "?" -f 2 -b "" -u "" -i false <.fastq file> Options for the current version are: flexiplex -l -k "?" -r "" -f 2 -b 0 -u 0 -i false <.fastq file> Is there a better set of options for this scenario? Or should something other than "N" be used for wildcards? I had been passing "?" to the edlib comand to signify wildcard sequence.

I can just add the "?" equality back to edlib -- I did not know using N instead would make a difference. I have not tested the simple search case, could you share a sample fastq file so I can work out how to make results consistent with previous releases?

So if there is an indel in the left flank will this throw out the UMI location? Or is that adjusted like the current version? I suspect that UMIs may need to be corrected later on (ie. after the gene/transcript is known) to deal with collapsing properly. Something for the todo list perhaps!

Updated this in #7552a59. When the BC is immediately next to (either to the left or right) the UMI, it will be the same as your previous implementation, where the endDistance from edit_distance is used to infer the UMI location. When there are sequences in between them, the edlib alignment will be used to infer the UMI location -- this is probably ok-ish when the sequence between BC and UMI is not wildcards, but otherwise very unreliable.

nadiadavidson commented 8 months ago

Tests all looking good now that "?" is back. Will accept the request and then I think we can bump the version (I suggestion to 1, since the options have changed so dramatically.

DavidsonGroup / flexiplex

More customisable barcode patterns #23