BijnBau closed this issue 3 years ago
Hi @BijnBau,
Thank you for using scruff. No, parallelization is enabled in the current release versions on Bioconductor, and we have not made any changes to it. To help with this issue, can you provide the following information?

- The output of `sessionInfo()` after loading `library(scruff)`
- The output of `parallel::detectCores()` in R
- The output of the `scruff` function call you provided above

This will greatly help with debugging.

In addition, there might be some issues with the command you provided.

- The `Read1` and `Read2` parameters you provided above are the same. This might be a typo. To run scruff successfully, `read1Path` should be the path to the read file with the cell barcode and UMI sequence information, and `read2Path` should be the path to the read file containing the transcript sequence.
- `bcEdit=5` is rather high. It significantly increases computational complexity. Personally, I would not set such a high threshold for cell barcode mismatch correction. Reducing this number, perhaps to 1, might reduce the time needed for this part of the computation.

I think I should mention that scruff is for preprocessing scRNA-seq reads generated by plate-based, FACS-sorted protocols (such as CEL-Seq) with a predefined list of cell barcodes. This means we know beforehand which cell barcodes carry true cell-associated reads and which cell barcodes (if any) should not contain reads from cells. This differs from the 10X Genomics protocol, where the cell barcodes associated with cell-containing droplets are inferred by a cell-calling algorithm.
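On the `bcEdit` point, a toy illustration in base R may help show why a large threshold is costly: the more edits are allowed, the more whitelist barcodes become candidates for each observed barcode. This is only a sketch and does not reflect how scruff implements barcode correction internally; `candidate_barcodes` is a hypothetical helper and the 8-nt sequences are made up.

```r
# Hypothetical helper, not part of scruff: return the whitelist barcodes
# that lie within `maxEdit` edits of an observed barcode.
# utils::adist() computes the generalized Levenshtein (edit) distance.
candidate_barcodes <- function(observed, whitelist, maxEdit) {
  d <- drop(utils::adist(observed, whitelist))
  whitelist[d <= maxEdit]
}

# Toy whitelist (made-up sequences, for illustration only).
whitelist <- c("AAAAAAAA", "AAAACCCC", "CCCCCCCC")
observed  <- "AAAAAAAC"

# A strict threshold leaves a single candidate.
strict <- candidate_barcodes(observed, whitelist, maxEdit = 1)

# A loose threshold admits a second, more distant candidate.
loose <- candidate_barcodes(observed, whitelist, maxEdit = 5)

print(strict)
print(loose)
```

With a real whitelist of thousands of 16-nt barcodes, the candidate set at a threshold of 5 would likely be much larger than in this toy case.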
Thank you again for using scruff. Hope this is helpful to you.
Dear @zhewa,
I have used your suggestions to adapt the code. As I have no definite errors to work around, although I am still working to extend scruff to our dataset, I will close this thread.
Thank you very much for your kind suggestions!
Dear scruff team,
I have been trying to use your tool for my analysis of a 10X library. I have been using the option to parallelize the demultiplexing, but it does not seem to work: when I check how many cores the demultiplexing runs on, it only uses 1. It seems to me that parallelization was included in previous versions but not in the current one. Is this correct?
In addition, can you help me improve the current speed? I am planning to process 135,698,545 reads, which will take almost a week, and I would love to reach the speeds mentioned in your publication.
Thank you for your help and response! I have appended my code below.
```r
library(scruff)
library(parallel)

Read1 <- "10X/Thesis_data/TCells_Splice/sample/sample_L001_R1_001.fastq"
Read2 <- "10X/Thesis_data/TCells_Splice/sample/sample_L001_R1_001.fastq"
Fasta <- "References/refdata-gex-GRCh38-2020-A/fasta/genome.fa"
Barcodes <- scan("10X/Thesis_data/TCells_Splice/sample/barcodes.tsv", what="list")
Reference <- "/References/refdata-gex-GRCh38-2020-A/genes/genes.gtf"
indexBase <- "GRCh38"

sample <- scruff(
  project=paste0("sample", Sys.Date()),
  experiment=c("sample"),
  lane=c("L001"),
  read1Path=c(Read1),
  read2Path=c(Read2),
  Barcodes,
  index=indexBase,
  Reference,
  bcStart=1,
  bcStop=16,
  bcEdit=5,
  umiStart=17,
  umiStop=26,
  keep=75,
  celPerWell="4489",
  nBestLocations=1,
  minQual=10,
  yieldReads=1e+06,
  alignmentFileFormat="BAM",
  demultiplexOutDir="./Demultiplex",
  alignmentOutDir="./Aligment",
  countUmiOutDir="./Count",
  demultiplexSummaryPrefix="Demultiplex_MS1390",
  alignmentSummaryPrefix="AlignmentMS1390",
  countPrefix="countUMI",
  logfilePrefix=format(Sys.time(), "%Y%m%d%H%M%S"),
  overwrite=FALSE,
  verbose=TRUE,
  cores=max(1, parallel::detectCores()-2),
  threads=1)

saveRDS(sample, "10X/Thesis_data/TCells_Splice/sample/sample.rds")
```
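
Before launching a long run like the one above, a quick check of the inputs and the available cores can be done in base R. The following is a minimal sketch, not part of scruff: `check_inputs` is a hypothetical helper, and the temporary file only stands in for a real FASTQ file.

```r
# Hypothetical helper, not part of scruff: return a character vector of
# problems found in the inputs (empty if none were found).
check_inputs <- function(read1Path, read2Path, cores) {
  problems <- character(0)

  # Read 1 (barcode + UMI) and read 2 (transcript) should be different files.
  if (identical(read1Path, read2Path)) {
    problems <- c(problems, "read1Path and read2Path point to the same file")
  }

  # Both files should exist on disk.
  paths <- unique(c(read1Path, read2Path))
  missing <- paths[!file.exists(paths)]
  if (length(missing) > 0) {
    problems <- c(problems, paste("file not found:", missing))
  }

  # detectCores() can return NA on some platforms, so guard against that.
  available <- parallel::detectCores()
  if (!is.na(available) && cores > available) {
    problems <- c(problems, "more cores requested than detected")
  }

  problems
}

# Stand-in for a real FASTQ file, passed twice to mimic identical paths.
r1 <- tempfile(fileext = ".fastq")
invisible(file.create(r1))

problems <- check_inputs(r1, r1, cores = 1)
print(problems)
```

Running `sessionInfo()` and `parallel::detectCores()` in the same session gives the version and core information that is usually requested when reporting a problem like this.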