GoekeLab / bambu-singlecell-spatial

Transcript discovery and quantification for long read single cell and spatial transcriptomics data using Bambu
GNU General Public License v3.0

Status of bambu single-cell/spatial #3

Open ljwharbers opened 1 month ago

ljwharbers commented 1 month ago

Hi!

Amazing effort to also make a single-cell/spatial variant of Bambu, really excited about this and looking forward to testing it.

I was wondering what the status of this is. Is it already mature enough for some testing? It won't be for anything publication-level (yet), just some initial tests and checking the quality of our sequencing.

Many thanks, Luuk

andredsim commented 1 month ago

Hi Luuk,

You are welcome to download the repo and try it based on the information already provided in the documentation. As it is still in a developmental stage, however, it is a bit unstable, and we have had reports of issues on HPCs. If you do have any issues, or if parts of the documentation are unclear, please let us know here.

Kind Regards, Andre Sim

ljwharbers commented 1 month ago

Hi Andre Sim,

Thanks for your response!

I've given it a try and indeed I suspect there is an issue with the parallelization and the HPC system, since I'm getting:

Error in reducer$value.cache[[as.character(idx)]] <- values : 
  wrong args for environment subassignment

Have you seen this error before? If not, I'm happy to dig a bit deeper into this. Would it be possible for you to share the updated bambu package? It would be a lot easier for me if I could clone it locally, make changes, and then install it from source.

EDIT: The error above might actually stem from running out of memory. I'll try to troubleshoot, but either way it would be easier to have the bambu version you're currently using available locally.

Best, Luuk

ljwharbers commented 1 month ago

It was indeed a memory issue. When trying to run with lowMemory, I came across some new errors: initially one where bambu_assignDist had an incorrect number of input files declared versus given in the workflow, and after that one with bambu_extend:

  --- Start extending annotations ---
  Error: BiocParallel errors
    1 remote errors, element index: 1
    0 unevaluated and other errors
    first remote error:
  Error in MatrixGenerics:::.load_next_suggested_package_to_search(x): Failed to find a rowRanges() method for list objects.
    However, the following packages are likely to contain the missing method
    but are not installed: sparseMatrixStats, DelayedMatrixStats.
    Please install them (with 'BiocManager::install(...)') and try again.
    Alternatively, if you know where the missing method is defined, install
    only that package.
  Execution halted

Again, I'm happy to troubleshoot myself and submit pull requests here and to the R package GitHub repo if/when I find fixes.

andredsim commented 1 month ago

Hi Luuk,

I will have to get back to you regarding the R package, as I need to check with a few people first.

In the meantime, could you please share the command line you used to run the Nextflow pipeline for the second example? Also, do you have the logs for the first error message? I can't quite place where in the pipeline it comes from based on the error message alone; I am guessing it is also from the bambu_discovery module?

Can I ask how large your input file is (fastq or bam?) and how much memory you are providing? In our attempts when not running in lowMemory mode, our largest sample used 62.54 GB for 105,064,936 reads (91 GB BAM file, 146 GB FASTQ file). This would increase greatly when running multiple samples if --lowMemory is not on.

Regarding your latest error message, this seems similar to another issue a user has reported, where the bambu dependencies are not being correctly loaded into the R instance. The rowRanges() method is meant to be imported from SummarizedExperiment.

Something you could try, and I would love to hear whether it works for you, is adding either of these to the nextflow.config:

process.containerOptions = "--no-home"

env {
    R_PROFILE_USER = "/.Rprofile"
    R_ENVIRON_USER = "/.Renviron"
}

Both of these are meant to stop local R package lists from being used.
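For reference, a quick way to check which libraries the container's R session is actually picking up (illustrative commands only, to be run in R inside the image):

# Inside the container's R session: SummarizedExperiment should attach cleanly,
# .libPaths() should list only container library paths (host paths would indicate
# leakage), and showMethods() shows which package provides the rowRanges() methods.
library(SummarizedExperiment)
.libPaths()
showMethods("rowRanges")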

I hope this helps, and I appreciate you reporting these errors and your troubleshooting attempts. Always happy to receive more so that we can get this ready for the full release.

Kind Regards, Andre Sim

ljwharbers commented 1 month ago

Hi Andre Sim,

Thanks for getting back to me and thanks for checking if it's possible to get access to the R package. Of course I totally understand that you need to check first if it's alright.

The first error message was due to the bambu_quant process being killed because it ran out of memory. I checked, and it's a known error message from BiocParallel after an OOM job kill. It's just a very uninformative error message, but not related to your pipeline.

Regarding the second error: I just tried including your recommended settings in the nextflow.config. However, this gave the same error. It didn't matter whether I included only the containerOptions, the R env settings, or both.

Regarding the size of my input BAM file, it's relatively small (25 GB). However, the memory issues stem from the fact that this comes from a custom spatial pipeline and we have millions of spatial barcodes (most with just a few reads). Therefore, most tools I've tried simply run out of memory, even with 2 TB of memory available.

Many thanks again for helping.

Best, Luuk

andredsim commented 1 month ago

Hi Luuk,

Since the first attempt (without --lowMemory) ran through bambu_discovery, did you get some of the early output in the output directory? Namely, I am interested in readClassFile.rds and quantData.rds. How big are these files, and would you be willing to send them to me so that I could try to run the quantification step? (I would also need extendedAnnotations.rds.)

Can you share with me the command used to run the pipeline and how many cores you are using?

I have never run bambu_quant with > 100,000 barcodes; it is, however, a theoretically light-memory step, as each core runs one barcode at a time. We did, however, run into issues where R would duplicate the whole environment per core, including the whole quantData object, which I imagine would be much larger with millions of barcodes than in my tests. A temporary fix could be to reduce the number of cores, even to 1. This will be slow but should hopefully run.
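To illustrate the duplication problem, here is a generic sketch (not bambu's actual code): any large object referenced by the worker function is copied to every worker, so peak memory grows roughly with the number of cores.

# Generic illustration, not bambu's code: with a Snow-style cluster each of the
# 4 workers receives its own copy of big_object, so memory scales with worker count.
library(BiocParallel)
big_object <- matrix(runif(1e6), nrow = 1000)   # stand-in for a large quantData
quant_one  <- function(j, big_object) sum(big_object[, j])
res <- bplapply(seq_len(ncol(big_object)), quant_one,
                big_object = big_object, BPPARAM = SnowParam(workers = 4))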

Regarding the second error (with --lowMemory), I am sorry that adding those lines did not solve it. In the work directory there should be the singularity image. You can open it with:

singularity shell work/singularity/lingminhao-bambusc-beta.img

If you then open R and run:

library(devtools)
load_all("/mnt/software/bambu")
rowRanges()

which of the two error messages do you see?

Error in rowRanges() : could not find function "rowRanges"

Error in rowRanges() : argument "x" is missing, with no default

This will help me determine whether it's the image or a local environment issue.

Thanks, Andre

ljwharbers commented 1 month ago

Hi Andre,

For your first point: it did indeed run through bambu_discovery without any issues. The readClassFile.rds and quantData.rds are ~800 MB and 1 GB, respectively. I tried running the pipeline with only 12 cores (hoping that would use less memory); in the end that used 1.67 TB of memory and still didn't finish, because I had requested 4 days of running time on the HPC and it timed out. You can download the files here (let me know if there are any permission issues):

The exact command for the 'non-lowmem' mode was:

nextflow run main.nf \
  --bams ${bamfile} \
  --genome ${ref_file} \
  --annotation ${gtf_file} \
  --ncore 12 \
  --outdir output \
  -with-singularity lingminhao/bambusc:beta \
  -resume

The command with low-mem was the same but with --lowMemory true.

Finally, regarding the rowRanges() issue. Following your instructions, I get the following error message:

Error in rowRanges() : argument "x" is missing, with no default

So it does indeed look like it's potentially an issue on my end (?).

Edit: The R session doesn't recognize other packages that I've installed locally, though, so it does appear to be executed in the container. Going into the container, binding the required files, and then trying to run the bambu_extend process line by line results in the same error. So I feel like it can't be on my end.

Many thanks again, Luuk

andredsim commented 1 month ago

Hi Luuk,

Thanks, I have received the files. I will try to find time to run these soon to figure out why the memory usage is ballooning so much.

Regarding the lowMemory error: this is very strange. Just to clarify, when you used the instructions I sent, you received this message:

Error in rowRanges() : argument "x" is missing, with no default

However, when you manually ran the bambu_extend script in the container, you then got the original error message? I am wondering if you could do a small test and repeat the line I presume the error emerges on:

extendedAnno = bambu(reads = samples, annotations = annotations, genome = "$genome", ncore = $params.ncore, discovery = TRUE, quant = FALSE, demultiplexed = TRUE, verbose = FALSE, assignDist = FALSE)

with ncore = 1.

I am wondering if it has something to do with processes being sent to different nodes on specific HPC configurations, where they do not have access to the singularity environment. You can set samples to something like samples[1:24] so that it does not take long. If this doesn't error out, that at least gives me a big clue. However, if it still returns the same original error while rowRanges() continues to return

Error in rowRanges() : argument "x" is missing, with no default

that is harder for me to understand. Could binding the files be interfering? If you run the test data, do you also encounter this error?

Thanks for your continued patience while we resolve this!

Kind Regards, Andre Sim

ljwharbers commented 1 month ago

Hi Andre,

Sorry for the continuous spam, and I really appreciate your help!

I think I figured out the issue with rowRanges(). I'm not sure if it was intended, but in the bambu_extend process the read class file is never actually loaded in; only the value of the path is being passed. When I change val(readClassFile) to path(readClassFile), add a line to actually load in the RDS file (reads = readRDS(samples)), and give this as input to the bambu call, it works for me. See the full process below; let me know if I overlooked something that would break things along the way.

process bambu_extend{
    publishDir "$params.outdir", mode: 'copy'

    cpus params.ncore
    maxForks params.ncore

    input:
    path(readClassFile)
    path(genome)
    path(annotation)
    val(bambuPath)
    val(NDR)

    output:
    path ('*extended_annotations.rds')
    path ('*extended_annotations_NDR1.rds')
    path ('extended_annotations_NDR1.gtf')

    script:
    """
    #!/usr/bin/env Rscript

    samples = "$readClassFile"
    samples = gsub("[][]","", gsub(' ','', samples))
    samples = unlist(strsplit(samples, ','))
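    # load the read class object from the RDS file so bambu receives the object itself, not just its path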
    reads = readRDS(samples)
    print(samples)

    library(devtools)
    if("$bambuPath" == "bambu") {
        load_all("/mnt/software/bambu")
    } else {
        load_all("$bambuPath")
    }
    annotations <- prepareAnnotations("$annotation")
    if(isFALSE($NDR)){
        extendedAnno = bambu(reads = reads, annotations = annotations, genome = "$genome", ncore = $params.ncore, discovery = TRUE, quant = FALSE, demultiplexed = TRUE, verbose = FALSE, assignDist = FALSE)
    } else{
        extendedAnno = bambu(reads = reads, annotations = annotations, genome = "$genome", ncore = $params.ncore, discovery = TRUE, quant = FALSE, demultiplexed = TRUE, verbose = FALSE, assignDist = FALSE, NDR = $NDR)
    }
    saveRDS(extendedAnno, "_extended_annotations.rds")
    extendedAnno_NDR1 = bambu(reads = reads, annotations = annotations, genome = "$genome", NDR = 1, ncore = $params.ncore, discovery = TRUE, quant = FALSE, demultiplexed = TRUE, verbose = FALSE, assignDist = FALSE)
    saveRDS(extendedAnno_NDR1, "_extended_annotations_NDR1.rds")
    writeToGTF(extendedAnno_NDR1, "extended_annotations_NDR1.gtf")
    """
}

It now runs completely through in lowMemory mode up until the quantification step, where there is still either a memory or a time issue.

Many thanks again.

Cheers, Luuk

andredsim commented 1 month ago

Ah great so it was a simple fix! Great that you found that.

I was working on reducing the memory usage today and hope to have a branch you can test soon. Essentially it will split the quantData and run it in chunks, so as not to overload the memory all at once, at the cost of some IO and some runtime. Looking at your data, I did see that the majority of the barcodes only have 1 or 2 reads. Each of these takes a decent overhead to set up the EM per barcode, which is essentially not needed due to the low read depth. Therefore I was considering a filter that would skip barcodes below a read threshold, which should hopefully speed up this step significantly. This shouldn't have a large impact on the majority of experiments, but it depends on what your downstream plans are with the count data. Let me know what you think.
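As a rough sketch, such a filter could look something like this (hypothetical code, not the actual implementation; it assumes a barcode-per-column count matrix, here called se$countMatrix, and min_reads is an arbitrary example threshold):

# Hypothetical sketch of the proposed filter, not the actual bambu code:
# drop barcodes whose total read count falls below a threshold before the
# per-barcode EM is set up.
min_reads <- 5                                        # example threshold
keep <- Matrix::colSums(se$countMatrix) >= min_reads  # total reads per barcode
se$countMatrix <- se$countMatrix[, keep, drop = FALSE]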

Kind Regards, Andre Sim

ljwharbers commented 1 month ago

Hi Andre,

Perfect, that would be great to test. Regarding the read counts per barcode: I think in my specific case I cannot filter out these reads, because at a later step I would bin multiple barcodes together. Perhaps I could do the binning beforehand, but I'd need to look into this.

Let me know if the branch is ready for testing!

Cheers, Luuk

andredsim commented 1 month ago

Hi Luuk,

You are welcome to test out the chunk_quantData branch. It is not documented, but it has the --chunk_size parameter, which defaults to 16384. It will now split the quantData into chunks of 16384 barcodes and write them to disk, then load only one chunk at a time to quantify it. It will then group the results back together and return the standard output. Let me know if this solves the memory issue. Unfortunately this will not resolve the time issue (I didn't implement the filter). Note: do not run this branch with --lowMemory mode; I have yet to update that with these changes (ironically).
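For reference, a rough sketch of the chunking idea (hypothetical variable names; the branch's actual code may differ):

# Hypothetical sketch of the chunking approach: split the barcode columns into
# groups of chunk_size, write each group to disk, then load and quantify one
# group at a time before recombining the results.
chunk_size <- 16384
barcodes   <- colnames(se$countMatrix)            # one column per barcode
chunks     <- split(barcodes, ceiling(seq_along(barcodes) / chunk_size))
dir.create("chunks", showWarnings = FALSE)
for (i in seq_along(chunks)) {
  saveRDS(se$countMatrix[, chunks[[i]]], paste0("chunks/counts_", i, ".rds"))
}
# each chunk file is then read back with readRDS(), quantified independently,
# and the per-chunk results are combined into the standard output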

If you have the option to, I would recommend combining barcodes where appropriate (maybe you know which spots belong together via spatial coordinates?). This is beneficial for many reasons, reducing memory and speed problems among them. Ultimately I want to add user-defined barcode combining as an input parameter, but that will come later.

Hope this helps, Andre

ljwharbers commented 1 month ago

Hi Andre,

Thanks a lot! I'm definitely trying it out this week. Just a quick question to be sure: is the following line correct?

saveRDS(list(countMatrix = se\$countMatrix[,chunk], incompatibleCountMatrix = se\$incompatibleCountMatrix), paste0("chunks/counts_",i,".rds"))

I would assume it should also get the subset of the incompatibleCountMatrix and not the full one? Like this:

saveRDS(list(countMatrix = se\$countMatrix[,chunk], incompatibleCountMatrix = se\$incompatibleCountMatrix[,chunk]), paste0("chunks/counts_",i,".rds"))

Thanks again! Luuk

andredsim commented 1 month ago

Yes, that is correct, I overlooked that. Thanks for spotting it!

andredsim commented 2 weeks ago

Hi Luuk, just wanted to check how you were going with this?

ljwharbers commented 2 weeks ago

Hi Andre,

Sorry for not updating!

Everything seemed to work fine; the chunking and processing of the chunks did not have any issues. However, the processing time for the large number of chunks is very long. My compute cluster has a limit of 7 days and it would easily go over that. I can resume the workflow without an issue, but in practice this would not be feasible for continuous processing of samples.

My apologies that I totally forgot to update you on this. However, if you have other suggestions or any (upcoming) updates, I am more than happy to test it out on my end and report back.

Best, Luuk

andredsim commented 5 days ago

Hi Luuk,

Great that it works, but I am not happy with the run time. To improve this we have pushed a new version (it might be a bit unstable) that should be able to produce single-cell/spot gene counts and unique counts a lot faster for you. It shouldn't need to chunk the quantData anymore, I hope. Just change the singularity container to lingminhao/bambusc:beta1.1. I would also recommend adding the --noEM argument. If you decide not to use this argument, it will try to cluster the cells based on gene counts and run the EM on that, which might still take some time depending on the data. But you probably want more control over the clustering yourself anyway.

Let me know if you have any questions and how it goes.

Kind Regards, Andre Sim

Here is an example on the test data:

nextflow run ./bambu-singlecell-spatial/main.nf \
  --reads $PWD/bambu-singlecell-spatial/examples/reads_chr9_1_1000000.fastq.gz \
  --genome $PWD/bambu-singlecell-spatial/examples/Homo_sapiens.GRCh38.dna_sm.primary_assembly_chr9_1_1000000.fa \
  --annotation $PWD/bambu-singlecell-spatial/examples/Homo_sapiens.GRCh38.91_chr9_1_1000000.gtf \
  --chemistry 10x5v2 \
  --ncore 1 --outdir output --noEM \
  -with-singularity lingminhao/bambusc:beta1.1