De-pooling pipeline

amadeovezz commented 10 months ago

What

This PR de-pools pooled single-cell GEX data.

Note: this pipeline was written in snakemake.

https://github.com/UCSF-DSCOLAB/data_processing_pipelines/issues/37

How to review?

Check out the code (a good place to start is the README.md)
Run the de-pooling pipeline on a small subset of data (run instructions are in the README.md)
Run the tests (test instructions are in the README.md)

TODO

[x] Install miniforge to /krummellab/data1/software/.
[x] Install snakemake using mamba.
[x] README.md
[x] R script for de-pooling seurat object
[x] Snakefile
[x] Run on slurm
[x] Sort out snakemake slurm configurations
[x] run_pipeline.py script
[x] Upgrade to snakemake 8.4.0
[x] Tests

Slurm Notes

Currently the standard resource configurations for slurm (memory, cpu, etc) appear to work with Snakemake > 8.

However, the non-standard resource definitions (passing auxiliary arguments to slurm, like exclude) have some bugs:

https://github.com/snakemake/snakemake-executor-plugin-slurm/issues?q=is%3Aissue+slurm_extra

The good news is that these are actively being looked into, so hopefully a fix is about a week away.

amadeovezz commented 9 months ago

It's probably not a bad idea to iterate on this pipeline with a couple of PRs.

I am thinking the central focus of this PR should be to:

Get an end to end snakemake pipeline working
Depool all .rds files

@dtm2451 @erflynn can you confirm the desired results below? @dtm2451 gave a pretty detailed breakdown on #37 but spelling out all the inputs and outputs with some concrete files has been helpful for my understanding:

So for .rds files we have:

Final demultiplexed rds objects with cell x gene matrices and metadata, e.g. ./automated_processing/TEST-POOL-DM1-SCG1_raw.rds.

The intention here is we would like to gather all the libraries per pool and then create new seurat objects that are partitioned by sample id (for demuxlet). We will save this to a new *depooled.rds file.

Doublet finder rds objects with cell x gene matrices and additional doublet finder related metadata, e.g. ./finding_doublets/TEST-POOL-DM1-SCG1_seurat_object_findingDoublets.rds. The cell x gene matrices and some columns (like nCount_RNA and nFeature_RNA) are identical to the .rds object above.

Should I combine this additional doublet finder metadata into the depooled objects?

Doublet finder rds objects with a bit of metadata, e.g. ./finding_doublets/TEST-POOL-DM1-SCG1_seurat_object_findingDoublets.rds.

Doesn't seem like too much is in this object. Should we ignore it?

erflynn commented 9 months ago

the doubletfinder output is used to produce the automated_processing/TEST-POOL-DM1-SCG1_raw.rds output -- so I'd just ignore that bit. I do think it would be useful to depool the _filtered.rds and _processed.rds in automated processing in addition to the raw.

dtm2451 commented 9 months ago

I agree with Emily. Not worth outputting ./finding_doublets/*depooled_seurat_object_findingDoublets.rds objects too as I don't think they'll be used for anything and the important bits from the findingDoublets objects are made available elsewhere. There's no attribute in the sc_seq/sc_seq_pool models for this object either.

amadeovezz commented 9 months ago

what is the best way to demultiplex a doublet?

for instance:

                         orig.ident nCount_RNA nFeature_RNA DROPLET.TYPE BEST.GUESS percent.mt percent.ribo
AGTCGCCTTTACCT-1 TEST-POOL-DM1-SCG1      26678         4830          DBL        1,0  2.0803658     30.72944

the BEST.GUESS here names both samples: 1,0.

I could duplicate this row so that there is an entry for both sample 1 and sample 0?

dtm2451 commented 9 months ago

Better to simply remove doublets so there's no out-of-sample cells to worry about if a downstream user is eventually only meant to have access to 4 out of 8 samples in a given pool.

There's little use for having the doublets retained in these depooled samples, so I suggest only keeping the singlets. You could even add "singlets_" to the .rds object names :shrug:.
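A minimal sketch of the singlet-only partitioning suggested above, written in plain Python rather than the pipeline's R script. It makes several assumptions that are not confirmed anywhere in this thread: that singlets carry the DROPLET.TYPE label "SNG" (by analogy with the "DBL" label in the example row), that a singlet's BEST.GUESS repeats a single sample id (such as "0,0"), and that the metadata is available as one record per barcode. The function names and every barcode except the one quoted above are hypothetical.

```python
from collections import defaultdict

# Assumption: "SNG" marks singlets, by analogy with the "DBL" label seen above.
SINGLET_LABEL = "SNG"


def sample_of(best_guess):
    """Return the one sample named by BEST.GUESS, or None if it names several."""
    samples = set(best_guess.split(","))
    return samples.pop() if len(samples) == 1 else None


def depool_singlets(cells):
    """Group singlet barcodes by sample id, dropping doublets and mixed guesses."""
    by_sample = defaultdict(list)
    for cell in cells:
        if cell["DROPLET.TYPE"] != SINGLET_LABEL:
            continue  # doublets (and anything else) are removed, not duplicated
        sample = sample_of(cell["BEST.GUESS"])
        if sample is not None:
            by_sample[sample].append(cell["barcode"])
    return dict(by_sample)


# The first record mirrors the doublet quoted above; the others are made up.
cells = [
    {"barcode": "AGTCGCCTTTACCT-1", "DROPLET.TYPE": "DBL", "BEST.GUESS": "1,0"},
    {"barcode": "HYPOTHETICAL-A-1", "DROPLET.TYPE": "SNG", "BEST.GUESS": "0,0"},
    {"barcode": "HYPOTHETICAL-B-1", "DROPLET.TYPE": "SNG", "BEST.GUESS": "1,1"},
    {"barcode": "HYPOTHETICAL-C-1", "DROPLET.TYPE": "SNG", "BEST.GUESS": "0,0"},
]

depooled = depool_singlets(cells)
```

Because the doublet never reaches any per-sample group, a user who is only granted access to some samples in a pool cannot receive cells that partly belong to another sample.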

dtm2451 commented 6 months ago

I was trying to just take away the 'requested changes' block without actually 'Approving' since I do plan to test, but oh well...

Don't feel like you need to wait if you are confident, but I will be able to try testing this on some real data in the next week or so!

amadeovezz commented 5 months ago

bit of a brain dump here:

I was trying to get # slurm_extra: "exclude c4-n20" working and also use the conda distribution provided by the environment modules on c4.

I was able to successfully install snakemake and its dependencies with the conda distribution, and resolved the slurm_extra problem, but for some reason the --singularity-args flag for snakemake doesn't seem to work anymore?

The error is:

__main__.py: error: argument --apptainer-args/--singularity-args: expected one argument

It comes from the command:

snakemake --profile /c4/home/amazzara/data_processing_pipelines/single_cell_RNAseq_snake/profiles/generic --snakefile pipelines/depooling/Snakefile --use-singularity --singularity-args "--bind=/krummellab/data1/amazzara/tutorial_lib_sep/data/single_cell_GEX/processed/"

I've made many attempts at figuring out why this is happening, but am quite stumped at the moment. Will circle back a bit later, as a snakemake environment will be needed for some other projects as well.
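For reference, the wording of that error matches what Python's argparse prints when an option's value itself looks like an option. Whether that is what happens here is an unverified assumption: the sketch below only reproduces the message with a stand-in parser, not with snakemake or the slurm executor. The class and function names are hypothetical, and only the two flag names are taken from the error text.

```python
import argparse


class RaisingParser(argparse.ArgumentParser):
    """ArgumentParser that raises instead of exiting, so failures can be inspected."""

    def error(self, message):
        raise ValueError(message)


def try_parse(argv):
    """Return (value, None) on success or (None, error message) on failure."""
    parser = RaisingParser(prog="demo")
    # Stand-in for the real option; only the flag names come from the error above.
    parser.add_argument(
        "--apptainer-args", "--singularity-args", dest="apptainer_args", default=""
    )
    try:
        return parser.parse_args(argv).apptainer_args, None
    except ValueError as err:
        return None, str(err)


# Value passed as a separate token that starts with "--" and has no spaces:
# argparse treats it as another option, so the flag is left without a value.
separate = try_parse(["--singularity-args", "--bind=/data"])

# Value attached with "=": argparse takes everything after the first "=".
attached = try_parse(["--singularity-args=--bind=/data"])
```

If something similar is going on when the slurm profile re-invokes snakemake for each job, passing the value in the attached `--singularity-args=...` form might be worth a try, but that is a guess rather than a confirmed fix.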

erflynn commented 4 months ago

oof looks frustrating! these settings are always a pain to figure out, thanks for looking into

amadeovezz commented 4 months ago

more info:

it appears that the bug lies in the combination of slurm + singularity: if I invoke the pipeline without the slurm profile, everything seems to work fine. To investigate further...

UCSF-DSCOLAB / data_processing_pipelines

De-pooling pipeline #53
