campbio / decontX

Methods for decontamination of single cell data
MIT License

decontPro fails for some samples with a strange error! #6

Closed: naila53 closed this issue 1 year ago

naila53 commented 1 year ago

Hi,

Thanks for the amazing tool! However, I've been having a hard time running it on all my samples. For most of them, it fails either at the beginning: [screenshot of the first error message]

or at the end, showing me this message: [screenshot of the second error message]

For the 3 samples where it worked, the denoised values make more sense and improve clustering and differential expression analysis, which is why I got excited to run it on all my samples. The samples vary in cell number (400-30k cells), and I'm running with default parameters as follows: [screenshot of the decontPro call]

But it failed for samples with 5k, 400, and 30k cells. I did think about varying the priors, since you mention in your tutorial that smaller samples might benefit from larger priors, but it failed for samples with very different total cell numbers. What do you recommend changing?

I would appreciate any insights! thanks!

yuan-yin-truly commented 1 year ago

Hi @naila53, thanks for trying out the tool!

I think the first error message is saying that some of the droplets have a library size of 0 (i.e., zero total counts). You can try removing the droplets with total counts == 0 and running again.
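A minimal sketch of that filtering step (variable names are illustrative; `counts` is assumed to be a protein-by-droplet count matrix with a matching `clusters` label vector):

```r
# Drop droplets whose library size (total counts) is zero.
keep <- colSums(counts) > 0
counts <- counts[, keep]
clusters <- clusters[keep]   # keep the cluster labels in sync
```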

The second error message looks fine up until bash killed the job. "MEDIAN ELBO CONVERGED" and "Drawing a sample of size 1000..." are both expected. Could you send more of your log so we can see why the job was killed midway?

The default value for delta_sd is 2e-5 and for background_sd is 2e-6. We used them on datasets of about 5k-10k droplets with panels of about 10-30 antibodies. To run with these default priors, just do `out <- decontPro(counts, clusters)`. We used 2e-4 and 2e-5 in our vignette because there we sampled 1k droplets from an original 10k-droplet dataset, so we relaxed the priors a bit. You are welcome to try both sets of priors and see how they work out for your dataset.
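For reference, both calls side by side (`delta_sd` and `background_sd` are the two prior arguments named above; `counts` and `clusters` are assumed to already exist):

```r
library(decontX)

# Default priors: delta_sd = 2e-5, background_sd = 2e-6
out_default <- decontPro(counts, clusters)

# Relaxed priors from the vignette (1k droplets subsampled from 10k)
out_relaxed <- decontPro(counts, clusters,
                         delta_sd = 2e-4,
                         background_sd = 2e-5)
```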

joshua-d-campbell commented 1 year ago

@yuan-yin-truly

> The default value for delta_sd is 2e-5 and for background_sd is 2e-6. We used them on datasets of about 5k-10k droplets with panels of about 10-30 antibodies. To run with these default priors, just do `out <- decontPro(counts, clusters)`. We used 2e-4 and 2e-5 in our vignette because there we sampled 1k droplets from an original 10k-droplet dataset, so we relaxed the priors a bit.

Can you clarify this in the vignette? Ideally users would be able to copy the code from the vignette directly without having to adjust it much. Otherwise we need to explain how to choose a good prior (or how to adjust it based on less-than-ideal results).

naila53 commented 1 year ago

@yuan-yin-truly I think I got the bash error because I was running the Rscript through a Snakemake-based pipeline in which I specified multiple available cores for running all the samples. When I reran the R script for the samples individually, it seemed to work fine; it completed successfully for one of the samples now. I don't know what parallelism and using Snakemake have to do with this error. The log doesn't say much beyond bash killing the job (what I posted above), so I don't know what other logs I can provide.

naila53 commented 1 year ago

@yuan-yin-truly I have one sample where I've tried running the tool 3 times, and each run takes 4 hours before eventually getting killed at the end... any ideas? [screenshot of the run log]

This is what the protein library size looks like for this sample, which has about 9k cells; the red line is at a log10 value of 2.5. I know this sample is not the best and looks very noisy, since most cells have log10(total protein counts) between 2 and 2.5. I did discard all cells with total counts below 100 (log10 of 2). [histogram of log10 protein library sizes]

yuan-yin-truly commented 1 year ago

@naila53 One possibility for a job being killed at the sampling completion stage is running out of memory. Did you run the samples on your personal laptop? Do you have access to any computing clusters? If you are affiliated with an educational institution, there usually is one, and it is worth asking around.

naila53 commented 1 year ago

@yuan-yin-truly Thanks! Yes, it was a memory issue; even though I had specified 128 GB for the session, it still wasn't enough for some datasets. Anyway, I have a different question regarding clustering: did you CLR-normalize the data after running decontPro on the proteins, or are the decontPro values enough for PCA and clustering? In the manuscript you mention "The clustering of datasets was generated using the Seurat package. Silhouette widths were calculated on datasets after CLR normalization and averaged by clusters," but it's not clear whether CLR was performed before or after clustering. Many thanks!

yuan-yin-truly commented 1 year ago

@naila53 decontPro returns decontaminated counts, so they are still on the original-counts scale. If you want to do PCA or clustering on the decontaminated counts, you may still want to scale them.

In our study, we generated cluster labels from the original counts using the Seurat clustering workflow, and CLR normalization was part of that workflow. The cluster labels were then used to decontaminate the data. When calculating silhouette widths before and after decontamination, we used the same cluster labels.
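For concreteness, here is a minimal sketch of that workflow, assuming a standard Seurat ADT clustering pass; the Seurat calls and the `decontaminated_counts` output slot follow the public vignettes, but treat the details (assay name, number of PCs, variable names) as illustrative rather than prescriptive:

```r
library(Seurat)
library(decontX)

# 1) Cluster on the ORIGINAL protein counts; CLR is part of this pass.
seu <- CreateSeuratObject(counts, assay = "ADT")
seu <- NormalizeData(seu, normalization.method = "CLR", margin = 2)
seu <- ScaleData(seu, features = rownames(seu))
npcs <- min(20, nrow(seu) - 1)   # small antibody panels limit the PC count
seu <- RunPCA(seu, features = rownames(seu), npcs = npcs)
seu <- FindNeighbors(seu, dims = 1:npcs)
seu <- FindClusters(seu)

# 2) Decontaminate the original counts with those cluster labels.
clusters <- as.integer(Idents(seu))
out <- decontPro(counts, clusters)
decont_counts <- out$decontaminated_counts

# 3) decontPro returns counts on the original scale, so re-normalize
#    (e.g., CLR again) before any downstream PCA or clustering.
```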

Hope this clarifies!