chris-mcginnis-ucsf / DoubletFinder

R package for detecting doublets in single-cell RNA sequencing data

Does DoubletFinder generate different results on different runs of find.pK()? #117

Closed by camelest 3 years ago

camelest commented 3 years ago

Hi, first of all, thank you so much for maintaining this wonderful tool.

I have a question regarding the reproducibility of the optimal pK identification.

I have a single-cell dataset of 9,500 cells, and I followed your tutorial. When I ran DoubletFinder, it gave me the curve in the first figure. Two pK peaks did not seem typical to me, so I repeated the analysis to confirm and got the second figure.

[Image: Figure 1]

I didn't change the Seurat object itself, so I am wondering whether DoubletFinder can give different pK peaks on different runs.

My questions are:

  1. Is it due to the random creation of artificial doublets, or am I missing something?
  2. If so, can we somehow fix the random seed to ensure reproducibility of the analysis?
  3. In this particular case, where the maximum switches between 2 peaks, which would you choose as the optimal pK for downstream analysis?
  4. Is there any other step besides find.pK that a random seed could affect? Do we always get the same final result with the same pK, pN and expected doublet rate?

Thank you so much in advance for your help.

Best

f6v commented 3 years ago

I've got the same question. This results in different doublets:

[Two screenshots: Screen Shot 2021-10-14 at 11 36 25, Screen Shot 2021-10-14 at 11 39 05]

chris-mcginnis-ucsf commented 3 years ago

Hi @camelest and @f6v -- thanks for reaching out.

Yes, each run of DoubletFinder will be slightly distinct due to the randomness of artificial doublet generation and downstream neighborhood detection. When I see bimodal BCmvn distributions, I usually interrogate each candidate pK and use my knowledge of the dataset to choose the correct one. I'll note from your plots above that while the amplitude of the peaks differs between the runs, the actual locations of the peaks are the same (i.e., 0.09 and 0.26). So I would run DoubletFinder with each of these two pK values and then look deeply into the data to choose the right one (it should be somewhat obvious, e.g., if one of the pK values results in a lot of 'known' singlets being called as doublets).
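One way to make that comparison concrete is sketched below in base R. It is only an illustration: the call vectors are simulated, and `known_singlet` is a hypothetical flag for cells you already trust to be singlets. In a real analysis the two call vectors would presumably come from the `DF.classifications_*` metadata columns that DoubletFinder adds for each run.

```r
# Toy comparison of doublet calls obtained with two candidate pK values.
# All data here are simulated; nothing below calls DoubletFinder itself.
set.seed(7)
n_cells <- 200
lv <- c("Singlet", "Doublet")

# Simulated per-cell calls for pK = 0.09 and pK = 0.26 (assumption: stand-ins
# for the classifications produced by two separate DoubletFinder runs).
calls_pk009 <- factor(sample(lv, n_cells, replace = TRUE, prob = c(0.9, 0.1)), levels = lv)
calls_pk026 <- factor(sample(lv, n_cells, replace = TRUE, prob = c(0.8, 0.2)), levels = lv)

# Hypothetical prior knowledge: the first 50 cells are trusted singlets.
known_singlet <- rep(c(TRUE, FALSE), times = c(50, n_cells - 50))

# How often do the two pK values agree on each cell?
agreement <- table(pK_0.09 = calls_pk009, pK_0.26 = calls_pk026)

# Fraction of trusted singlets that each pK value flags as doublets.
frac_flagged <- c(
  pK_0.09 = mean(calls_pk009[known_singlet] == "Doublet"),
  pK_0.26 = mean(calls_pk026[known_singlet] == "Doublet")
)

print(agreement)
print(frac_flagged)
```

A pK value that flags a clearly larger share of trusted singlets would be the less plausible choice under this heuristic.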

Chris

camelest commented 3 years ago

Hi, Chris @chris-mcginnis-ucsf

Thank you so much for your reply. I just want to confirm one thing: I understand that find.pK involves some randomness. Do we get identical doublet calls if we use the same pK (and pN and expected doublet rate) as input, or is there randomness in this step as well? Thank you for your kind help.
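A question like this can be checked empirically by running the same stochastic step twice, with and without a fixed seed. The base R sketch below is a toy stand-in and not DoubletFinder's own code: `make_artificial_doublets` is a hypothetical helper that averages randomly sampled pairs of cells, and the count matrix is simulated.

```r
# Toy stand-in for artificial doublet generation (hypothetical helper,
# not the package's implementation): average random pairs of real cells.
make_artificial_doublets <- function(counts, pN = 0.25) {
  n_real <- ncol(counts)
  # Number of artificial doublets so that they make up a fraction pN of all cells.
  n_doublets <- round(n_real / (1 - pN) - n_real)
  cells1 <- sample(n_real, n_doublets, replace = TRUE)
  cells2 <- sample(n_real, n_doublets, replace = TRUE)
  (counts[, cells1, drop = FALSE] + counts[, cells2, drop = FALSE]) / 2
}

# Simulated counts: 20 genes x 100 cells.
set.seed(1)
counts <- matrix(rpois(20 * 100, lambda = 3), nrow = 20)

# Same seed before each run: the sampled pairs are the same.
set.seed(42)
run_a <- make_artificial_doublets(counts)
set.seed(42)
run_b <- make_artificial_doublets(counts)

# No reseeding: the RNG stream has moved on, so different pairs are drawn.
run_c <- make_artificial_doublets(counts)

same_with_seed <- identical(run_a, run_b)
same_without_seed <- identical(run_a, run_c)
print(c(with_seed = same_with_seed, without_seed = same_without_seed))
```

In a real analysis the analogous test would be to call `set.seed()` immediately before the DoubletFinder functions, run the pipeline twice with the same pK, pN and expected doublet number, and compare the resulting classifications. Whether a fixed seed removes all run-to-run variation in the full pipeline is not confirmed in this thread.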