broadinstitute / Drop-seq

Java tools for analyzing Drop-seq data
MIT License

Dropulation doublet rate very high #339

Open terencewtli opened 1 year ago

terencewtli commented 1 year ago

Hi,

Thanks for providing this tool! We have four individuals multiplexed in a 10X Multiome fibroblast to stem cell reprogramming experiment. I'm writing because I have some strange demultiplexing results - almost all of the doublet p-values are >= 0.9 for both ATAC and RNA. I used the parameters as recommended in this discussion (namely MAX_ERROR_RATE = 0.05): https://github.com/broadinstitute/Drop-seq/issues/321. I've attached the parameters for RNA/ATAC below.

drop_params.txt

In the previous discussion, you mentioned that doublets are the cells where doublet_pval >= 0.9. However, using that threshold, we get ~98% of the cells called as doublets. Even using the doublet_pval threshold as 1.0, we see a high doublet proportion:

image

I was wondering if you had ideas regarding this behavior, perhaps further tuning MAX_ERROR_RATE? Let me know if you'd like me to provide additional details, such as the full output of dropulation.
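The thresholding described above can be sketched in a few lines. This is a minimal sketch, assuming the DetectDoublets output is a tab-separated table with a `doublet_pval` column (the barcodes and values below are made up for illustration):

```python
import io
import pandas as pd

# Hypothetical excerpt of a DetectDoublets output table; only the
# doublet_pval column is assumed here -- the real output has more columns.
tsv = """cell_barcode\tdoublet_pval
AAAC\t0.95
AAAG\t1.0
AAAT\t0.02
AACA\t0.88
"""
df = pd.read_csv(io.StringIO(tsv), sep="\t")

# Fraction of cells called doublets at each candidate threshold.
for threshold in (0.9, 1.0):
    frac = (df["doublet_pval"] >= threshold).mean()
    print(f"doublet fraction at doublet_pval >= {threshold}: {frac:.2f}")
```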

jamesnemesh commented 1 year ago

Hi,

I have not seen that sort of behavior. I have seen cases where the cell barcodes captured had very few informative UMIs, and doublet detection (and donor assignment) had a hard time figuring out what was going on. For example, if you try to run donor assignment on “empty” cell barcodes (where only cell free RNA is captured) performance of those cells is pretty bad.

Is that the result of one cell capture experiment? I ask because that’s a very large number of cells to have collected within a single run. If so, I would look at cell selection - you wouldn’t expect to capture that many cells in a single experiment.

How many informative UMIs do you have per cell for both AssignCellsToSamples and DetectDoublets? It would be good to have 200-500 informative UMIs per cell. If your number of informative UMIs is low (the programs output that as a column), then there may be problems with the sequencing depth (or other library prep) of your data, or your VCF may be too sparse. I would try plotting histograms of the number of informative UMIs for each program’s output. You might also create scatter plots with the total number of UMIs for a cell on one axis and the number of informative UMIs on the other. You should see something like 20-50% of the UMIs for each cell being informative.
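The per-cell diagnostics above can be sketched with pandas. This is an illustrative sketch only: the column names (`total_umis`, `informative_umis`) and the values are assumptions, not the tools' exact output headers.

```python
import io
import pandas as pd

# Hypothetical per-cell table: total UMIs (e.g. from a DGE summary)
# joined with the informative-UMI count reported per cell barcode.
tsv = """cell_barcode\ttotal_umis\tinformative_umis
AAAC\t5000\t1500
AAAG\t4000\t900
AAAT\t300\t10
AACA\t6000\t2400
"""
cells = pd.read_csv(io.StringIO(tsv), sep="\t")
cells["informative_frac"] = cells["informative_umis"] / cells["total_umis"]

# A healthy library should show roughly 20-50% informative UMIs per cell;
# cells far below that (like AAAT here) are suspect.
print(cells["informative_frac"].describe())
print("cells under 20% informative:",
      (cells["informative_frac"] < 0.20).sum())
```

With real data, `cells.plot.scatter(x="total_umis", y="informative_umis")` and `cells["informative_umis"].hist()` give the scatter plot and histogram described above.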

How many donors are in your VCF? Is your VCF from array data, or WGS? Did you attempt to demultiplex with all donors or a restricted list? When I run AssignCellsToSamples, I like to have the program consider many donors as possible assignments and be “blind” to which donors are actually in the pool, so I can validate that the donors assigned are the correct ones. If things are working, AssignCellsToSamples should be able to tell you which donors are in your pool without any “hints”, and do so before doublet detection.

Hopefully a few diagnostic plots will help us get a little closer to what’s going on here.

-Jim


terencewtli commented 1 year ago

Thanks for the fast response!

These results are coming from three different runs (across different timepoints in the experiment) - here are the cell numbers from CellRanger:

Timepoint 1: 9487
Timepoint 2: 21010
Timepoint 3: 37402

There are some issues with the mapping for timepoints 2/3: the coverage per cell is much lower than in timepoint 1. I think that's what's driving the doublet detection behavior. Here are the plots for the number of informative UMIs and total vs. informative UMIs - the top left panel is across all ~70k cells from all three timepoints, and the following panels are split by timepoint:

image image

We have four individuals multiplexed in each timepoint, and the donor VCF I'm giving as input only includes those four individuals. It's array data, and we imputed SNPs with the TopMed server.

If I'm remembering correctly, we originally expected around 12,000-15,000 cells, but CellRanger ends up calling a lot more for timepoints 2/3 - would it be informative to limit the number of total cells that it can detect?

jamesnemesh commented 1 year ago

Hi!

Those are some helpful plots! To give you an idea of what I would normally expect from a decent data set, I’m sharing a few of our standard plots:

As you can see, we have a fairly large number of informative UMIs for each cell, and the doublet p-value distribution is nicely bimodal - you could set the threshold almost anywhere between 0.1 and 0.9 and get the same result. A large proportion of the cells have the maximum possible score (up to floating-point precision), and few cells have FDRs > 0.05. This output is from a random experiment and is fairly typical.

If you restricted your analysis to the set of cell barcodes that have at least 500 informative UMIs, does the data behave a bit better?

When you ran imputation, did you filter out low-quality sites? I would filter to an R2 score >= 0.9. We’ve found in practice that this works quite well - the genotype data for the experiment above was generated with an array + imputation + filtering to high-quality sites.
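Both filters suggested above can be sketched briefly. The thresholds (500 informative UMIs, imputation R2 >= 0.9) come from this discussion; the column name and the `R2` INFO key are assumptions about your particular files (TOPMed imputation output typically carries an `R2` INFO field, but check your VCF header):

```python
import io
import pandas as pd

# 1) Keep only cell barcodes with at least 500 informative UMIs
#    (hypothetical table and column names).
cells = pd.read_csv(io.StringIO(
    "cell_barcode\tinformative_umis\nAAAC\t1200\nAAAG\t80\nAAAT\t650\n"
), sep="\t")
kept = cells[cells["informative_umis"] >= 500]
print(kept["cell_barcode"].tolist())

# 2) Keep only imputed VCF sites with R2 >= 0.9, parsing the INFO
#    column (field 8) of a VCF data line.
def site_passes(vcf_line: str, min_r2: float = 0.9) -> bool:
    info = vcf_line.rstrip("\n").split("\t")[7]
    fields = dict(kv.split("=", 1) for kv in info.split(";") if "=" in kv)
    return float(fields.get("R2", 0.0)) >= min_r2

line = "chr1\t12345\t.\tA\tG\t.\tPASS\tAF=0.31;R2=0.97\n"
print(site_passes(line))
```

In practice a tool like bcftools can apply the site filter to a whole VCF; the snippet just shows the logic.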

-Jim


Angel-Wei commented 8 months ago

Hi @jamesnemesh and @terencewtli ,

Thank you for your input on this post. I wonder if @terencewtli has any updates or a solution to this problem, as I'm seeing very similar behavior. Take one of my multiplexed samples as an example (sequenced on a NovaSeq X, ~25K cells after initial QC filtering when creating a Seurat object): the cells in this sample come from 4 individuals from 4 cell lines. Here's a breakdown of my procedure and how I used the AssignCellsToSamples and DetectDoublets pipelines:

Any suggestions/inputs on these observations are much appreciated! Thank you so much!

jamesnemesh commented 8 months ago

SAMPLE_FILE was used to restrict the donor lists.

We suggest running AssignCellsToSamples without this argument. The program should be able to figure out who is in your pool without hints. Further, if you made a mistake in your donor list and provided the wrong set of donors, the algorithm would be restricted to the wrong answers and would try to maximize them anyway, resulting in miserable FDR p-values.

You should use SAMPLE_FILE for DetectDoublets, to force the program to consider those donors as possible contributors to doublets in addition to any donors discovered by AssignCellsToSamples. This covers an odd edge case where a donor is never the most likely call for any cell but is somehow the smaller cell of a doublet - you'd have to invoke a strange situation of cells with very different sizes for this to happen, but better safe than sorry. For the most part the programs infer this information, and you'll trust the calls more knowing you didn't have to tell the program which donors to expect in your pool.

Your number of UMIs (nCount_RNA) seems reasonable, but it translates to very few informative UMIs in the second panel (num_umis). The algorithm is starved for data for most cells, so you're not getting very good results. We typically see on the order of 20-40% of the total UMIs per cell overlapping transcribed SNPs, but your conversion looks far lower.

Is your VCF from whole genome sequencing? If not, is it a SNP array where you've also performed imputation? Imputation is a great way to boost the number of sites you can interrogate, but it's critical to filter to high-quality sites (INFO>=0.8). Is this nuclei or whole-cell data? Have you enabled intronic reads via LOCUS_FUNCTION_LIST=INTRONIC? That can have a dramatic impact on the number of usable UMIs when most UMIs are intronic. Cell Ranger by default uses the equivalent of that option to calculate expression, which might account for some of the difference if you aren't using it - you might be losing as much as 70-80% of your UMIs.
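As a sketch, enabling intronic reads might look like the invocation below. LOCUS_FUNCTION_LIST is the parameter named above; the file paths, output names, and the idea of repeating the flag to also keep coding/UTR loci are illustrative assumptions, so check the tool's own help output before running:

```python
# Build the command line for an AssignCellsToSamples run that also counts
# intronic reads. Repeating LOCUS_FUNCTION_LIST passes multiple values in
# the Picard-style ARG=value convention these tools use.
cmd = [
    "AssignCellsToSamples",
    "INPUT_BAM=possorted_bam.bam",          # placeholder path
    "VCF=donors.imputed.filtered.vcf.gz",   # placeholder path
    "OUTPUT=assignments.txt",
    "LOCUS_FUNCTION_LIST=CODING",
    "LOCUS_FUNCTION_LIST=UTR",
    "LOCUS_FUNCTION_LIST=INTRONIC",
]
print(" ".join(cmd))
# To actually execute: subprocess.run(cmd, check=True)
```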

To set your expectations, here's the UMI yield from a random experiment (comparable to your nCount_RNA panel):

image

And the number of informative UMIs (should be the same as your num_umis panel):

image

Angel-Wei commented 8 months ago

Hi James,

Thank you so much for your input! While I'm testing some changes and checking VCFs, I have a quick question about the --VCF_OUTPUT flag in the AssignCellsToSamples pipeline. Initially, I used the same input VCF file for both AssignCellsToSamples and DetectDoublets. However, I noticed from drneavin's commands in #321 that AssignCellsToSamples can write out another VCF file when it completes, and that DetectDoublets then uses the file from --VCF_OUTPUT. I tried incorporating this in a test run; I'm still seeing many doublets, but the number of singlet calls increased a bit. Is the --VCF_OUTPUT flag mandatory for DetectDoublets, i.e., can the two pipelines not use the same VCF file?

jamesnemesh commented 8 months ago

AssignCellsToSamples streams the VCF file and does not store all genotypes in memory. It looks through both the BAM file and the VCF, and emits to the output VCF only variants that are transcribed and donors that are relevant to the pool.

DetectDoublets needs to optimize over all genotypes, so it stores the genotype information in memory. Using the output VCF from AssignCellsToSamples as the input to DetectDoublets will save you memory and computation time.

Technically, using either the 'raw' or the processed VCF as input to DetectDoublets should give the same answer, but I haven't tested that at any length in recent years - at larger cell and pool sizes the memory becomes prohibitively expensive. All of my validation work on the pipelines assumes that you funnel the outputs from Assign into Detect.
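The funneling described above amounts to reusing the Assign step's output VCF as the Detect step's input. A sketch, where VCF_OUTPUT, VCF, SAMPLE_FILE, INPUT_BAM, and OUTPUT are the parameters discussed in this thread and all file names are placeholders:

```python
# Step 1: assignment writes a reduced VCF (transcribed sites, pool donors
# only) via VCF_OUTPUT.
assign_cmd = [
    "AssignCellsToSamples",
    "INPUT_BAM=possorted_bam.bam",
    "VCF=donors.vcf.gz",
    "OUTPUT=assignments.txt",
    "VCF_OUTPUT=assignments.vcf.gz",
]

# Step 2: doublet detection reuses the smaller VCF, with SAMPLE_FILE
# listing the pool donors as suggested earlier in the thread.
detect_cmd = [
    "DetectDoublets",
    "INPUT_BAM=possorted_bam.bam",
    "VCF=assignments.vcf.gz",
    "SAMPLE_FILE=pool_donors.txt",
    "OUTPUT=doublets.txt",
]

print(" ".join(assign_cmd))
print(" ".join(detect_cmd))
# To actually execute: subprocess.run(assign_cmd, check=True), then
# subprocess.run(detect_cmd, check=True)
```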

Angel-Wei commented 8 months ago

Thank you for the information! I opted out of --VCF_OUTPUT in my initial attempts, assuming it wouldn't affect the results, but I can certainly add it back to save computing time and memory in the downstream pipeline! Thanks again!