broadinstitute / Drop-seq

Java tools for analyzing Drop-seq data
MIT License

Dropulation high doublet call rate #321

Open drneavin opened 2 years ago

drneavin commented 2 years ago

Hello,

I'm writing because I recently trialed Dropulation on a couple of datasets, found higher doublet rate estimates than anticipated, and want to know whether some of the parameters I'm using might be incorrect or whether additional parameters should be set. I have run Dropulation on two different datasets:

- 74 pools of PBMCs with ~16 individuals per pool and ~20,000 cells captured per pool
- 11 pools of fibroblasts with ~8 individuals per pool and ~10,000 cells captured per pool

I find the doublet estimate to be ~60-70%, but it should be between 5 and 20%. I have also run other demultiplexing methods and find the doublet rate to be within the expected range. The commands I'm currently running are:

AssignCellsToSamples --INPUT_BAM tagged_bam.bam --VCF Merged_MAF0.01.dose_GeneFiltered_hg38_nochr_NoAdditionalchr.vcf --OUTPUT assignments.tsv.gz --VCF_OUTPUT out_vcf.vcf --CELL_BARCODE_TAG CB --MOLECULAR_BARCODE_TAG UB --CELL_BC_FILE barcodes.tsv --SAMPLE_FILE Individuals.txt --FUNCTION_TAG XF --EDIT_DISTANCE 1 --READ_MQ 10 --GQ_THRESHOLD 30 --RETAIN_MONOMORPIC_SNPS false --FRACTION_SAMPLES_PASSING 0.5 --IGNORED_CHROMOSOMES X --IGNORED_CHROMOSOMES Y --IGNORED_CHROMOSOMES MT --ADD_MISSING_VALUES true --DNA_MODE false --SNP_LOG_RATE 1000 --GENE_NAME_TAG gn --GENE_STRAND_TAG gs --GENE_FUNCTION_TAG gf --STRAND_STRATEGY SENSE --LOCUS_FUNCTION_LIST CODING --LOCUS_FUNCTION_LIST UTR --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 5 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false

and

DetectDoublets --INPUT_BAM tagged_bam.bam --VCF out_vcf.vcf --SINGLE_DONOR_LIKELIHOOD_FILE assignments.tsv.gz --OUTPUT likelihoods.tsv.gz --CELL_BARCODE_TAG CB --MOLECULAR_BARCODE_TAG UB --CELL_BC_FILE barcodes.tsv --SAMPLE_FILE Individuals.txt --EDIT_DISTANCE 1 --READ_MQ 10 --GQ_THRESHOLD 30 --FRACTION_SAMPLES_PASSING 0.5 --IGNORED_CHROMOSOMES X --IGNORED_CHROMOSOMES Y --IGNORED_CHROMOSOMES MT --FORCED_RATIO 0.8 --TEST_UNEXPECTED_DONORS true --SCALE_LIKELIHOODS true --DNA_MODE false --GENE_NAME_TAG gn --GENE_STRAND_TAG gs --GENE_FUNCTION_TAG gf --STRAND_STRATEGY SENSE --LOCUS_FUNCTION_LIST CODING --LOCUS_FUNCTION_LIST UTR --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 5 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false

I then take the bestSample column from the DetectDoublets output as the per-droplet donor calls. Please let me know if you see anything in these commands that could be optimized, or if the way I make the final calls should be altered.

Thanks so much for your help!

jamesnemesh commented 2 years ago

Hi!

I would consider cells where doublet_pval >= 0.9 to be doublets.

Additionally, you might try setting MAX_ERROR_RATE to a value like 0.05. We've found that when there's a fair amount of cell-free RNA, many cells get assigned as doublets because of the capture of that RNA from other donors. If you abhor an arbitrary threshold and use CellBender (https://github.com/broadinstitute/CellBender), you can instead feed in per-cell allele frequencies and estimates of the fraction of cell-free RNA captured by each cell. Those options are CELL_CONTAMINATION_ESTIMATE_FILE and ALLELE_FREQUENCY_ESTIMATE_FILE.

As you’ve correctly identified, without some additional info DetectDoublets is overly sensitive to cell-free RNA and overcalls doublets.
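For concreteness, a sketch of the rerun (only the added flag is shown; every other argument stays as in your DetectDoublets command above, and 0.05 is just a suggested starting point, not a tuned value):

```shell
# Sketch: identical to the DetectDoublets command above, with one
# added argument. 0.05 is a starting value to experiment with.
DetectDoublets \
    --INPUT_BAM tagged_bam.bam \
    --VCF out_vcf.vcf \
    --SINGLE_DONOR_LIKELIHOOD_FILE assignments.tsv.gz \
    --OUTPUT likelihoods.tsv.gz \
    --MAX_ERROR_RATE 0.05 \
    ...  # all remaining arguments unchanged
```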

-Jim


jamesnemesh commented 2 years ago

I should include some lame R code here to give you a sense of how we generate a mapping from cell barcode to donor ID. Someday I should actually document this whole process with a cookbook as I've done for some of our other tools.

createDonorMap<-function (singletFile, doubletFile, outMapFile, fdrThreshold=0.05, doubletPvalueThreshold=0.9) {
    # Per-cell single-donor assignments (AssignCellsToSamples output)
    a=read.table(singletFile, header=T, stringsAsFactors = F, sep="\t")
    # Per-cell doublet likelihoods (DetectDoublets output)
    b=read.table(doubletFile, header=T, stringsAsFactors = F, sep="\t")

    # Cells with a confident single-donor assignment
    confidentAssignmentCells=a[a$FDR_pvalue<=fdrThreshold,"cell"]
    # Cells the doublet model does NOT call as doublets
    singletCells=b[b$doublet_pval<doubletPvalueThreshold, "cell"]
    confidentAssignmentSingletCells=intersect(confidentAssignmentCells, singletCells)
    # Map each retained cell barcode to its best donor
    map=a[match(confidentAssignmentSingletCells, a$cell), c("cell", "bestSample")]
    write.table(map, outMapFile, row.names=F, col.names = T, quote=F, sep="\t")
}
drneavin commented 2 years ago

Great, thank you @jamesnemesh! That code will help and I'll try changing the MAX_ERROR_RATE as well.

For what it's worth, I don't think that these pools have unusually high ambient RNA (< 10%). Would that be enough to impact the doublet calling?

Also, I noticed that the default ratio for doublet proportions is 0.8:0.2. I was curious about the reason for this ratio as opposed to 0.5:0.5 and am wondering if altering this would have a large impact?

jamesnemesh commented 2 years ago

Hi,

The way we look for doublets is to optimize a mixture that maximizes the likelihood of the data, and we cap the ratios at 4:1 (80/20). The reason for the cap is that if the ratio were allowed to go all the way up to (but not including) 1, most cells have enough errors (PCR, cell-free RNA, genotyping) that doublet detection would optimize those cells as a mixture of mostly the correct donor and a very small fraction of some random donor that best matched up with the errors observed. The 0.8:0.2 cap is a tradeoff for how selective we want to be.

In real-world data, the size of cells can easily vary by an order of magnitude in UMIs, so it's pretty plausible to have cells that are not 50/50 mixtures. If you take a look at the first figure of the supplemental, we examine our sensitivity to detect doublets at different proportions. Using in-silico mixed data, the 50/50 doublets are super easy to detect, and you need more data to detect the more skewed doublets. If another algorithm you're looking at is tuned to only detect 50/50 doublets, then no doubt you'll detect fewer doublets, but I'd bet you'll miss at least some of the true events. You could certainly try setting FORCED_RATIO=0.5 and see how that works (I haven't tested that, because of the real-world observations). As for the <10% ambient RNA: I think my particular implementation gets pretty touchy at more than a few percent ambient RNA, which is why MAX_ERROR_RATE is there - it was a "cheap" fix that got our doublet rates from clearly being over-called to being close to the expected numbers.

I should mention that there are two patterns of doublets you can detect. One is doublets where both the doublet_pval and best_pair_pvalue are > 0.9. We refer to those internally as "confident doublets": cells where the model thinks there are alleles that can't be explained by the primary donor, AND it is highly confident in who the 2nd donor is.

There are also cells where the error rate may be higher, but the model can't decide on a particular donor as the 2nd member of the pair - many donors give similar results. These are the cases where doublet_pval>=0.9 and best_pair_pvalue<0.9.

We think the "confident" doublets are the estimate of the doublet fraction that matches up best with loading. In most experiments the non-confident-2nd-donor doublets are a much smaller fraction, and by default we usually toss them to be conservative. However, in some data we've looked at recently (multiome data) the error rates were quite a bit higher, and we found we could retain those non-confident-2nd-donor doublets in the data set. This was particularly useful for eQTL analysis, where the additional data in the non-confident doublets improved our ability to detect eQTLs above any additional variance we added to the data - for a different task you might want to be more conservative.
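To make the two patterns concrete, here is a toy filter over a fabricated likelihoods table. The column names doublet_pval and best_pair_pvalue are the ones discussed above, but the barcodes and values are invented, and the real output is tab-separated (commas are used here only for readability):

```shell
# Fabricated example of the per-cell doublet columns discussed above
# (values invented; real DetectDoublets output is tab-separated).
cat > likelihoods_toy.csv <<'EOF'
cell,doublet_pval,best_pair_pvalue
AAAC,0.95,0.95
AAAG,0.95,0.50
AAAT,0.10,0.10
EOF

# "Confident" doublets: high doublet p-value AND a clear 2nd donor
awk -F',' 'NR > 1 && $2 >= 0.9 && $3 >= 0.9 {print $1}' likelihoods_toy.csv   # AAAC
# Non-confident doublets: high doublet p-value, no clear 2nd donor
awk -F',' 'NR > 1 && $2 >= 0.9 && $3 < 0.9 {print $1}' likelihoods_toy.csv    # AAAG
```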

It sounds like you’re doing a bake-off of a few existing algos. You might look at GenerateSyntheticDoublets, which helps you generate BAMs with in-silico mixed doublets - you can select cells from real experiments that you are confident are singlets, then mix them in different proportions and see if your algo of choice can detect them at the given mixture.

In any case, I’m interested to hear how the results go! Let me know if there’s anything else I can help you with.

-Jim


drneavin commented 2 years ago

Hi @jamesnemesh,

I'm happy to report that changing the MAX_ERROR_RATE to 0.05 had a dramatic impact on calling doublets! It's still slightly higher (1-2%) than some of the other methods I've tried but it's now well within the expected range (and much speedier than most other methods).

You're right, I'm comparing the different demultiplexing and doublet detecting methods - let me know if there are any additional parameters that you think I should test for dropulation.

Thanks for pointing out GenerateSyntheticDoublets; I'm definitely going to try it out. Looking at the parameters, I wondered how it deals with barcodes that might appear in different BAM files when supplying a list of BAMs and a list of barcodes to include. How does it know which BAM file to search for a given barcode - or does it just survey all of them? And what happens if a barcode is in two of the BAMs?

Thanks again for all your help!

jamesnemesh commented 2 years ago

Hi,

GenerateSyntheticDoublets is not that smart, so it's not going to deal well with barcode collisions. I believe what I did was pre-select barcodes from a few BAM files that didn't have collisions (for example, the top 5000 largest cells from 5 different BAMs), filter each BAM down to that subset of cells via FilterBamByTag, merge them into a single BAM to test, and use GenerateSyntheticDoublets to generate an additional doublets BAM file that could then be merged into the singlets BAM.
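As a rough sketch of that workflow - the FilterBamByTag option names here are from memory and the pool/barcode file names are hypothetical, so check each tool's usage output before running:

```shell
# 1) Keep only the pre-selected, collision-free barcodes in each pool
#    (TAG / TAG_VALUES_FILE option names are assumptions - verify with the
#    tool's own usage output)
FilterBamByTag INPUT=pool1.bam OUTPUT=pool1.subset.bam \
    TAG=CB TAG_VALUES_FILE=pool1_barcodes.txt
FilterBamByTag INPUT=pool2.bam OUTPUT=pool2.subset.bam \
    TAG=CB TAG_VALUES_FILE=pool2_barcodes.txt

# 2) Merge the filtered pools into a single singlets BAM (Picard)
java -jar picard.jar MergeSamFiles \
    I=pool1.subset.bam I=pool2.subset.bam O=singlets.bam

# 3) Generate an in-silico doublets BAM from those singlets, then merge
#    it back with singlets.bam for testing
java -jar dropseq.jar GenerateSyntheticDoublets  # run bare to print USAGE
```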

On the speed front: because we treat each cell as an independent observation, you can run the donor assignment programs on subsets of the data. SplitBamByCell will take a BAM file and split it into a number of chunks, each containing the complete set of reads for its subset of cell barcodes. By splitting early in the process, you can run many chunks of data in parallel, which really speeds things up - we frequently use 2 Gb chunks in our workflows.

That program's documentation is a little hard to parse the first time, so here's an example:

SplitBamByCell I=input.bam TARGET_BAM_SIZE=2G OUTPUT_LIST=output.bam_list SPLIT_TAG=CB OUTPUT=output.__SPLITNUM__.bam OVERWRITE_EXISTING=true

The other thing that can help is to use a BCF instead of a VCF file. Because I'm using the HTSJDK implementation of a BCF reader, you need to convert with Picard's VcfFormatConverter. That can net a significant speedup reading the genotype data, which winds up being the bottleneck once you have smaller BAM files.
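Putting those two suggestions together, a sketch (file names are hypothetical; the SplitBamByCell arguments are the ones shown above, and VcfFormatConverter's INPUT/OUTPUT/REQUIRE_INDEX arguments should be checked against your Picard version):

```shell
# One-time conversion of the genotypes to BCF for faster reading
java -jar picard.jar VcfFormatConverter INPUT=genotypes.vcf \
    OUTPUT=genotypes.bcf REQUIRE_INDEX=false

# Split the BAM into ~2 Gb per-cell-barcode chunks
SplitBamByCell I=input.bam TARGET_BAM_SIZE=2G OUTPUT_LIST=output.bam_list \
    SPLIT_TAG=CB OUTPUT=output.__SPLITNUM__.bam OVERWRITE_EXISTING=true

# Run donor assignment on every chunk in parallel (remaining
# AssignCellsToSamples arguments as in the commands earlier in this thread)
while read -r chunk; do
    AssignCellsToSamples --INPUT_BAM "$chunk" --VCF genotypes.bcf \
        --OUTPUT "${chunk%.bam}.assignments.tsv.gz" ... &
done < output.bam_list
wait
```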

jamesnemesh commented 2 years ago

@drneavin If you don't mind, I had a few more thoughts on additional bake-off scenarios you might consider.

1) Related individuals: How well does donor assignment work when you have related individuals or full trios (mother/father/one or more children) in your data set?

Our programs work just as well on related individuals and unrelated individuals, including full trios.

2) Sample swap issues: If your lab is generating relatively small pools (~20 donors) but you're working with hundreds or even thousands of samples, what happens when someone grabs the wrong tube and an unexpected donor is introduced into the pool? And what is the behavior when you do have genotype information for that sample versus when you don't (i.e., are they in the VCF)?

We run AssignCellsToSamples without a SAMPLE_FILE. Because our program runs fairly quickly, we can easily use a VCF with 1000+ individuals (and pool sizes in excess of 100 individuals). When a donor who is in the VCF is introduced, we'll identify that donor properly from the set of all possible donors. When the donor introduced is not present in the VCF (a cryptic donor), the single-donor assignment qualities will be poor, and the cells will all be flagged as mixtures of two known donors. This makes sense, as a combination of two random donors explains a random set of alleles better than any single individual.
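In command form, that is just the AssignCellsToSamples call from earlier in the thread with the donor list dropped (the cohort VCF name here is hypothetical):

```shell
# Omit --SAMPLE_FILE so every donor in the VCF is a candidate;
# all other arguments as in the earlier AssignCellsToSamples command.
AssignCellsToSamples --INPUT_BAM tagged_bam.bam \
    --VCF large_cohort.vcf \
    --OUTPUT assignments.tsv.gz ...
```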

It's possible Vireo would do really well with cryptic donors, as it has fewer priors - perhaps you want to keep the cells from that donor, knowing they all come from one true (but unknown) source, instead of throwing them away.

We've encountered plenty of plate swap and cryptic donor issues, especially when working with 3rd party sources of samples, so this sort of utility is incredibly important for a production environment.

drneavin commented 2 years ago

Thanks @jamesnemesh! I think the first scenario will be hard for us to test with our current dataset, but there may be some publicly available data we could try.

The second question is a good one, and not one I've compared yet. I think some of the methods will simply call the missing individuals as doublets, but it depends on how the parameters are set. I'll have to think about the best way to compare this across methods.

Also, I was going to test the GenerateSyntheticDoublets script but can't seem to find the executable in Drop-seq_tools-2.5.1. Is the executable something I should be able to generate, or is there somewhere else I could find it? If I can generate it, could you give me the general commands to do so? (Sorry, it's not something I've done before, so I'm not as familiar with it.)

jamesnemesh commented 2 years ago

For the first scenario, if you have a VCF with more samples in it than the set in your pool, you can run on the entire VCF. It's hard to see where your algorithm makes mistakes (i.e., picks the wrong donor) if you only allow it to pick valid choices.

As for GenerateSyntheticDoublets - looks like we’re not exposing a command line stand-alone for it, but it can be accessed from the jar:

java -jar Drop-seq_tools-2.5.1/jar/dropseq.jar GenerateSyntheticDoublets
USAGE: GenerateSyntheticDoublets [arguments]

Takes a fraction of the cell barcodes and merges the barcode labels and reads of those cells into synthetic doublets.
For example, if cells A and B were selected to be a doublet, then a new cell called A_B would be generated that contains
all the reads from the A and B cell barcodes.
Version:2.5.1(680c2ea_1642084299)