drneavin / Demultiplexing_Doublet_Detecting_Docs

MIT License
13 stars 1 forks source link

Combined upset plot #14

Closed changostraw closed 1 year ago

changostraw commented 1 year ago

Hi again, I was wondering how the upset plot data is laid out for individual assignment. It looks like you converted singlets to "1" and anything labelled as a doublet to "0" for the upset. However, I am confused as to how you decided the "Final Individual Assignment" label. What did you do in the case of one tool assigning a different label than another tool? For example if demuxlet assigns a nuclei to donor 0 for example, souporcell assigned it to unassigned, vireo assigned it to donor 1, and freemuxlet to doublet. How would your upset plot show that? thanks!

drneavin commented 1 year ago

Hi @changostraw,

This will depend on how you have run the Combine_Results.R script. You can use multiple different methods for combining the results with --method so the way it calls the combined annotation for each cell will be dependent on what you selected for that. In addition, the way that it combines the different methods results depends on the inputs for each as well:

Let me know if that helps clarify or if you need some additional information.

Cheers, Drew

changostraw commented 1 year ago

Okay thanks that makes sense. I used majority singlet as the method and was looking at the wrong file for the consensus Majority Singlet Droplet Type call. I am still getting donor 1, donor0 etc despite having done the genotype step for Freemuxlet and Souporcell (I didn't add genotype to the vireo with 1000G results). The individual assignments for Freemuxlet and Souporcell with 1000G didn't get picked up it seems. I wanted to combine results from Demuxlet, Freemuxlet, Souporcell using raw donor GT, imputed donor GT and common variants from 1000G, Vireo using all three vcf files as well. However, the combine function doesn't seem to be able to call from 3 different souporcell and vireo results in the same file. Is there a way to do this? If not I can do them separately and combine them in R after. However, then I would need to figure out how to remake the upset_df dataframe with my combined 'combined_results_w_combined_assignments.tsv' if that makes sense. I have tried to do that and have had no success.

I also have a list of singlet donor assigned barcodes from physical cholesterol modified oligo tagging (hastag oligos) that I would like to add to the upset plot. So I added this information to the combined_results_w_combined_assignments.tsv. and went through the Combine_Results.R code and tried running your code on my combined results but I am new to bioinformatics and R and wasn't able to figure out how you converted the combined_results_w_combined_assignments.tsv file into a structure suitable for the upset plot with 1's and 0's and the consensus assignment. If I could recreate that dataframe within R, I could add the extra souporcell and vireo columns and the hastag oligo information and make an integrated upset plot.

Sorry for all the questions! and in one post. If you prefer I can post each question separately. Thanks so much for this pipeline and all the documentation. It makes it so much easier for someone new to bioinformatics!

drneavin commented 1 year ago

Hi @changostraw,

Really great points. First, I would say that you should indicate the -ref as either freemuxlet or souporcell so that you get the donor IDs you are looking for distributed across all the methods.

Regarding your question about multiple different results from souporcell, vireo or any other method, this is not something I've implemented but I can definitely see how it would be very useful. I'll have to think a bit about how to do this. At first thought, a user could provide a list of results for a given parameter (ie a list of souporcell results) but if there are only genotype matching files for some, I will have to consider how that will be entered and how I would need to code it to allow that.

For now, I can help you sort out some code to make an upset plot with all your data. Could you send me either the file (you can email it to me if you prefer at d.neavin@garvan.org.au) or just send through a head of the file so I can see the structure and hopefully put together some code for the upset plot? I agree that working out code for upset plots can be rather mind boggling - especially for complicated cases like this.

I'm glad you're asking questions because it will help me optimize the tool and documentation and will help other users who may have the same questions and can then find the solutions here.

changostraw commented 1 year ago

Thanks for the offer @drneavin ! I actually think I figured out most of it. But I have a few more questions.

So what I did was create two combined reports with Demuxafy so that I could have two sets of souporcell and vireo results and then combined them in R. I then added the assignments from the hastag oligos to the table. I also calculated both "MajoritySinglet" and "AnyDoublet" using the most tools as possible and added that information to the table. So my final table so far is: CMOs (hastag oligos) Freemuxlet 1000G Souporcell 1000G Vireo 1000G Demuxlet rawGT Souporcell rawGT Vireo rawGT Scds (I will eventually add Demuxlet, Souporcell, Vireo with imputed donorGT as well)

The MajoritySinglet and the AnyDoublet calculations were made with: Freemuxlet 1000G Souporcell 1000G Vireo 1000G Demuxlet rawGT Scds

I wanted to see if demultiplexing by genotype is as good as demultiplexing based on hashtag oligos, and if so, what method is best. So I calculated 1. CMO assignment vs MajoritySinglet assignment and 2. CMO assignment vs AnyDoublet (here I called it NoDoublet) and made an upset plot for each - attached here. Basically if the CMO and GT assignment matched it returned the agreed assignment (309,318,325,366 or Multiplet), if they didn't match it returned "Disagree", if there was no CMO assignment = "No_CMO" and no GT assignment = "no_GT". I think it worked pretty well and indicates, as you suggest in your paper, that MajoritySinglet seems to be the most accurate (at least vs hashtag oligos). However, I am not sure I am correctly interpreting the MajoritySinglet and AnyDoublet methods.

I assumed MajoritySinglet means that the majority of the tools selected agreed on the assignment (ie >50% of them). I assumed the AnyDoublet would mean that if one tool called a doublet it would be assigned doublet. However, I have noticed that in my AnyDoublet column there are doublet assignments even when none of the tools assigned it a doublet. For example if Demuxlet is "unassigned", Scds is "singlet" and Vireo, Souporcell, Freemuxlet are "singlet" and agree on assignment, AnyDoublet calls this a doublet. Is this correct? If so, is there a way to ignore "unassigned" in the calculations? (Unfortunately, right now I have an issue with the rawGT of one of the samples in my pool, so that is why Demuxlet rawGT has so many unassigned.) However, in general I am wondering if it wouldn't be better to treat unassigned as NA in the combining methods. Does that make sense?

Thanks!

drneavin commented 1 year ago

Hi @changostraw ,

Very glad you were able to work this out and also very glad to hear that the oligos and the Demuxafy results seem to have pretty good agreement. We have compared antibody hashtag to genetic-based demultiplexing and find that the genetic data is slightly better because there isn't oligo/antibody jumping for genetic information and because there are a good number of cells that don't have enough antibody reads to be called for a given hashtag. It looks like that is probably what has happened for a small subset of your data as well since the demultiplexing and doublet assignments seem to be relatively consistent for those that either disagree or are missing CMO. It's great to hear that the oligos that you're working with have similar profiles. It seems like you have very clean data from what you've shared.

Regarding the MajoritySinglet and AnyDoublet Demuxafy calls, your assumptions on how the calls should be made are accurate. Can you confirm a few things?

  1. Does this issue with miscalling doublets occur just with the AnyDoublet calls (ie not with the MarjoritySinglet)?
  2. For these droplets that are being called as 'doublets' but have an 'unassigned' for demuxlet, can you tell me what the Demuxlet_DropletType and Demuxlet_Individual_Assignment are?
  3. Can you confirm what singularity image version you're using? I think it's >= 1.0.1 based on the images you sent through but just want to double check.

I previously decided to call 'unassigned' as 'doublets' but more recently have changed the coding so that (at least with the MajoritySinglet method) droplets that are called as singlets but have a combination of individual assignments (or unassigned) from different demultiplexing methods have been called as singlets + unassigned. I'm wondering if I didn't quite carry this coding through for the AnyDoublet classifications.

changostraw commented 1 year ago

Hi Drew, sorry for the delay. answers to your questions:

  1. The issue with the combined outputs calling "unassigned" droplets "doublets" is in both MajoritySinglet and AnyDoublet outputs. In MajoritySinglet if most of the tools call a droplet "unassigned" the MajoritySinglet will assign "doublet". There aren't any "unassigned" labelling in the MajoritySinglet or AnyDoublet columns at all. It is always either "doublet" or "singlet".
  2. For Demuxlet unassigned droplets in the MajoritySinglet output the Demuxlet_DropletType and Demuxlet_Individual_Assignment are both "unassigned" and the MajoritySinglet is always "singlet" if the other tools say singlet. For Demuxlet unassigned droplets in the AnyDoublet output the Demuxlet_DropletType and Demuxlet_Individual_Assignment are both "unassigned" and the AnyDoublet columns are both "doublet".
  3. Here is the information on the image version: Author: = Drew Neavin Image_version: = 2.0.1 maintainer: Anaconda, Inc org.label-schema.build-arch: amd64 org.label-schema.build-date: Friday_20_January_2023_0:58:22_UTC org.label-schema.schema-version: 1.0 org.label-schema.usage.singularity.deffile.bootstrap: docker org.label-schema.usage.singularity.deffile.from: continuumio/miniconda3 org.label-schema.usage.singularity.version: 3.10.0-dirty org.opencontainers.image.created: 2022-12-23T15:47:51.857Z org.opencontainers.image.description: Repository of Docker images created by Continuum Analytics org.opencontainers.image.licenses: org.opencontainers.image.revision: e50c6323cd703e0c2df7aa56eb18e05757df2402 org.opencontainers.image.source: https://github.com/ContinuumIO/docker-images org.opencontainers.image.title: docker-images org.opencontainers.image.url: https://github.com/ContinuumIO/docker-images org.opencontainers.image.version: 22.11.1 Thanks!
drneavin commented 1 year ago

Hi @changostraw,

I think I had opted for 'doublet' in the droplet type columns because unassigned annotations couldn't be effectively used anyway and so would be remove. However, this is something I can change in future versions since you've pointed out that it would be beneficial. However, I don't think I'll be able to make these changes in the short term (ie next month) as my plate is full with some other projects.

In the meantime, let me know if you'd like some help coming up with some code to update those rows to 'unassigned' in the droplet type column to help with your assessments.

Cheers, Drew

changostraw commented 1 year ago

Thanks @drneavin! I think I have a workaround. There are some issues with some of my pools for freemuxlet where they are all unassigned. However, the genotype correlation was pretty obvious. Am I correct in the assumption that there is a cutoff where the genotype correlations have to be => ~ 0.75? If so, is there anyway to lower it? I have some lower value correlations that are still obvious (0.71 for one cluster vs 0.4 for every other cluster). Thanks!

drneavin commented 1 year ago

Hi @changostraw,

There is a flag to indicate what correlation is considered high enough for inclusion as a called cluster of cells for a given donor. You can specify it with. There is a different flag for each software so for souporcell, it’s --souporcell_correlation_limit. By default it’s set to 0.7 but can be set to whatever threshold you want.

I would personally be hesitant about the annotating the cluster with the 0.4 correlation unless you think there is a reason for such low correlation (small number of cells, poor genotype data). If there isn’t a clear reason for that low correlation, it suggests to me that the identified cluster likely contains cells from another donor or doublets or that there is high ambient rna in the data (which you can check with souporcell and vireo as well as other methods like DecontX and CellBender). if you have the genotype, it might be best to use genotype informed methods like vireo or demuxlet depending on your experimental design and the expected reason for the poor correlation between the snp genotypes and inferred snp genotypes from the cells.

changostraw commented 1 year ago

thanks @drneavin, sorry I wasn't clear. I have 21 pools of 4 samples each. After running the genotype correlation for freemuxlet all of the freemuxlet pools have one cluster that is clearly correlated with one genotype, however, some of the correlation values are in the low 0.7's (but these are still clear as the other freemuxlet clusters are only reaching 0.4 - 0.5 for that genotype - there are no correlations values lower than 0.7 that I am picking.). However, even though these are clear correlations, when I do the combined output the freemuxlet clusters with <.75 are not assigned to the genotype, they are instead called a "doublet". It seems the threshold for this is ~0.75 based at looking at all the pools. It seems that freemuxlet clusters that have a correlation value of >0.75 get assigned to the genotype in the combined output, but freemuxlet clusters that have a correlation value of < 0.75 are assigned as "doublet".

For example for one pool freemuxlet results genotype correlations are: Cluster 1_GT13359 2_GT13375 3_GT13377 4_GT14352 CLUST0 0.5037766821022964 0.5083906675311902 0.4979855623303216 0.7286447697287014 CLUST1 0.5004675995379368 0.7669771215639491 0.4997817624567396 0.4926245174611682 CLUST2 0.7652384719802461 0.5048069449959758 0.49648566942231986 0.4919653643516229 CLUST3 0.5008796427485248 0.5098617043175034 0.7496989304110875 0.4934534238166346

In the output files freemuxlet CLUST2 and CLUST1 are assigned to genotypes 1_GT13359 and 2_GT13375, but freemuxlet CLUST3 and CLUST0 are assigned as "doublets", not genotypes 3_GT13377 and 4_GT14352.

So I was wondering if there was an option in the Combine_Results.R that was setting this threshold? But I guess you are saying this would be something that would have to be done during the Assign_Indiv_by_Geno.R step?

Hopefully that is more clear! thanks!

drneavin commented 1 year ago

Hi @changostraw, you can set this in the Assign_Indiv_by_Geno.R but you can also select the threshold in the Combine_Results.R script. For example, the parameter for souporcell would be -k SOUPORCELL_CORRELATION_LIMIT. The default is 0.7 so you could increase it to 0.75.

Cheers!

-Drew