Closed changostraw closed 1 year ago
Hi @changostraw,
This will depend on how you have run the Combine_Results.R
script. You can use multiple different methods for combining the results with --method
so the way it calls the combined annotation for each cell will be dependent on what you selected for that. In addition, the way that it combines the different methods results depends on the inputs for each as well:
--method
you used--ref
. This defaults to vireo (if provided), then demuxlet (if provided), then freemuxlet (if provided), then souporcell (if provided). The required proportion overlapping between the groups can be updated with --pct_agreement
but defaults to 0.7. Therefore, any cluster that does not overlap > 0.7 with a cluster of the selected reference, will be annotated as unassigned for that software.Let me know if that helps clarify or if you need some additional information.
Cheers, Drew
Okay thanks that makes sense. I used majority singlet as the method and was looking at the wrong file for the consensus Majority Singlet Droplet Type call. I am still getting donor 1, donor0 etc despite having done the genotype step for Freemuxlet and Souporcell (I didn't add genotype to the vireo with 1000G results). The individual assignments for Freemuxlet and Souporcell with 1000G didn't get picked up it seems. I wanted to combine results from Demuxlet, Freemuxlet, Souporcell using raw donor GT, imputed donor GT and common variants from 1000G, Vireo using all three vcf files as well. However, the combine function doesn't seem to be able to call from 3 different souporcell and vireo results in the same file. Is there a way to do this? If not I can do them separately and combine them in R after. However, then I would need to figure out how to remake the upset_df dataframe with my combined 'combined_results_w_combined_assignments.tsv' if that makes sense. I have tried to do that and have had no success.
I also have a list of singlet donor assigned barcodes from physical cholesterol modified oligo tagging (hastag oligos) that I would like to add to the upset plot. So I added this information to the combined_results_w_combined_assignments.tsv. and went through the Combine_Results.R code and tried running your code on my combined results but I am new to bioinformatics and R and wasn't able to figure out how you converted the combined_results_w_combined_assignments.tsv file into a structure suitable for the upset plot with 1's and 0's and the consensus assignment. If I could recreate that dataframe within R, I could add the extra souporcell and vireo columns and the hastag oligo information and make an integrated upset plot.
Sorry for all the questions! and in one post. If you prefer I can post each question separately. Thanks so much for this pipeline and all the documentation. It makes it so much easier for someone new to bioinformatics!
Hi @changostraw,
Really great points. First, I would say that you should indicate the -ref
as either freemuxlet
or souporcell
so that you get the donor IDs you are looking for distributed across all the methods.
Regarding your question about multiple different results from souporcell, vireo or any other method, this is not something I've implemented but I can definitely see how it would be very useful. I'll have to think a bit about how to do this. At first thought, a user could provide a list of results for a given parameter (ie a list of souporcell results) but if there are only genotype matching files for some, I will have to consider how that will be entered and how I would need to code it to allow that.
For now, I can help you sort out some code to make an upset plot with all your data. Could you send me either the file (you can email it to me if you prefer at d.neavin@garvan.org.au) or just send through a head of the file so I can see the structure and hopefully put together some code for the upset plot? I agree that working out code for upset plots can be rather mind boggling - especially for complicated cases like this.
I'm glad you're asking questions because it will help me optimize the tool and documentation and will help other users who may have the same questions and can then find the solutions here.
Thanks for the offer @drneavin ! I actually think I figured out most of it. But I have a few more questions.
So what I did was create two combined reports with Demuxafy so that I could have two sets of souporcell and vireo results and then combined them in R. I then added the assignments from the hastag oligos to the table. I also calculated both "MajoritySinglet" and "AnyDoublet" using the most tools as possible and added that information to the table. So my final table so far is: CMOs (hastag oligos) Freemuxlet 1000G Souporcell 1000G Vireo 1000G Demuxlet rawGT Souporcell rawGT Vireo rawGT Scds (I will eventually add Demuxlet, Souporcell, Vireo with imputed donorGT as well)
The MajoritySinglet and the AnyDoublet calculations were made with: Freemuxlet 1000G Souporcell 1000G Vireo 1000G Demuxlet rawGT Scds
I wanted to see if demultiplexing by genotype is as good as demultiplexing based on hashtag oligos, and if so, what method is best. So I calculated 1. CMO assignment vs MajoritySinglet assignment and 2. CMO assignment vs AnyDoublet (here I called it NoDoublet) and made an upset plot for each - attached here. Basically if the CMO and GT assignment matched it returned the agreed assignment (309,318,325,366 or Multiplet), if they didn't match it returned "Disagree", if there was no CMO assignment = "No_CMO" and no GT assignment = "no_GT". I think it worked pretty well and indicates, as you suggest in your paper, that MajoritySinglet seems to be the most accurate (at least vs hashtag oligos). However, I am not sure I am correctly interpreting the MajoritySinglet and AnyDoublet methods.
I assumed MajoritySinglet means that the majority of the tools selected agreed on the assignment (ie >50% of them). I assumed the AnyDoublet would mean that if one tool called a doublet it would be assigned doublet. However, I have noticed that in my AnyDoublet column there are doublet assignments even when none of the tools assigned it a doublet. For example if Demuxlet is "unassigned", Scds is "singlet" and Vireo, Souporcell, Freemuxlet are "singlet" and agree on assignment, AnyDoublet calls this a doublet. Is this correct? If so, is there a way to ignore "unassigned" in the calculations? (Unfortunately, right now I have an issue with the rawGT of one of the samples in my pool, so that is why Demuxlet rawGT has so many unassigned.) However, in general I am wondering if it wouldn't be better to treat unassigned as NA in the combining methods. Does that make sense?
Thanks!
Hi @changostraw ,
Very glad you were able to work this out and also very glad to hear that the oligos and the Demuxafy results seem to have pretty good agreement. We have compared antibody hashtag to genetic-based demultiplexing and find that the genetic data is slightly better because there isn't oligo/antibody jumping for genetic information and because there are a good number of cells that don't have enough antibody reads to be called for a given hashtag. It looks like that is probably what has happened for a small subset of your data as well since the demultiplexing and doublet assignments seem to be relatively consistent for those that either disagree or are missing CMO. It's great to hear that the oligos that you're working with have similar profiles. It seems like you have very clean data from what you've shared.
Regarding the MajoritySinglet and AnyDoublet Demuxafy calls, your assumptions on how the calls should be made are accurate. Can you confirm a few things?
Demuxlet_DropletType
and Demuxlet_Individual_Assignment
are?I previously decided to call 'unassigned' as 'doublets' but more recently have changed the coding so that (at least with the MajoritySinglet method) droplets that are called as singlets but have a combination of individual assignments (or unassigned) from different demultiplexing methods have been called as singlets + unassigned. I'm wondering if I didn't quite carry this coding through for the AnyDoublet classifications.
Hi Drew, sorry for the delay. answers to your questions:
Hi @changostraw,
I think I had opted for 'doublet' in the droplet type columns because unassigned annotations couldn't be effectively used anyway and so would be remove. However, this is something I can change in future versions since you've pointed out that it would be beneficial. However, I don't think I'll be able to make these changes in the short term (ie next month) as my plate is full with some other projects.
In the meantime, let me know if you'd like some help coming up with some code to update those rows to 'unassigned' in the droplet type column to help with your assessments.
Cheers, Drew
Thanks @drneavin! I think I have a workaround. There are some issues with some of my pools for freemuxlet where they are all unassigned. However, the genotype correlation was pretty obvious. Am I correct in the assumption that there is a cutoff where the genotype correlations have to be => ~ 0.75? If so, is there anyway to lower it? I have some lower value correlations that are still obvious (0.71 for one cluster vs 0.4 for every other cluster). Thanks!
Hi @changostraw,
There is a flag to indicate what correlation is considered high enough for inclusion as a called cluster of cells for a given donor. You can specify it with. There is a different flag for each software so for souporcell, it’s --souporcell_correlation_limit. By default it’s set to 0.7 but can be set to whatever threshold you want.
I would personally be hesitant about the annotating the cluster with the 0.4 correlation unless you think there is a reason for such low correlation (small number of cells, poor genotype data). If there isn’t a clear reason for that low correlation, it suggests to me that the identified cluster likely contains cells from another donor or doublets or that there is high ambient rna in the data (which you can check with souporcell and vireo as well as other methods like DecontX and CellBender). if you have the genotype, it might be best to use genotype informed methods like vireo or demuxlet depending on your experimental design and the expected reason for the poor correlation between the snp genotypes and inferred snp genotypes from the cells.
thanks @drneavin, sorry I wasn't clear. I have 21 pools of 4 samples each. After running the genotype correlation for freemuxlet all of the freemuxlet pools have one cluster that is clearly correlated with one genotype, however, some of the correlation values are in the low 0.7's (but these are still clear as the other freemuxlet clusters are only reaching 0.4 - 0.5 for that genotype - there are no correlations values lower than 0.7 that I am picking.). However, even though these are clear correlations, when I do the combined output the freemuxlet clusters with <.75 are not assigned to the genotype, they are instead called a "doublet". It seems the threshold for this is ~0.75 based at looking at all the pools. It seems that freemuxlet clusters that have a correlation value of >0.75 get assigned to the genotype in the combined output, but freemuxlet clusters that have a correlation value of < 0.75 are assigned as "doublet".
For example for one pool freemuxlet results genotype correlations are: Cluster 1_GT13359 2_GT13375 3_GT13377 4_GT14352 CLUST0 0.5037766821022964 0.5083906675311902 0.4979855623303216 0.7286447697287014 CLUST1 0.5004675995379368 0.7669771215639491 0.4997817624567396 0.4926245174611682 CLUST2 0.7652384719802461 0.5048069449959758 0.49648566942231986 0.4919653643516229 CLUST3 0.5008796427485248 0.5098617043175034 0.7496989304110875 0.4934534238166346
In the output files freemuxlet CLUST2 and CLUST1 are assigned to genotypes 1_GT13359 and 2_GT13375, but freemuxlet CLUST3 and CLUST0 are assigned as "doublets", not genotypes 3_GT13377 and 4_GT14352.
So I was wondering if there was an option in the Combine_Results.R that was setting this threshold? But I guess you are saying this would be something that would have to be done during the Assign_Indiv_by_Geno.R step?
Hopefully that is more clear! thanks!
Hi @changostraw, you can set this in the Assign_Indiv_by_Geno.R
but you can also select the threshold in the Combine_Results.R
script. For example, the parameter for souporcell would be -k SOUPORCELL_CORRELATION_LIMIT
. The default is 0.7 so you could increase it to 0.75.
Cheers!
-Drew
Hi again, I was wondering how the upset plot data is laid out for individual assignment. It looks like you converted singlets to "1" and anything labelled as a doublet to "0" for the upset. However, I am confused as to how you decided the "Final Individual Assignment" label. What did you do in the case of one tool assigning a different label than another tool? For example if demuxlet assigns a nuclei to donor 0 for example, souporcell assigned it to unassigned, vireo assigned it to donor 1, and freemuxlet to doublet. How would your upset plot show that? thanks!