10XGenomics / scHLAcount

Count HLA alleles in single-cell RNA-seq data
MIT License
58 stars 12 forks source link

Emtpy .mtx #10

Open chris-mcginnis-ucsf opened 4 years ago

chris-mcginnis-ucsf commented 4 years ago

Hi @cdarby

Thanks for making this tool! I'm having some trouble interpreting the outputs, and I'm hoping you can help.

So I've run the pipeline twice now on two different 3' scRNA-seq PBMC datasets -- one is the Frozen Donor B data from Zheng et al (2017, Nat Comm - V1 chemistry), and another is a dataset of 4 unrelated PBMC donors that I generated (V2 chemistry). It may be worth noting that I can split the cellID list by donor if necessary, but I ran it as a batch here.

The pipeline completes successfully, and I see distinct allele labels in the labels.tsv file for each output. However, I have no counts in the .mtx file for either dataset, and I don't really know what to make of the weights.tsv output.

If the .mtx matrix is 'empty', does this mean that scHLAcount is not accurately characterizing the data? My end goal for this analysis is to get an HLA-type for each donor (as you would normally do with standard HLA typing assays).

Chris McGinnis

Here's my script, for reference:

./sc_hla_count --bam /Users/gartnerlab/Desktop/scHLAcount/possorted_genome_bam.bam --cell-barcodes /Users/gartnerlab/Desktop/SLNR_PBMC/REVIEW/cellIDs_B.txt -d /Users/gartnerlab/Desktop/scHLAcount/IMGT/ ./sc_hla_count --bam /Volumes/MULTIseq_1/SLNR_PBMC/crRNA/SLNR_L2/outs/possorted_genome_bam.bam --cell-barcodes /Users/gartnerlab/Desktop/SLNR_PBMC/REVIEW/cellIDs_ficoll.txt -d /Users/gartnerlab/Desktop/scHLAcount/IMGT/

cdarby commented 4 years ago

Hi Chris I'm not sure why the matrix file is empty. But if you have weights.tsv, then the genotyping stage has been completed, which seems to be the goal of your analysis. After genotyping, the top genotypes should have been selected (and will appear in the labels.tsv file) and the molecule counting stage should have run. What was the output log to the command line for the molecule counting stage?

The weights.tsv file columns are: the allele name, weight of the allele in the expectation maximization (EM) algorithm, and the number of reads explained by that allele. By default it reports the top 10 alleles for each gene.

You should also see a pairs file, where the first two columns are the allele name and the third column is the number of reads explained by this pair of alleles - this is relevant when the individual is heterozygous. By default it reports the top 5 pairs for each gene.

The genotyping algorithm assumes the individual is diploid and will not work properly if there are cells from more than one individual in the dataset.

Please let me know if you have further questions Charlotte

chris-mcginnis-ucsf commented 4 years ago

Hi Charlotte (@cdarby),

Thanks for the quick response. I agree that I don't necessarily need the .mtx file for my purposes, I was just curious if this could suggest that the pipeline is not working properly.

For the record, here's the output log: Sharding sequences... Got 364365 sequence chunks Assembling 111 shards...

Done separate de Bruijn graph construction Starting merging disjoint graphs Merger of graphs complete Indexing de Bruijn graph Total 56910 kmers to process in dbg Making mphf of kmers Assigning offsets to kmers Sharding sequences... Got 10720 sequence chunks Assembling 6 shards...

Done separate de Bruijn graph construction Starting merging disjoint graphs Merger of graphs complete Indexing de Bruijn graph Total 71838 kmers to process in dbg Making mphf of kmers Assigning offsets to kmers Sharding sequences... Got 1203 sequence chunks Assembling 1 shards...

Done separate de Bruijn graph construction Starting merging disjoint graphs Merger of graphs complete Indexing de Bruijn graph Total 8078 kmers to process in dbg Making mphf of kmers Assigning offsets to kmers

On a semi-related note, are there any 'red-flags' that you could look for in the outputs that would suggest the HLA typing was inaccurate? I do not have HLA-type ground-truth for these samples and I am basically trying to determine whether cells with the predicted the HLA-type would engage in an allogeneic response.

For example, when I look at the 'pairs' file from one run, I see that the top 5 pairs explain similar proportions of the total reads (e.g., +/- 0.2%): image

Given this result, how confident would you be in claiming that these cells are heterozygous at HLA-A with the 02:52 and 24:234 alleles?

Chris

chris-mcginnis-ucsf commented 4 years ago

Follow-up question:

I understand that the algorithm won't work properly if more than one donor is present in the dataset. However, is this also true if you use donor-specific cell ID lists for the input? For example, I assigned cells into donor groups using an in silico genotyping approach and then processed the same bam file using these different cell ID sets -- this resulted in the same set of HLA alleles being identified for the two unrelated donors.

If there's no way to handle this within the algorithm itself, what are your thoughts on parsing the FASTQs according to the cell ID lists before generating BAM files which can then be used for scHLAcount?

cdarby commented 4 years ago

Hi Chris

Quick note - the genotyping algorithm is still a work-in-progress, although none of the team are actively developing it right now. Limited testing was performed on samples with known HLA type. You may consider running other tools designed for genotyping from other sequencing types (e.g. bulk RNA-seq) on your reads to see if they agree with the scHLAcount results. For the HLA-A results you show, I would say there is not enough evidence to prefer one of the two-field resolution types over the other because the top pairs are nearly exactly tied. It does seem to give evidence for the genotype at one-field resolution of A02/A24 though.

In terms of giving you confidence that the scHLAcount genotype agrees with the genotype that would be reported by HLA typing assays, we just didn't get to that point with validating the performance of the genotyping algorithm. If I were to brainstorm how you might validate the genotypes from scHLAcount or another software, perhaps you could get the coding sequences of the top alleles from the IMGT/HLA database, remap the HLA-A reads to the allele sequences using Bowtie2/BWA-MEM for example, and explore whether there are fewer mismatches in the alignments to the personalized alleles compared to the STAR alignments (from Cell Ranger) to the reference genome.

I see you're using 3'GEX data. Anecdotally, we noticed that genotyping from 3'GEX was less reliable than from 5'GEX, possibly due to lower coverage in the typing exons (of HLA class I). We did not test any datasets using V1 chemistry, so I can't speak to any considerations for that dataset.

It looks like the command line output you posted only completed the database graph construction and did not report analysis of any reads, so I am surprised that genotypes were reported. (e.g. there should be a log entry that says "Processed X reads, Y pseudoaligned to HLA reference".) Did it crash?

I checked the code and it appears that scHLAcount only ignores reads with barcodes that are not in the list you provide for the molecule counting step and uses all reads for the genotyping step. This would explain why you see the same results when providing different barcode files. To manually separate the files, given your lists of cell barcodes, I suppose you could write a script to scan the "CB" tags in the BAM files and match them with the ones on the lists.

Charlotte