algo-cancer / ImmunoTyper-SR

Genotyping Immunoglobulin Heavy Chain Variable Genes using Short Read Data
Other
8 stars 1 forks source link

output question #12

Open wbvguo opened 1 year ago

wbvguo commented 1 year ago

Dear Mike,

Thanks for your previous help! I successfully run the tool on a human sample and get the output IGHV_functional_allele_calls.txt, but I am curious what does the output mean. Specifically, I got some lines like

IGHV6-1*01
IGHV6-1*01
IGHV6-1*01
IGHV1-2*04
IGHV1-2*04
IGHV1-18*01
IGHV1-18*01
IGHV1-8*01
IGHV1-69*06
IGHV1-69*01
IGHV1-69*01

Considering human is diploid organism, I would expect for each gene, there should be 2 records (either the same allele appearing twice for homozygotes or different alleles for heterozygotes). But in the output some alleles appears more than 2 times (e.g. IGHV6-1*01), and some gene only have one allele count (e.g. IGHV1-8). May I ask why we could see such phenomenon?

On the other hand, may I confirm that IGHV_functional_allele_calls.txt is always a subset of IGHV_allele_calls.txt?

Thanks,

michael-ford commented 1 year ago

Hi Wenbin,

First off, yes IGHV_functional_allele_calls.txt should always be a subset of IGHV_allele_calls.txt.

IGHV has many known structural variants which can result in copy number variants of the V genes, such as more or less than 2 copies. Each line of the output represents on copy corresponding to the listed allele.

For more information on the structural variants in IGH I recommend the following papers:

Watson, Corey T., et al. "Complete haplotype sequence of the human immunoglobulin heavy-chain variable, diversity, and joining genes and characterization of allelic and copy-number variation." The American Journal of Human Genetics 92.4 (2013): 530-546. Rodriguez, Oscar

L., et al. "A novel framework for characterizing genomic haplotype diversity in the human immunoglobulin heavy chain locus." Frontiers in immunology 11 (2020): 2136. 

There is one important thing to note: IMGT annotations include duplicate genes (identified with the D notation) for some, but not all of these potential CNV genes. However, since ImmunoTyper is reference-free and does not do whole-locus assembly, it is unable to distinguish between say 1-69*01 and 1-69D*01, which have identical coding sequences. As a result, it will simply output IGHV1-69*01 for every copy corresponding to that coding sequence.

I appreciate you bringing this up, I think your questions suggest a couple desired improvements for clarity:

wbvguo commented 11 months ago

Dear Mike,

Thank you very much for your previous help! When I ran this tool for a specific gene_type, I noticed it would first extract the reads from genomic regions of interest of that gene_type, then extract the unmapped reads, redirect them into a fasta file called <prefix>-extracted.fa. If this file already exists, the tool will skip the extraction and directly use the existing fasta file for allele calling. So if I analyze the IGHV, IGKV, and IGLV sequentially, the IGHV's result will look fine, but IGKV and IGLV will usually have empty output (because fasta file generated in IGHV analysis is not overwritten, but being used to call IGKV or IGLV alleles. In such scenarios, the empty output is expected)

It took me some time to recognize this issue, as I previously thought the empty output might be due to IGKV and IGLV regions being too complicated or with too shallow depth. While this issue can be easily fixed on the user's side: ensure there is no xxxx-extracted.fa file before running any gene_type analysis. I guess a more convenient fix is to update the code to have a more specific fasta file name (e.g., instead of xxxx-extracted.fa, we can use xxxx-${gene_type}_extracted.fa) or provide an option to let users decide the name of fasta file.

On the other hand, I also have another suggestion regarding the unmapped reads. Currently, the tool will perform the same unmapped reads' extraction process for each gene_type analysis. Given that extracting the unmapped reads is more time-consuming than extracting from regions of interest. If a user wants to run different gene_type analyses on the same sample, this unmapped reads extraction step can be restricted to only executed once for better computational efficiency.

Moreover, the current procedure will align the unmapped reads to the allele reference to rescue some useful reads. This is a great idea, but it looks to me that the tool tries to align unmapped reads to just one particular gene_type (the one being investigated) instead of aligning them to the full spectrum of alleles from all gene_type. As unmapped reads can come from anywhere on (or outside) the genome, aligning to one gene_type alleles' reference might somewhat bias the results. For example, say we analyze IGHV and align the unmapped reads to the IGHV alleles reference. If some of the unmapped reads are generated by alleles in the IGKV region (which might be similar to the IGHV alleles). Aligning these reads to the full spectrum of allele reference will mark many of them as multi-mapping, but aligning these reads to IGHV allele reference will mark many of them as unique mapping. I am unsure how this multi-mapping will interfere with the final results, but it seems more reasonable to align the unmapped reads to all possible allele references? (apologize in advance if I misunderstand the tool)

In short, I guess my suggestions can be summarized below:

perhaps we can optimize the current workflow from

# for a particular gene type analysis
if xxxx-extracted.fa does not exist: 
    extract reads the regions of interest > xxxx-extracted.fa
    extract unmapped reads >> xxxx-extracted.fa

use xxxx-extracted.fa to call alleles

to sth like

# for a particular gene type analysis
if xxxx-unmapped_summary.txt does not exist:
    extract unmapped reads > xxxx-unmapped_extracted.fa
    align the xxxx-unmapped_extracted.fa to full spectrum alleles
    generate xxxx-unmapped_summary.txt

extract reads from regions of interest > xxxx-${gene_type}_extracted.fa
align the xxxx-${gene_type}_extracted.fa to ${gene_type} allele's reference 
extract the xxxx-${gene_type}_summary.txt 

extract information from xxxx-unmapped_summary.txt for the specific `gene_type` and merge with xxxx-${gene_type}_summary.txt , then identify alleles

I hope these suggestions can help in some way.

Best,