YingZhou001 / Immuannot

Immunological gene typing and annotation for genome assembly
MIT License

CDS measures #8

Open ymokrab opened 1 month ago

ymokrab commented 1 month ago

Hi, can you explain how you calculate CDS distance, CDS consensus, and CDS mutations? Thanks.

YingZhou001 commented 1 month ago

Hi,

CDS distance is taken from the minimap2 output: it is the value X of the NM:i:X tag. CDS mutations are extracted from the cs string, which is also part of the minimap2 output. CDS consensus is summarized from the consensus part of the alleles that have exactly the shortest CDS distance and the longest matching length.
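As a rough illustration of where those two values live, here is a minimal Python sketch that pulls the NM and cs tags out of one minimap2 PAF line. The sample line, the sequence names, and the helper `parse_tags` are made up for illustration, and this is not Immuannot's own code. It assumes minimap2 was run so that both tags are present (for example with base-level alignment and `--cs`); how Immuannot actually invokes minimap2 is not stated in this thread.

```python
def parse_tags(paf_line):
    """Return the optional SAM-style tags of one PAF line as a dict."""
    fields = paf_line.rstrip("\n").split("\t")
    tags = {}
    # PAF has 12 mandatory columns; optional TAG:TYPE:VALUE fields follow.
    for field in fields[12:]:
        name, value_type, value = field.split(":", 2)
        tags[name] = int(value) if value_type == "i" else value
    return tags


# Hypothetical PAF line: a 100 bp CDS aligned to a 100 bp reference allele
# with two mismatches (values invented for this example).
line = "\t".join([
    "query_cds", "100", "0", "100", "+",
    "ref_allele", "100", "0", "100", "98", "100", "60",
    "NM:i:2", "cs:Z::40*ag:30*ct:28",
])

tags = parse_tags(line)
print("CDS distance (NM):", tags["NM"])
print("cs string:", tags["cs"])
```

Splitting each tag with `split(":", 2)` matters because the cs value itself starts with a colon and contains more of them.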

You may check the section of "Typing genes in the IPD-IMGT/HLA and IPD-KIR databases" in our manuscript for related description. https://genome.cshlp.org/content/early/2024/06/05/gr.278985.124.abstract

Best, Ying

ymokrab commented 1 month ago

Thank you @YingZhou001. I am still unclear about how some of the output results are structured. Consider the following example output, in which I seem to have found a novel allele of the TAP2 gene in an assembly I am working on. What do the entries of the CDS_mutation field contain? What does a CDS distance value of 2 mean? Do CDS mutations also include indels and frameshifts, or just SNPs? example

YingZhou001 commented 1 month ago

Sorry for the confusion here.

In your example, four alleles from IMGT each differ from your target TAP2 CDS sequence by only two mutations (CDS distance = 2). CDS distance is the edit distance between the CDSs. CDS mutations usually do not include indels, but when indels occur the record shows the changes, for example:

cds_mut "HLA-W05:02|:67*tc:41*ga:7*ac:149-a:365-g:185*ct:56*tc:41*ct:19|Tre(ACG)<Met(ATG):Gln(CAG)<Rrg(CGG):Rrg(CGC)<Ser(AGC):ProLys(CCCAAA)<CCCAAAA(CCCAAAA):Val(GTG)<GTGG(GTGG):Val(GTG)<Ala(GCG):Pro(CCT)<Leu(CTT):Leu(CTC)<Pro(CCC);";
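To make the middle field of such a record easier to read, here is a small Python sketch that decodes a short-form cs string into events and tallies an edit distance from them. It assumes the standard minimap2 short cs notation (`:N` for N matching bases, `*xy` for a substitution, `+seq` for an insertion, `-seq` for a deletion), so the substitutions in the string below are written with their `*` prefix. The helpers `parse_cs` and `edit_distance` are hypothetical names, not part of Immuannot, and the tally assumes that each mismatched, inserted, or deleted base counts once.

```python
import re

# One token of a short-form cs string.
CS_OP = re.compile(r":(\d+)|\*([a-z])([a-z])|\+([a-z]+)|-([a-z]+)")


def parse_cs(cs):
    """Decode a short-form cs string into a list of (kind, value) events."""
    events, i = [], 0
    while i < len(cs):
        m = CS_OP.match(cs, i)
        if m is None:
            # Long-form (=ACGT) and intron (~) tokens are not handled here.
            raise ValueError(f"unexpected cs content at offset {i}: {cs[i:]!r}")
        n, first, second, ins, dele = m.groups()
        if n is not None:
            events.append(("match", int(n)))
        elif first is not None:
            events.append(("sub", first + second))
        elif ins is not None:
            events.append(("ins", ins))
        else:
            events.append(("del", dele))
        i = m.end()
    return events


def edit_distance(events):
    """Count one edit per substituted, inserted, or deleted base."""
    total = 0
    for kind, value in events:
        if kind == "sub":
            total += 1
        elif kind in ("ins", "del"):
            total += len(value)
    return total


# cs field of the example record, with '*' marking each substitution.
cs = ":67*tc:41*ga:7*ac:149-a:365-g:185*ct:56*tc:41*ct:19"
events = parse_cs(cs)
subs = [v for k, v in events if k == "sub"]
dels = [v for k, v in events if k == "del"]
print("substitutions:", subs)
print("deletions:", dels)
print("edit distance:", edit_distance(events))
```

Under these assumptions the example decodes to six substitutions and two single-base deletions, which lines up with the eight amino acid entries in the last field of the record.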

You can read the deletions here from the 'cs' string ":67*tc:41*ga:7*ac:149-a:365-g:185*ct:56*tc:41*ct:19" or from the amino acid translation: ProLys(CCCAAA)<CCCAAAA(CCCAAAA),