Discrepancy between the DPA1_prot.txt and DPA1_nuc.txt files for the DPA1*03:05:01Q alleles

RomuloVianna commented 8 months ago

There is a discrepancy between the DPA1_prot.txt and DPA1_nuc.txt files for the DPA1*03:05:01Q alleles in v3.55. According to the protein alignment obtained from the database (DPA1_prot.txt file), the first 6 amino acids present in DPA1*01:03:01 are not expressed in the DPA1*03:05:01Q alleles (deletions indicated as 6 dots), as displayed below:

[Image: excerpt of the DPA1_prot.txt protein alignment]

However, the coding nucleotide alignment for DPA1*03:05:01Q in the DPA1_nuc.txt file shows the first 6 codons as part of exon 1, not as deletions (dots) as in the protein alignment.

[Image: excerpt of the DPA1_nuc.txt nucleotide alignment]

I assume that either the DPA1_prot.txt or the DPA1_nuc.txt file should be adjusted so that both agree on whether the first coding codon of the DPA1*03:05:01Q alleles is ATG (the seventh codon aligned to DPA1*01:03:01) or ACG (the first codon aligned to DPA1*01:03:01).

dominicbarkerAN commented 8 months ago

Hello, the alleles you are describing, DPA1*03:05:01:01Q, DPA1*03:05:01:02Q and DPA1*03:05:02Q, all contain a mutation in the start codon: the ATG starting at position 1 of the cDNA is ACG in these alleles. These nucleotides are present in the sequence and are included at their homologous positions in the CDS alignment. However, protein translation cannot start here, because the start codon is disrupted. Translation therefore starts at the next ATG, beginning at position 19 in the cDNA. The 6 codons that are not translated are correctly marked up as deletions in the protein alignment. This happens in many other alleles in other genes, as well as in DPA1*02:64Q in DPA1.
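To illustrate the numbers above, here is a minimal Python sketch that scans a CDS codon by codon for the first in-frame ATG. It is only an illustration, not the database's own tooling: the function name `first_in_frame_start` is hypothetical, the in-frame scan is an assumption about how "the next ATG" is located, and the input is limited to the six codons quoted later in this thread followed by the ATG.

```python
START = "ATG"

def first_in_frame_start(cds: str) -> int:
    """Return the 0-based index of the first in-frame ATG, or -1 if none."""
    cds = cds.upper().replace(" ", "")
    # Step through the sequence one codon (3 nt) at a time.
    for i in range(0, len(cds) - 2, 3):
        if cds[i:i + 3] == START:
            return i
    return -1

# Only the codons quoted in this thread: the disrupted start codon (ACG),
# five further codons, then the next ATG. Nothing beyond that is assumed.
prefix = "ACG CGC CCT GAA GAC AGA ATG"

idx = first_in_frame_start(prefix)
skipped_codons = idx // 3   # codons before the usable start codon
cdna_position = idx + 1     # 1-based cDNA position of that ATG

print(skipped_codons, cdna_position)  # 6 19
```

The six skipped codons correspond to the six dots in the protein alignment, while the nucleotides themselves stay in the CDS alignment.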

I hope this explanation answers your query.

RomuloVianna commented 8 months ago

Hi @dominicbarkerAN, thanks for your detailed reply! If, as you said, the first 6 codons (ACG CGC CCT GAA GAC AGA) of the DPA1*03:05:01Q alleles are not translated, then they are not coding sequence. Also, considering that the files named “X_nuc.txt”, where X is a locus or gene, contain the nucleotide coding sequences (CDS), shouldn't the first 6 codons be considered part of the 5' UTR rather than part of a coding region such as exon 1?

dominicbarkerAN commented 8 months ago

Hi @RomuloVianna

You are correct to say that these nucleotides are not translated and are therefore not part of the coding sequence. However, we include the regions of each allele that are homologous to the CDS of the reference for that locus. For this reason we include nucleotides from the canonical start codon onwards, even in cases where a mutation causes it to be disrupted. Similarly, we include nucleotide sequence up to the canonical stop codon for alleles which have a premature stop. Polymorphisms can occur in these regions and need to be represented to prevent ambiguous typing results.

RomuloVianna commented 8 months ago

Hi again @dominicbarkerAN

I am sorry if I am missing something here, but I really don't understand why the first 6 codons (ACG CGC CCT GAA GAC AGA) of DPA1*03:05:01:01Q are not represented as dots in the DPA1_nuc.txt alignment as well. For example, DRB4*01:03:01:02N is represented that way in the DRB4_nuc.txt file at the beginning of its exon 2, which starts a few nucleotides after the canonical start, as discussed in https://github.com/ANHIG/IMGTHLA/issues/298

Speaking of the DRB4*01:03:01:02N allele, its genomic sequence contains nucleotides that are homologous to the beginning of exon 2, but since they do not belong to the exon itself, this region is represented with dots instead of nucleotide letters (image above).

Going back to the DPA1*03:05:01:01Q allele, I double-checked, and the hla.dat file also indicates that the CDS (and so the coding part of exon 1) of DPA1*03:05:01:01Q starts after the first 6 codons.

FT CDS join(519..618,4203..4448,4789..5070,5285..5439)
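As a quick sanity check on the feature line above, the segment lengths in the `join(...)` location can be computed with a few lines of Python. This is a minimal sketch: the helper name `segment_lengths` is hypothetical, and it assumes the location contains only simple `start..end` ranges with inclusive, 1-based coordinates, as in this entry.

```python
import re

def segment_lengths(location: str) -> list[int]:
    """Return the length of each start..end range in a join() location."""
    pairs = re.findall(r"(\d+)\.\.(\d+)", location)
    # Coordinates are inclusive, so the length is end - start + 1.
    return [int(end) - int(start) + 1 for start, end in pairs]

location = "join(519..618,4203..4448,4789..5070,5285..5439)"
lengths = segment_lengths(location)
total = sum(lengths)

print(lengths)           # [100, 246, 282, 155]
print(total, total % 3)  # 783 0
```

The annotated CDS is 783 nucleotides long, a whole number of codons (261), with a first segment of 100 nucleotides.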

Discrepancy between the DPA1_prot.txt and DPA1_nuc.txt files for the DPA1*03:05:01Q alleles #361