ANHIG / IMGTHLA

Github for files currently published in the IPD-IMGT/HLA FTP Directory hosted at the European Bioinformatics Institute
http://www.ebi.ac.uk/ipd/imgt/hla/

Truncated sequences in MSF files #200

Closed (peterjc closed this issue 5 years ago)

peterjc commented 5 years ago

e.g. file msf/C_gen.msf on branch 3360 aka v3.36.0 which starts as follows:

$ head C_gen.msf 
!!NA_MULTIPLE_ALIGNMENT

   MSF: 4483  Type: N  Apr 16, 2019  12:20  Check: 0 ..

 Name: C*01:02:01:01    Len:  4483  Check: 9459  Weight:  1.00
 Name: C*01:02:01:02    Len:  4483  Check: 1547  Weight:  1.00
 Name: C*01:02:01:03    Len:  4483  Check:  463  Weight:  1.00
 Name: C*01:02:01:04    Len:  4483  Check: 2234  Weight:  1.00
 Name: C*01:02:01:05    Len:  4483  Check: 5422  Weight:  1.00
 Name: C*01:02:01:06    Len:  4483  Check: 1265  Weight:  1.00

Amongst these, we find two sequences are unusually declared to be shorter at 4482:

 Name: C*05:206         Len:  4482  Check:  871  Weight:  1.00
 Name: C*05:208N        Len:  4482  Check: 1295  Weight:  1.00

Near the end of the file we find:

       C*05:200  .......... .......... .......... ...
       C*05:201  .......... .......... .......... ...
       C*05:203  .......... .......... .......... ...
       C*05:204  .......... .......... .......... ...
       C*05:205  .......... .......... .......... ...
       C*05:206  .......... .......... .......... ..
       C*05:207  .......... .......... .......... ...
      C*05:208N  .......... .......... .......... ..
       C*05:209  .......... .......... .......... ...
       C*05:210  .......... .......... .......... ...
       C*05:211  .......... .......... .......... ...

Sequences C*05:206 and C*05:208N are one letter shorter than the rest, clearly visible here as a missing final dot.

Why are these two not given another trailing dot, giving the same number of letters (4483) as the rest?

Edit: URL https://github.com/ANHIG/IMGTHLA/blob/3360/msf/C_gen.msf
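
A check like the one described above can be automated. The following is a minimal sketch, not part of any IPD-IMGT/HLA tooling, that compares each `Len:` value in the header against the overall `MSF:` length. The function name `find_length_mismatches` and the regular expressions are illustrative assumptions; the sample lines are copied from the header excerpts above. It also assumes the header ends at the usual `//` separator line, if one is present.

```python
import re

# Patterns for the two kinds of header line shown above (assumed layout).
MSF_RE = re.compile(r"^\s*MSF:\s*(\d+)\s+Type:\s*\S+")
NAME_RE = re.compile(r"^\s*Name:\s*(\S+)\s+Len:\s*(\d+)")


def find_length_mismatches(lines):
    """Return (alignment_length, {name: declared_len}) for entries whose
    declared Len differs from the overall MSF length."""
    alignment_len = None
    mismatches = {}
    for line in lines:
        if line.strip() == "//":
            break  # end of header; alignment blocks follow
        msf = MSF_RE.match(line)
        if msf:
            alignment_len = int(msf.group(1))
            continue
        name = NAME_RE.match(line)
        if name and alignment_len is not None:
            declared = int(name.group(2))
            if declared != alignment_len:
                mismatches[name.group(1)] = declared
    return alignment_len, mismatches


sample = """\
!!NA_MULTIPLE_ALIGNMENT

   MSF: 4483  Type: N  Apr 16, 2019  12:20  Check: 0 ..

 Name: C*01:02:01:01    Len:  4483  Check: 9459  Weight:  1.00
 Name: C*05:206         Len:  4482  Check:  871  Weight:  1.00
 Name: C*05:208N        Len:  4482  Check: 1295  Weight:  1.00
""".splitlines()

print(find_length_mismatches(sample))
```

On a real file, pass the lines of the opened file instead of `sample`.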

peterjc commented 5 years ago

Going back to branch 3320, aka v3.32.0, another example:

$ head DQB1_prot.msf
!!AA_MULTIPLE_ALIGNMENT

   MSF: 269  Type: P  Apr 17, 2018  15:21  Check: 0 ..

 Name: DQB1*02:01:01    Len:   269  Check: 6644  Weight:  1.00
 Name: DQB1*02:01:02    Len:   269  Check: 5969  Weight:  1.00
 Name: DQB1*02:01:03    Len:   269  Check: 5969  Weight:  1.00
 Name: DQB1*02:01:04    Len:   269  Check: 5261  Weight:  1.00
 Name: DQB1*02:01:05    Len:   269  Check: 5261  Weight:  1.00
 Name: DQB1*02:01:06    Len:   269  Check: 5261  Weight:  1.00

This contains multiple sequences with a shorter declared length, e.g.

 Name: DQB1*02:17       Len:   269  Check: 6596  Weight:  1.00
 Name: DQB1*02:18N      Len:    93  Check: 6792  Weight:  1.00
 Name: DQB1*02:19       Len:   269  Check: 5875  Weight:  1.00
 Name: DQB1*02:20N      Len:   126  Check: 7947  Weight:  1.00
 Name: DQB1*02:21       Len:   269  Check: 5917  Weight:  1.00

Here the truncation in the alignment block is much more severe:

$ grep -C 2 "DQB1\*02:19" DQB1_prot.msf
 Name: DQB1*02:17       Len:   269  Check: 6596  Weight:  1.00
 Name: DQB1*02:18N      Len:    93  Check: 6792  Weight:  1.00
 Name: DQB1*02:19       Len:   269  Check: 5875  Weight:  1.00
 Name: DQB1*02:20N      Len:   126  Check: 7947  Weight:  1.00
 Name: DQB1*02:21       Len:   269  Check: 5917  Weight:  1.00
--
--
     DQB1*02:17  .......... .......... .......... .......DFV YQFKGMCYFT
    DQB1*02:18N  .......... .......... .......... .......DFV YQFKGMCYFT
     DQB1*02:19  .......... .......... .......... .......DFV YQFKGMCYFT
    DQB1*02:20N  .......... .......... .......... .......DFV YQFKGMCYFT
     DQB1*02:21  .......... .......... .......... .......DFV YQFKGMCYFT
--
--
     DQB1*02:17  NGTERVRLVS RSIYNREEIV RFDSDVGEFR AVTLLGLPAT EYWNSQKDIL
    DQB1*02:18N  NGTERVRLVS RSIYNREEIV RFDSDVGEFR AVTLLGLPAA EYX
     DQB1*02:19  NGTERVRLVS RSIYNREEIV RFDSDVGEFR AVTLLGLPAA EYWNSQKDIL
    DQB1*02:20N  NGTERVRLVS RSIYNREEIV RFDSDVGEFR AVTLLGLPAA EYWNSQKDIL
     DQB1*02:21  NGTERVRLVS RSIYNREEIV RFDSDVGEFR AVRLLGLPAA EYWNSQKDIL
--
--
     DQB1*02:17  ERKRAAVDRV CRHNYQLELR TTLQRR.... .......... ..........

     DQB1*02:19  ERKPAAVDRV CRHNYQLELR TTLQRR.... .......... ..........
    DQB1*02:20N  ERKRAAVDRV CRHNYQLELR TTLQRX
     DQB1*02:21  ERKRAAVDRV CRHNYQLELR TTLQRR.... .......... ..........
--
--
     DQB1*02:17  .......... .......... .......... .......... ..........

     DQB1*02:19  .......... .......... .......... .......... ..........

     DQB1*02:21  .......... .......... .......... .......... ..........
--
--
     DQB1*02:17  .......... .......... .......... .......... ..........

     DQB1*02:19  .......... .......... .......... .......... ..........

     DQB1*02:21  .......... .......... .......... .......... ..........
--
--
     DQB1*02:17  .......... .........

     DQB1*02:19  .......... .........

     DQB1*02:21  .......... .........

Not only are the two truncated sequences (DQB1*02:18N and DQB1*02:20N) not dot-padded, but in later blocks their entire line is often left blank, without even the sequence identifier present! This is most problematic for straightforward parsing of the file, and does seem to be invalid formatting.
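
One way to detect the blank rows without relying on blank lines as block separators is to count how many rows each identifier has in the alignment body. This is a rough sketch under that assumption; `rows_missing_per_name` is a hypothetical helper, and the sample is an abridged copy of the `grep` output above.

```python
def rows_missing_per_name(body_lines, names):
    """Return {name: number_of_missing_rows} for names absent from some blocks."""
    counts = {name: 0 for name in names}
    for line in body_lines:
        fields = line.split()
        # Only lines whose first token is a declared name count as rows.
        if fields and fields[0] in counts:
            counts[fields[0]] += 1
    expected = max(counts.values(), default=0)
    return {name: expected - n for name, n in counts.items() if n < expected}


body = """\
     DQB1*02:17  .......... .......... .......... .......DFV YQFKGMCYFT
    DQB1*02:18N  .......... .......... .......... .......DFV YQFKGMCYFT
     DQB1*02:19  .......... .......... .......... .......DFV YQFKGMCYFT

     DQB1*02:17  NGTERVRLVS RSIYNREEIV RFDSDVGEFR AVTLLGLPAT EYWNSQKDIL
    DQB1*02:18N  NGTERVRLVS RSIYNREEIV RFDSDVGEFR AVTLLGLPAA EYX
     DQB1*02:19  NGTERVRLVS RSIYNREEIV RFDSDVGEFR AVTLLGLPAA EYWNSQKDIL

     DQB1*02:17  ERKRAAVDRV CRHNYQLELR TTLQRR.... .......... ..........

     DQB1*02:19  ERKPAAVDRV CRHNYQLELR TTLQRR.... .......... ..........
""".splitlines()

names = ["DQB1*02:17", "DQB1*02:18N", "DQB1*02:19"]
print(rows_missing_per_name(body, names))
```

Counting rows only reveals rows that are absent altogether; a row that is present but short (such as the one ending in `EYX`) still counts as present.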

It is my understanding that the original GCG tools would use leading or trailing tilde (~) for gaps representing missing data, and dots (.) for internal gaps for alignment.
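
To make that convention concrete, here is a small sketch that rewrites leading and trailing dots as tildes while leaving internal dots alone. It only illustrates the convention as described in this comment; `mark_terminal_gaps` is a hypothetical name and is not taken from GCG or any other tool.

```python
def mark_terminal_gaps(seq, internal=".", terminal="~"):
    """Replace leading and trailing `internal` gap characters with `terminal`."""
    if not seq.strip(internal):
        # The whole row is gaps (or empty): treat all of it as missing data.
        return terminal * len(seq)
    lead = len(seq) - len(seq.lstrip(internal))
    trail = len(seq) - len(seq.rstrip(internal))
    return terminal * lead + seq[lead:len(seq) - trail] + terminal * trail


print(mark_terminal_gaps("..AB.C..."))
```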

Edit: URL https://github.com/ANHIG/IMGTHLA/blob/3320/msf/DQB1_prot.msf

peterjc commented 5 years ago

Another example, older branch 3300 aka v3.30.0

https://github.com/ANHIG/IMGTHLA/blob/3300/msf/A_prot.msf

(Edit: Changed from raw download link)

jrob119 commented 5 years ago

For the MSF files, the sequence length varies for a number of reasons: premature stop codons, unsequenced regions or deletions. Traditionally these were not padded, as this was not necessary for the program ReadSeq, which was used to generate the MSF files. This is the reason for the varying lengths in earlier releases. Depending on the use of the MSF file, this was not causing any issues for our users. Later releases should show additional padding to provide standardised lengths in the MSF entries, following a request. We need to look at the scope of the issue, and the need for the earlier files to be regenerated, as most sequences are available in later releases.
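
For readers working with the unpadded files, padding can also be applied on the reading side. The sketch below is not how ReadSeq or the IPD-IMGT/HLA pipeline works; it is a hypothetical tolerant reader (`read_padded_alignment`) that assumes the standard `//` line separates header from alignment, and that any missing data is at the end of a sequence, as in the examples in this thread. The toy input uses made-up names (`seq1`, `seq2N`) and a made-up length.

```python
import re

MSF_RE = re.compile(r"^\s*MSF:\s*(\d+)")
NAME_RE = re.compile(r"^\s*Name:\s*(\S+)")


def read_padded_alignment(lines, pad="."):
    """Parse MSF text, right-padding short or missing rows with `pad`."""
    alignment_len = None
    chunks = {}  # name -> list of residue strings, one per block
    in_body = False
    for line in lines:
        if not in_body:
            msf = MSF_RE.match(line)
            name = NAME_RE.match(line)
            if line.strip() == "//":
                in_body = True
            elif msf:
                alignment_len = int(msf.group(1))
            elif name:
                chunks[name.group(1)] = []
            continue
        fields = line.split()
        # Skip blank lines and anything not declared in the header.
        if fields and fields[0] in chunks:
            chunks[fields[0]].append("".join(fields[1:]))
    if alignment_len is None:
        raise ValueError("no 'MSF:' header line found")
    padded = {}
    for name, parts in chunks.items():
        seq = "".join(parts)
        if len(seq) > alignment_len:
            raise ValueError(f"{name} is longer than the declared alignment")
        padded[name] = seq.ljust(alignment_len, pad)
    return padded


toy = """\
!!AA_MULTIPLE_ALIGNMENT

   MSF: 25  Type: P  Check: 0 ..

 Name: seq1     Len:    25  Check: 0  Weight:  1.00
 Name: seq2N    Len:    13  Check: 0  Weight:  1.00

//

        seq1  NGTERVRLVS RSIYNREEIV
       seq2N  NGTERVRLVS RSX

        seq1  RFDSD
""".splitlines()

aligned = read_padded_alignment(toy)
for name, seq in aligned.items():
    print(f"{name:>6}  {seq}")
```

Right-padding after concatenation is only correct because a truncated sequence has no further residues in later blocks; a file with gaps in the middle of a row's blocks would need per-block padding instead.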

peterjc commented 5 years ago

Over the last few releases, it looks like only branch 3360 aka v3.36.0 has a problem (msf/C_gen.msf).

peterjc commented 5 years ago

I would think fixing msf/C_gen.msf from branch 3360 aka v3.36.0 (released 2019-04) would suffice - and fixing the root cause of this regression of course.

I see little value in retrospectively fixing the older releases without user interest.

jrob119 commented 5 years ago

The msf/C_gen.msf file for release 3.36.0 has been updated; the older releases have been left as they are.

peterjc commented 5 years ago

Thank you 🙏