Closed peterjc closed 5 years ago
Going back to branch 3320, aka v3.32.0, another example:
$ head DQB1_prot.msf
!!AA_MULTIPLE_ALIGNMENT
MSF: 269 Type: P Apr 17, 2018 15:21 Check: 0 ..
Name: DQB1*02:01:01 Len: 269 Check: 6644 Weight: 1.00
Name: DQB1*02:01:02 Len: 269 Check: 5969 Weight: 1.00
Name: DQB1*02:01:03 Len: 269 Check: 5969 Weight: 1.00
Name: DQB1*02:01:04 Len: 269 Check: 5261 Weight: 1.00
Name: DQB1*02:01:05 Len: 269 Check: 5261 Weight: 1.00
Name: DQB1*02:01:06 Len: 269 Check: 5261 Weight: 1.00
This contains multiple sequences with a shorter declared length, e.g.
Name: DQB1*02:17 Len: 269 Check: 6596 Weight: 1.00
Name: DQB1*02:18N Len: 93 Check: 6792 Weight: 1.00
Name: DQB1*02:19 Len: 269 Check: 5875 Weight: 1.00
Name: DQB1*02:20N Len: 126 Check: 7947 Weight: 1.00
Name: DQB1*02:21 Len: 269 Check: 5917 Weight: 1.00
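A minimal Python sketch of how these shorter declared lengths could be listed, assuming the header lines follow the `Name: ... Len: ...` layout shown above. The function names are illustrative only and do not come from any existing tool.

```python
import re

# Matches header lines such as:
#   Name: DQB1*02:18N Len: 93 Check: 6792 Weight: 1.00
NAME_LINE = re.compile(r"^\s*Name:\s*(\S+)\s+Len:\s*(\d+)")


def declared_lengths(lines):
    """Return {sequence name: declared Len} from MSF header lines."""
    lengths = {}
    for line in lines:
        match = NAME_LINE.match(line)
        if match:
            lengths[match.group(1)] = int(match.group(2))
    return lengths


def shorter_than_longest(lengths):
    """Return the entries whose declared length is below the maximum."""
    longest = max(lengths.values())
    return {name: n for name, n in lengths.items() if n < longest}


# Header lines copied from the excerpt above.
header = """\
Name: DQB1*02:17 Len: 269 Check: 6596 Weight: 1.00
Name: DQB1*02:18N Len: 93 Check: 6792 Weight: 1.00
Name: DQB1*02:19 Len: 269 Check: 5875 Weight: 1.00
Name: DQB1*02:20N Len: 126 Check: 7947 Weight: 1.00
Name: DQB1*02:21 Len: 269 Check: 5917 Weight: 1.00
""".splitlines()

print(shorter_than_longest(declared_lengths(header)))
```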
Here the truncation in the alignment block is much more severe:
$ grep -C 2 "DQB1\*02:19" DQB1_prot.msf
Name: DQB1*02:17 Len: 269 Check: 6596 Weight: 1.00
Name: DQB1*02:18N Len: 93 Check: 6792 Weight: 1.00
Name: DQB1*02:19 Len: 269 Check: 5875 Weight: 1.00
Name: DQB1*02:20N Len: 126 Check: 7947 Weight: 1.00
Name: DQB1*02:21 Len: 269 Check: 5917 Weight: 1.00
--
--
DQB1*02:17 .......... .......... .......... .......DFV YQFKGMCYFT
DQB1*02:18N .......... .......... .......... .......DFV YQFKGMCYFT
DQB1*02:19 .......... .......... .......... .......DFV YQFKGMCYFT
DQB1*02:20N .......... .......... .......... .......DFV YQFKGMCYFT
DQB1*02:21 .......... .......... .......... .......DFV YQFKGMCYFT
--
--
DQB1*02:17 NGTERVRLVS RSIYNREEIV RFDSDVGEFR AVTLLGLPAT EYWNSQKDIL
DQB1*02:18N NGTERVRLVS RSIYNREEIV RFDSDVGEFR AVTLLGLPAA EYX
DQB1*02:19 NGTERVRLVS RSIYNREEIV RFDSDVGEFR AVTLLGLPAA EYWNSQKDIL
DQB1*02:20N NGTERVRLVS RSIYNREEIV RFDSDVGEFR AVTLLGLPAA EYWNSQKDIL
DQB1*02:21 NGTERVRLVS RSIYNREEIV RFDSDVGEFR AVRLLGLPAA EYWNSQKDIL
--
--
DQB1*02:17 ERKRAAVDRV CRHNYQLELR TTLQRR.... .......... ..........
DQB1*02:19 ERKPAAVDRV CRHNYQLELR TTLQRR.... .......... ..........
DQB1*02:20N ERKRAAVDRV CRHNYQLELR TTLQRX
DQB1*02:21 ERKRAAVDRV CRHNYQLELR TTLQRR.... .......... ..........
--
--
DQB1*02:17 .......... .......... .......... .......... ..........
DQB1*02:19 .......... .......... .......... .......... ..........
DQB1*02:21 .......... .......... .......... .......... ..........
--
--
DQB1*02:17 .......... .......... .......... .......... ..........
DQB1*02:19 .......... .......... .......... .......... ..........
DQB1*02:21 .......... .......... .......... .......... ..........
--
--
DQB1*02:17 .......... .........
DQB1*02:19 .......... .........
DQB1*02:21 .......... .........
Not only are these two sequences not dot padded, often their entire line is left blank without even the sequence identifier present! This is most problematic for straightforward parsing of the file, and does seem to be invalid formatting.
It is my understanding that the original GCG tools would use leading or trailing tilde (~) for gaps representing missing data, and dots (.) for internal gaps for alignment.
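As a rough illustration of why the missing rows complicate parsing, here is a small Python sketch of a tolerant reader. It assumes the sequence names are already known from the header, and that each alignment row is an identifier followed by whitespace-separated chunks. The helper `collect_rows` is hypothetical, not an existing API.

```python
def collect_rows(names, body_lines):
    """Concatenate alignment rows per sequence, tolerating missing rows.

    names: sequence identifiers taken from the MSF header.
    body_lines: the alignment block lines of the file.
    Lines whose first token is not a known name (blank lines,
    coordinate lines, grep separators) are skipped.
    """
    chunks = {name: [] for name in names}
    for line in body_lines:
        parts = line.split()
        if parts and parts[0] in chunks:
            chunks[parts[0]].extend(parts[1:])
    return {name: "".join(pieces) for name, pieces in chunks.items()}


# Two blocks taken from the grep output above; the second block has
# no row at all for DQB1*02:20N.
body = """\
DQB1*02:19 ERKPAAVDRV CRHNYQLELR TTLQRR.... .......... ..........
DQB1*02:20N ERKRAAVDRV CRHNYQLELR TTLQRX

DQB1*02:19 .......... .......... .......... .......... ..........
""".splitlines()

rows = collect_rows(["DQB1*02:19", "DQB1*02:20N"], body)
for name, seq in rows.items():
    print(name, len(seq))
```

A strict parser that expects every identifier in every block would fail on this input, whereas this approach simply ends up with sequences of unequal length.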
Edit: URL https://github.com/ANHIG/IMGTHLA/blob/3320/msf/DQB1_prot.msf
Another example, older branch 3300 aka v3.30.0
https://github.com/ANHIG/IMGTHLA/blob/3300/msf/A_prot.msf
(Edit: Changed from raw download link)
For the MSF files, the sequence length varies for a number of reasons: premature stop codons, unsequenced regions or deletions. Traditionally these sequences were not padded, as this was not necessary for the program ReadSeq, which was used to generate the MSF files. This is the reason for the varying lengths in earlier releases. Depending on the use of the MSF file, this was not causing any issues for our users. Later releases should show additional padding to provide standardised lengths in the MSF entries, following a request. We need to look at the scope of the issue, and the need for the earlier files to be regenerated, as most sequences are available in later releases.
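For users who need equal-length entries from the earlier, unpadded files, a minimal Python sketch of padding after parsing follows. The choice of fill character is an assumption: a dot matches the padding visible in the excerpts in this thread, while a tilde would follow the GCG convention for missing data. The function name is illustrative only.

```python
def pad_to_common_length(seqs, fill="."):
    """Right-pad every sequence to the length of the longest one.

    seqs: {name: sequence string}; fill: a single padding character
    (assumed here to be '.', could equally be '~').
    """
    target = max(len(seq) for seq in seqs.values())
    return {name: seq.ljust(target, fill) for name, seq in seqs.items()}


# Short illustrative fragments based on the DQB1 excerpt in this thread.
seqs = {
    "DQB1*02:19": "ERKPAAVDRVCRHNYQLELRTTLQRR....",
    "DQB1*02:20N": "ERKRAAVDRVCRHNYQLELRTTLQRX",
}

padded = pad_to_common_length(seqs)
for name, seq in padded.items():
    print(name, len(seq), seq)
```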
Over the last few releases, it looks like only branch 3360 aka v3.36.0 has a problem (msf/C_gen.msf).
I would think fixing msf/C_gen.msf from branch 3360 aka v3.36.0 (released 2019-04) would suffice - and fixing the root cause of this regression of course.
I see little value in retrospectively fixing the older releases without user interest.
The msf/C_gen.msf file for release 3.36.0 has been updated; the older releases have been left as they are.
Thank you 🙏
e.g. file msf/C_gen.msf on branch 3360 aka v3.36.0 which starts as follows:
Amongst these, we find two sequences are unusually declared to be shorter at 4482:
Near the end of the file we find:
Sequences C*05:206 and C*05:208N are one letter shorter than the rest, clearly visible here as a missing final dot.
Why are these two not given another trailing dot, giving the same number of letters (4483) as the rest?
Edit: URL https://github.com/ANHIG/IMGTHLA/blob/3360/msf/C_gen.msf