Closed biomadeira closed 6 months ago
Thanks for reporting.
Can you clarify: is the issue that the older version shortened the sequence names as present in the input fasta file, while the new(er) version doesn't? I can add an option to modify this behavior but it would be difficult to come up with a uniform rule to deal with long sequence names. Users may need the long names for downstream pipelines.
I only noticed this with 3.4.0. For example, using the same input sequences, with version 3.3.1 we get this alignment:
CLUSTAL multiple sequence alignment by Kalign (3.3.1)
sp|P69905|HBA_HUMAN MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHG
sp|P01942|HBA_MOUSE MVLSGEDKSNIKAAWGKIGGHGAEYGAEALERMFASFPTTKTYFPHFDVSHGSAQVKGHG
sp|P13786|HBAZ_CAPHI MSLTRTERTIILSLWSKISTQADVIGTETLERLFSCYPQAKTYFPHFDLHSGSAQLRAHG
sp|P69905|HBA_HUMAN KKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTP
sp|P01942|HBA_MOUSE KKVADALASAAGHLDDLPGALSALSDLHAHKLRVDPVNFKLLSHCLLVTLASHHPADFTP
sp|P13786|HBAZ_CAPHI SKVVAAVGDAVKSIDNVTSALSKLSELHAYVLRVDPVNFKFLSHCLLVTLASHFPADFTA
sp|P69905|HBA_HUMAN AVHASLDKFLASVSTVLTSKYR
sp|P01942|HBA_MOUSE AVHASLDKFLASVSTVLTSKYR
sp|P13786|HBAZ_CAPHI DAHAAWDKFLSIVSGVLTEKYR
whereas with 3.4.0 we get:
Kalign (3.4.0) multiple sequence alignment
sp|P69905|HBA_HUMAN Hemoglobin subunit alpha OS=Homo sapiens GN=HBA1 PE=1 SV=2 MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHG
sp|P01942|HBA_MOUSE Hemoglobin subunit alpha OS=Mus musculus GN=Hba PE=1 SV=2 MVLSGEDKSNIKAAWGKIGGHGAEYGAEALERMFASFPTTKTYFPHFDVSHGSAQVKGHG
sp|P13786|HBAZ_CAPHI Hemoglobin subunit zeta OS=Capra hircus GN=HBZ1 PE=3 SV=2 MSLTRTERTIILSLWSKISTQADVIGTETLERLFSCYPQAKTYFPHFDLHSGSAQLRAHG
sp|P69905|HBA_HUMAN Hemoglobin subunit alpha OS=Homo sapiens GN=HBA1 PE=1 SV=2 KKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTP
sp|P01942|HBA_MOUSE Hemoglobin subunit alpha OS=Mus musculus GN=Hba PE=1 SV=2 KKVADALASAAGHLDDLPGALSALSDLHAHKLRVDPVNFKLLSHCLLVTLASHHPADFTP
sp|P13786|HBAZ_CAPHI Hemoglobin subunit zeta OS=Capra hircus GN=HBZ1 PE=3 SV=2 SKVVAAVGDAVKSIDNVTSALSKLSELHAYVLRVDPVNFKFLSHCLLVTLASHFPADFTA
sp|P69905|HBA_HUMAN Hemoglobin subunit alpha OS=Homo sapiens GN=HBA1 PE=1 SV=2 AVHASLDKFLASVSTVLTSKYR
sp|P01942|HBA_MOUSE Hemoglobin subunit alpha OS=Mus musculus GN=Hba PE=1 SV=2 AVHASLDKFLASVSTVLTSKYR
sp|P13786|HBAZ_CAPHI Hemoglobin subunit zeta OS=Capra hircus GN=HBZ1 PE=3 SV=2 DAHAAWDKFLSIVSGVLTEKYR
I understand the extended sequence header might be useful for some people, but it breaks our parsers, and perhaps others; I haven't checked. The original sequence headers are available from the input sequences anyway, so they are perhaps not much needed in the output. In any case, if you believe the output still conforms to the Clustal format, then you can close this and I will check what we can do on our end.
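As a possible stopgap on the parser side, the long names could be trimmed after alignment. The sketch below is only an illustration based on the two outputs shown above: it assumes the first non-blank line is the header, that every alignment row starts in column one, and that the name is the first whitespace-separated token and the sequence chunk is the last. The function name `shorten_alignment_names` is hypothetical, and the header line is left unchanged.

```python
def shorten_alignment_names(text):
    """Reduce each alignment row to '<first token> <last token>'."""
    header_seen = False
    parsed = []  # (name, chunk) tuples for alignment rows, raw strings otherwise
    for line in text.splitlines():
        fields = line.split()
        if not header_seen and fields:
            # Assumption: the first non-blank line is the format header.
            header_seen = True
            parsed.append(line)
        elif len(fields) >= 2 and not line[0].isspace():
            # Assumption: name is the first token, sequence chunk the last.
            parsed.append((fields[0], fields[-1]))
        else:
            # Blank lines and indented lines are passed through as-is.
            parsed.append(line)
    # Pad names to a common width so the sequence columns line up.
    width = max((len(p[0]) for p in parsed if isinstance(p, tuple)), default=0)
    return "\n".join(
        p if isinstance(p, str) else f"{p[0]:<{width}} {p[1]}" for p in parsed
    )
```

Rows that end in a residue count, or names that are not unique in their first token, would need extra handling that this sketch does not attempt.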
Hi,
Makes sense. I haven't looked into the parsing code in a while and therefore it will take some time to add an option for this. For now, if your input is fasta you could remove the long names before calling kalign:
sed -i '/^>/ s/ .*//' test.fa
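If editing the input file in place is not desirable, roughly the same trimming can be done in a script before the sequences are handed to kalign. This is a minimal sketch of the idea behind the sed command above, not part of kalign itself; the function name `strip_fasta_descriptions` is made up for illustration.

```python
def strip_fasta_descriptions(lines):
    """Yield FASTA lines with each header cut at the first space."""
    for line in lines:
        if line.startswith(">"):
            # Like sed '/^>/ s/ .*//': drop everything from the first space on,
            # keeping the line ending if there was one.
            newline = "\n" if line.endswith("\n") else ""
            yield line.rstrip("\n").split(" ", 1)[0] + newline
        else:
            # Sequence lines are passed through untouched.
            yield line
```

For example, `open("short.fa", "w").writelines(strip_fasta_descriptions(open("test.fa")))` would write a trimmed copy to a new file (here the hypothetical `short.fa`) and leave `test.fa` unchanged.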
Dear Timo
I have noticed that with 3.4.0 the -f clu output is noticeably different from that of the previous version (tested 3.3.1). This breaks our Clustal format parsers and prevents us from upgrading to version 3.4.0. I think parsers generally expect the "CLUSTAL multiple sequence alignment..." header, which has been changed to "Kalign (3.4.0) multiple sequence alignment". Additionally, the new format outputs much longer sequence names, which seems to be the main reason our parsers fail. For example, "sp|P69905|HBA_HUMAN MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHG" is now output as "sp|P69905|HBA_HUMAN Hemoglobin subunit alpha OS=Homo sapiens GN=HBA1 PE=1 SV=2 MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHG"
Best regards!