TimoLassmann / kalign

A fast multiple sequence alignment program.
GNU General Public License v3.0

Version 3.4.0 seems to break clustal format parsers #41

Closed: biomadeira closed this issue 6 months ago

biomadeira commented 7 months ago

Dear Timo

I have noticed that with 3.4.0 the -f clu output is noticeably different than with the previous version (tested 3.3.1). This breaks our Clustal format parsers and prevents us from upgrading to version 3.4.0.

I think parsers generally expect the "CLUSTAL multiple sequence alignment..." header, which has been changed to "Kalign (3.4.0) multiple sequence alignment". Additionally, the new format emits much longer sequence names, which seems to be the main reason our parsers fail. For example, "sp|P69905|HBA_HUMAN MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHG" is now output as "sp|P69905|HBA_HUMAN Hemoglobin subunit alpha OS=Homo sapiens GN=HBA1 PE=1 SV=2 MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHG".
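To illustrate the two failure modes described above, here is a minimal sketch of the kind of strict checks a Clustal parser might perform. The function names are hypothetical and this is not the reporter's actual parser; it only assumes a parser that requires a "CLUSTAL" header and exactly two whitespace-separated fields per row.

```python
def has_clustal_header(first_line):
    """Hypothetical header check: many parsers require the file to start with 'CLUSTAL'."""
    return first_line.startswith("CLUSTAL")


def parse_row_strict(line):
    """Hypothetical strict row parser: expects exactly 'name sequence'."""
    fields = line.split()
    if len(fields) != 2:
        raise ValueError(f"expected 2 fields, got {len(fields)}")
    return fields[0], fields[1]


# Rows taken from the report: 3.3.1 style and 3.4.0 style.
old_row = "sp|P69905|HBA_HUMAN      MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHG"
new_row = (
    "sp|P69905|HBA_HUMAN Hemoglobin subunit alpha OS=Homo sapiens GN=HBA1 PE=1 SV=2"
    "      MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHG"
)

if __name__ == "__main__":
    for row in (old_row, new_row):
        try:
            print("parsed:", parse_row_strict(row)[0])
        except ValueError as err:
            # The long description contains spaces, so the row no longer has two fields.
            print("failed:", err)
```

Under these assumptions, the spaces inside the long description are what break the row parser, independently of the header change.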

Best regards!

TimoLassmann commented 7 months ago

Thanks for reporting.

Can you clarify: is the issue that the older version shortened the sequence names as present in the input fasta file, while the new(er) version doesn't? I can add an option to modify this behavior but it would be difficult to come up with a uniform rule to deal with long sequence names. Users may need the long names for downstream pipelines.

biomadeira commented 7 months ago

I only noticed this with 3.4.0. For example, using the same input sequences, version 3.3.1 gives this alignment:

CLUSTAL multiple sequence alignment by Kalign (3.3.1)

sp|P69905|HBA_HUMAN      MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHG
sp|P01942|HBA_MOUSE      MVLSGEDKSNIKAAWGKIGGHGAEYGAEALERMFASFPTTKTYFPHFDVSHGSAQVKGHG
sp|P13786|HBAZ_CAPHI     MSLTRTERTIILSLWSKISTQADVIGTETLERLFSCYPQAKTYFPHFDLHSGSAQLRAHG

sp|P69905|HBA_HUMAN      KKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTP
sp|P01942|HBA_MOUSE      KKVADALASAAGHLDDLPGALSALSDLHAHKLRVDPVNFKLLSHCLLVTLASHHPADFTP
sp|P13786|HBAZ_CAPHI     SKVVAAVGDAVKSIDNVTSALSKLSELHAYVLRVDPVNFKFLSHCLLVTLASHFPADFTA

sp|P69905|HBA_HUMAN      AVHASLDKFLASVSTVLTSKYR
sp|P01942|HBA_MOUSE      AVHASLDKFLASVSTVLTSKYR
sp|P13786|HBAZ_CAPHI     DAHAAWDKFLSIVSGVLTEKYR

whereas with 3.4.0 we get:

Kalign (3.4.0) multiple sequence alignment

sp|P69905|HBA_HUMAN Hemoglobin subunit alpha OS=Homo sapiens GN=HBA1 PE=1 SV=2      MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHG
sp|P01942|HBA_MOUSE Hemoglobin subunit alpha OS=Mus musculus GN=Hba PE=1 SV=2       MVLSGEDKSNIKAAWGKIGGHGAEYGAEALERMFASFPTTKTYFPHFDVSHGSAQVKGHG
sp|P13786|HBAZ_CAPHI Hemoglobin subunit zeta OS=Capra hircus GN=HBZ1 PE=3 SV=2      MSLTRTERTIILSLWSKISTQADVIGTETLERLFSCYPQAKTYFPHFDLHSGSAQLRAHG

sp|P69905|HBA_HUMAN Hemoglobin subunit alpha OS=Homo sapiens GN=HBA1 PE=1 SV=2      KKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTP
sp|P01942|HBA_MOUSE Hemoglobin subunit alpha OS=Mus musculus GN=Hba PE=1 SV=2       KKVADALASAAGHLDDLPGALSALSDLHAHKLRVDPVNFKLLSHCLLVTLASHHPADFTP
sp|P13786|HBAZ_CAPHI Hemoglobin subunit zeta OS=Capra hircus GN=HBZ1 PE=3 SV=2      SKVVAAVGDAVKSIDNVTSALSKLSELHAYVLRVDPVNFKFLSHCLLVTLASHFPADFTA

sp|P69905|HBA_HUMAN Hemoglobin subunit alpha OS=Homo sapiens GN=HBA1 PE=1 SV=2      AVHASLDKFLASVSTVLTSKYR
sp|P01942|HBA_MOUSE Hemoglobin subunit alpha OS=Mus musculus GN=Hba PE=1 SV=2       AVHASLDKFLASVSTVLTSKYR
sp|P13786|HBAZ_CAPHI Hemoglobin subunit zeta OS=Capra hircus GN=HBZ1 PE=3 SV=2      DAHAAWDKFLSIVSGVLTEKYR

I understand the extended sequence header might be useful for some people, but it breaks our parsers, and perhaps others (I haven't checked). The original sequence headers are available from the input sequences anyway, so they are perhaps not much needed in the output. In any case, if you believe the output still conforms to the Clustal format, you can close this and I will check what we can do on our end.
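For readers who want to handle this on the parser side, the following is a rough sketch of a more tolerant reader. It is not part of kalign or of any existing library. It assumes that the first non-blank line is a header (whatever its wording), that the last whitespace-separated token on each row is the sequence chunk, that rows carry no trailing residue counts, and that the first token of the name is a unique identifier. The function name `parse_alignment_tolerant` is hypothetical.

```python
def parse_alignment_tolerant(text):
    """Parse Clustal-like blocks into {short_id: sequence}.

    Assumptions (not guaranteed by any specification):
    - the first non-blank line is a header and is skipped;
    - the last whitespace-separated token of a row is the sequence chunk;
    - the first token of the name is a unique identifier.
    """
    sequences = {}
    header_seen = False
    for line in text.splitlines():
        if not line.strip():
            continue  # blank line between blocks
        if not header_seen:
            header_seen = True  # accept "CLUSTAL ..." or "Kalign ..." alike
            continue
        if line[0].isspace():
            continue  # conservation line, if present
        parts = line.rstrip().rsplit(None, 1)
        if len(parts) != 2:
            continue  # not a 'name sequence' row
        name, chunk = parts
        short_id = name.split()[0]
        sequences[short_id] = sequences.get(short_id, "") + chunk
    return sequences
```

Splitting from the right keeps the sequence intact even when the name contains spaces, at the cost of discarding the description after the first token.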

TimoLassmann commented 7 months ago

Hi, makes sense. I haven't looked at the parsing code in a while, so it will take some time to add an option for this. For now, if your input is FASTA, you could remove the long names before calling kalign: sed -i '/^>/ s/ .*//' test.fa
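For pipelines that preprocess input in Python rather than with sed, the same header trimming can be sketched as below. This is only an illustration of what the sed expression does (on lines starting with ">", drop everything from the first space onward); the function name `shorten_fasta_headers` is hypothetical.

```python
def shorten_fasta_headers(lines):
    """Yield FASTA lines with header lines cut at the first space.

    Mirrors: sed '/^>/ s/ .*//'
    """
    for line in lines:
        if line.startswith(">"):
            newline = "\n" if line.endswith("\n") else ""
            # Keep only the part before the first space, then restore the newline.
            yield line.rstrip("\n").split(" ", 1)[0] + newline
        else:
            yield line
```

A file object can be passed directly, for example `shorten_fasta_headers(open("test.fa"))`, and the result written to a new file. Unlike `sed -i`, this sketch does not modify the input in place.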