BUG: `split_search` does not split correctly in some cases

tkosciol commented 6 years ago

While benchmarking domain splitting across different approaches, I noticed that the `split_search` script does not perform some splits correctly.

Example on Barnacle: /projects/microprot/results/domain_benchmark/pkg/01-PDB/T0831.{out,match,non_match}

HHsearch clearly hits a single PDB entry, which should be the match (residues "1-419" in the target). However, for an unknown reason, the script matched only residues "1-80" (i.e. only the first line of the alignment in the .out file) and left "81-419" as a non_match.

The problem then continues in ../02-CM, where we again hit a single PDB entry with 100% probability, but the method assigns a match only to residues "163-242" (i.e. only the second line of the alignment) and the rest is non_match.
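To illustrate what I would expect, here is a minimal sketch of collecting the full query range of one hit across all of its alignment blocks. This is not the code in `split_search`; the function name `query_span` and the regular expression are my own, and it assumes the usual HHsearch layout where each block has a `Q <name> <start> <seq> <end> (<length>)` line. The example lines are made up and the sequences are truncated for brevity.

```python
import re

# Assumed layout of a query sequence line in an HHsearch alignment block:
#   "Q T0831             1 TMEELLTSLQ...   80 (419)"
QUERY_LINE = re.compile(r"^Q\s+(\S+)\s+(\d+)\s+(\S+)\s+(\d+)\s+\((\d+)\)\s*$")

# Annotation rows that also start with "Q" but are not the query sequence.
SKIP = {"ss_pred", "ss_conf", "ss_dssp", "Consensus"}


def query_span(alignment_lines):
    """Return (start, end) of the query over ALL blocks of one hit, or None."""
    start, end = None, None
    for line in alignment_lines:
        m = QUERY_LINE.match(line)
        if m is None or m.group(1) in SKIP:
            continue
        block_start, block_end = int(m.group(2)), int(m.group(4))
        # Extend the span instead of keeping only one block.
        start = block_start if start is None else min(start, block_start)
        end = block_end if end is None else max(end, block_end)
    if start is None:
        return None
    return start, end


# Toy input: two blocks of the same hit (sequences shortened).
example = [
    "Q T0831             1 TMEELLTSLQ   80 (419)",
    "Q Consensus         1 ~~~~~~~~~~   80 (419)",
    "T 4QN1_A            1 TMEELLTSLQ   80 (419)",
    "",
    "Q T0831            81 IARHPGIPPT  160 (419)",
    "T 4QN1_A           81 IARHPGIPPT  160 (419)",
]
print(query_span(example))  # (1, 160)
```

If only the first (or only the second) block were used, the span would collapse to a single 80-residue line, which is the symptom described above.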

Parameters used:

split_PDB:
    params:
        min_prob: 95.0
        min_fragment_length: 40

split_CM:
    params:
        max_evalue: 0.1
        min_fragment_length: 40

Alternatively, the entire config can be viewed on Barnacle at: /projects/microprot/results/domain_benchmark/config.yml
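For reference, this is roughly how I picture thresholds like `min_prob` and `min_fragment_length` turning one hit into match and non_match ranges. It is only a sketch under my own assumptions: `split_by_hit` is a hypothetical helper, and applying `min_fragment_length` to the length of the hit is a guess, not necessarily what `split_search` does. The probability of 100.0 in the usage lines is just an illustrative value.

```python
def split_by_hit(seq_len, hit_start, hit_end, prob,
                 min_prob=95.0, min_fragment_length=40):
    """Split a sequence of length seq_len around one hit.

    Coordinates are 1-based and inclusive. Returns (match, non_match),
    each a list of (start, end) tuples.
    """
    hit_len = hit_end - hit_start + 1
    # Assumption: a hit is rejected if it is too weak or too short.
    if prob < min_prob or hit_len < min_fragment_length:
        return [], [(1, seq_len)]

    match = [(hit_start, hit_end)]
    non_match = []
    if hit_start > 1:
        non_match.append((1, hit_start - 1))        # left flank
    if hit_end < seq_len:
        non_match.append((hit_end + 1, seq_len))    # right flank
    return match, non_match


# Expected for T0831: the whole target is one match.
print(split_by_hit(419, 1, 419, 100.0))
# Observed: only the first alignment line is matched.
print(split_by_hit(419, 1, 80, 100.0))
```

With the hit range "1-419" nothing is left over as non_match, whereas the range "1-80" produces the "81-419" non_match fragment that the script currently writes.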

The fix is not very time-sensitive, but it would be nice to have before the end of the year.

tkosciol commented 6 years ago

A similar situation occurs for /projects/microprot/results/domain_benchmark/pkg/03-Pfam/T0836_1-204.out: I would expect "36-199" to be a match, but the script reports "84-199".

sjanssen2 commented 6 years ago

Hm, I am having a hard time reproducing this error. @tkosciol, could you try to fill in the missing parts in https://github.com/sjanssen2/microprot/tree/fix-split so that the unit test produces the wrong results described above?

tkosciol commented 6 years ago

Sure thing! I will try to reproduce this error and get back to you ASAP. Likely on Friday, though. I’ve got a full day tomorrow.

On Dec 13, 2017, 19:35 +0100, Stefan Janssen notifications@github.com, wrote:

hm, I have a hard time to reproduce this error. @tkosciol could you try to fill the missing parts in https://github.com/sjanssen2/microprot/tree/fix-split such that the unit test would produce the wrong results as described above?

sjanssen2 commented 6 years ago

from @tkosciol: Running on Barnacle:

/projects/microprot/results/domain_benchmark/PDB$ python ../../../microprot/scripts/split_search.py -p 95 -l 40 T0831.out ../pkg/01-PDB/T0831.fasta
match
>T0831_1-80 # 4QN1_A HPRH, Homo sapiens, 419 residues
TMEELLTSLQKKCGTECEEAHRQLVCALNGLAGIHIIKGEYALAAELYREVLRSSEEHKGKLKTDSLQRLHATHNLMELL
non_match
>T0831_81-419 HPRH, Homo sapiens, 419 residues
IARHPGIPPTLRDGRLEEEAKQLREHYMSKCNTEVAEAQQALYPVQQTIHELQRKIHSNSPWWLNVIHRAIEFTIDEELVQRVRNEITSNYKQQTGKLSMSEKFRDCRGLQFLLTTQMEELNKCQKLVREAVKNLEGPPSRNVIESATVCHLRPARLPLNCCVFCKADELFTEYESKLFSNTVKGQTAIFEEMIEDEEGLVDDRAPTTTRGLWAISETERSMKAILSFAKSHRFDVEFVDEGSTSMDLFEAWKKEYKLLHEYWMALRNRVSAVDELAMATERLRVRDPREPKPNPPVLHIIEPHEVEQNRIKLLNDKAVATSQLQKKLGQLLYLTNLEK

This is the wrong result. I also attach the match file:

>T0831_1-80 # 4QN1_A SHPRH, Homo sapiens, 419 residues
TMEELLTSLQKKCGTECEEAHRQLVCALNGLAGIHIIKGEYALAAELYREVLRSSEEHKGKLKTDSLQRLHATHNLMELL
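When comparing outputs like these, a small helper can read the residue ranges back from the FASTA headers and check that the match and non_match fragments together cover the whole target. This is a sketch, not part of microprot: `fragment_range` and `tiles_sequence` are hypothetical names, and it assumes headers of the form `>NAME_START-END ...` as in the output above.

```python
import re

# Assumed header layout: ">T0831_1-80 # 4QN1_A ..." or ">T0831_81-419 ..."
HEADER = re.compile(r"^>(\S+)_(\d+)-(\d+)(?:\s|$)")


def fragment_range(header):
    """Return (name, start, end) parsed from a fragment's FASTA header."""
    m = HEADER.match(header)
    if m is None:
        raise ValueError("no residue range in header: %r" % header)
    return m.group(1), int(m.group(2)), int(m.group(3))


def tiles_sequence(headers, seq_len):
    """True if the fragments cover 1..seq_len with no gap and no overlap."""
    ranges = sorted(fragment_range(h)[1:] for h in headers)
    expected_start = 1
    for start, end in ranges:
        if start != expected_start:
            return False
        expected_start = end + 1
    return expected_start == seq_len + 1


headers = [
    ">T0831_1-80 # 4QN1_A SHPRH, Homo sapiens, 419 residues",
    ">T0831_81-419 HPRH, Homo sapiens, 419 residues",
]
print(fragment_range(headers[0]))
print(tiles_sequence(headers, 419))
```

Note that the output above passes this coverage check: the fragments are consistent with each other, so the problem is where the match boundary is placed, not lost residues.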
