biocore / microprot

structural annotation pipeline for microbial genomes and metagenomes
BSD 3-Clause "New" or "Revised" License
1 stars 6 forks source link

BUG: split_search skips non_match sequences if --frag-len too large #42

Closed tkosciol closed 7 years ago

tkosciol commented 7 years ago

split_search.py produces no output, i.e. 0 match and non_match files for this outfile:

Query         NZ_GG666849.1_2_251-330 # 798 # 2885 # -1 # ID=1_2;partial=00;start_type=TTG;rbs_motif=AGxAGG/AGGxGG;rbs_spacer=5-10bp;gc_cont=0.499
Match_columns 80
No_of_seqs    1 out of 1
Neff          1
Searched_HMMs 36595
Date          Sun May 21 18:41:09 2017
Command       hhsearch -i ./02-split_pdb/2/NZ_GG666849.1_2_251-330.fasta -d /projects/microprot/dbs/pdb70/pdb70 -e 0.1 -o ./03-search_cm/NZ_GG666849.1_2_251-330.out -oa3m ./03-search_cm/NZ_GG666849.1_2_251-330.a3m

 No Hit                             Prob E-value P-value  Score    SS Cols Query HMM  Template HMM
  1 2j27_A Triosephosphate isomera  27.5      24 0.00065   24.6   0.0   30   26-55     88-121 (250)
  2 4y96_A Triosephosphate isomera  26.6      25  0.0007   24.7   0.0   29   26-54     92-124 (255)
  3 4o4v_A Triosephosphate isomera  26.5      26  0.0007   24.6   0.0   30   26-55     87-120 (252)

I used parameters: split_search.py -e 0.1 -l 40 expected behavior: there are no match sequences, so the entire input sequence goes to non_match, because it's longer than 40 residues

sjanssen2 commented 7 years ago

Either I don't understand the problem, or I am not able to reproduce it properly. The following execution (microProt) barnacle x86_64 ~/MicroProt/microprot/microprot/scripts>python split_search.py -o testme -e 0.1 -l 40 /projects/microprot/benchmarking/snakemake_test/03-search_cm/NZ_GG666849.1_2_251-330.out /projects/microprot/benchmarking/snakemake_test/02-split_pdb/2/NZ_GG666849.1_2_251-330.fasta

Leads to the creation of two files: testme.match which is empty and testme.non_match with the content:

>NZ_GG666849.1_2_251-330_1-80 # 798 # 2885 # -1 # ID=1_2;partial=00;start_type=TTG;rbs_motif=AGxAGG/AGGxGG;rbs_spacer=5-10bp;gc_cont=0.499
IGIQGDTYSEDEDYPELPRTANGRLSSYILVNHKEQVHVYNQIATKLGLQKESGEVVMLPSQFINRFSLRNEHGRGIPDQ

Can you elaborate on how you produced the unwanted behaviour and what exactly it is?

tkosciol commented 7 years ago

wow... ok, it seems that you got the correct behavior. You understand my problem correctly. Let me try to reproduce my error once again

sjanssen2 commented 7 years ago

I have the feeling that - if you specify --subseq_fp - no files are produced

sjanssen2 commented 7 years ago

I double checked. No, those files get produced.

tkosciol commented 7 years ago

I messed up. Indeed,

cd /projects/microprot/benchmarking/snakemake_test
python /projects/microprot/microprot/scripts/split_search.py 03-search_cm/NZ_GG666849.1_2_251-330.out 02-split_pdb/2/NZ_GG666849.1_2_251-330.fasta -o 04-split_cm/x -e 0.1 -l 40

works just fine.

The reason why my example didn't work is that I just gave folder path (-o 04-split_cm/), not folder and out_root.

sjanssen2 commented 7 years ago

I was also a bit surprised, that no files where generated in directory -o root/04-split_cm, but there where files root/04-split_cm.match and root/04-split_cm.nonmatch. Should we change the behaviour of our function, if even the developer get's confused?

tkosciol commented 7 years ago

probably a good idea! :)

sjanssen2 commented 7 years ago

I was also a bit surprised, that no files where generated in directory -o root/04-split_cm, but there where files root/04-split_cm.match and root/04-split_cm.nonmatch. Should we change the behaviour of our function, if even the developer get's confused?

tkosciol commented 7 years ago

resolved by #44