Closed tkosciol closed 7 years ago
could you provide full input and output files plus precise execution commands? I have the feeling that you want the script to interpret file name parts as start and stop positions and use that to vary the output file names. I am not sure if we are clever enough to code that up for all situations and it might be a source of very unpredictive behaviour "just" for the purpose of more intuitively readable filenames. Might be a too big sacrifice?!
You've got a point, but in principle, it should be fairly easy, as we should be looking at the last _
.
Example: 1)
cd /projects/microprot/benchmarking/snakemake_test
python /projects/microprot/microprot/scripts/split_search.py -p 0.95 -l 10 ./01-search_pdb/2.out 2.faa -o ./02-split_pdb/2
produces:
cat 02-split_pdb/2.non_match
>NZ_GG666849.1_2_1-222 # 798 # 2885 # -1 # ID=1_2;partial=00;start_type=TTG;rbs_motif=AGxAGG/AGGxGG;rbs_spacer=5-10bp;gc_cont=0.499
MKRLLTNPLEKQMGRSVNDSELEHFMLRYVLAYPTVYVVYSGKTNRYSMKNEYTVYVGETNDIRHRTFQHLERDAKEREDWKAVADAVSRNDDAYRQYVISHPKFNKSLTLDVENRLMHYMSSTDSVKRLNNRRTNAQGDYYTADQFDRIFSQIWLRLHKEDSDLFPAEEIIRDSALFKASPFHKLSDSQLEAEESILVQLAALVADSRQQKKADEQSLDRH
>NZ_GG666849.1_2_251-330 # 798 # 2885 # -1 # ID=1_2;partial=00;start_type=TTG;rbs_motif=AGxAGG/AGGxGG;rbs_spacer=5-10bp;gc_cont=0.499
IGIQGDTYSEDEDYPELPRTANGRLSSYILVNHKEQVHVYNQIATKLGLQKESGEVVMLPSQFINRFSLRNEHGRGIPDQ
>NZ_GG666849.1_2_381-696 # 798 # 2885 # -1 # ID=1_2;partial=00;start_type=TTG;rbs_motif=AGxAGG/AGGxGG;rbs_spacer=5-10bp;gc_cont=0.499
QSSQQWDPAVLEKLQLDSDSTQTACATGKIGAFKAVDLGAEYHLPSISVDVAHVHLKKQFRIAASKSMIQWIDRFASGKGIGPLPVDTGEYAADGEVIREPYDVKVFDSPVDLFNAIHAKAQLKPDGWDGAGLSRLLATYDWKYSAKSENQNDPHGFWNVELHRDTSGVWRMGLADDHGFDHSAISSDQFCKPWNNQLEDSAPKSRKGIDKELAWAEKPYTIDEIGSIFTIQGFDLNYAGVIIGPSATYRNGEIVFDAKASQNYLATNKRKDLGDFAEENLKHELNVLLKRGVHGLYLFAVDEGLQQRLKECCIH*
2)
# first, we split non_match
python /projects/microprot/microprot/scripts/process_fasta.py -i 02-split_pdb/2.non_match --split -o 02-split_pdb/2
# then, we do hhsearch for each output fasta, e.g.
hhsearch -i ./02-split_pdb/2/NZ_GG666849.1_2_251-330.fasta -d /projects/microprot/dbs/pdb70/pdb70 -e 0.1 -o /03-search_cm/NZ_GG666849.1_2_251-330.out -oa3m ./03-search_cm/NZ_GG666849.1_2_251-330.a3m
# finally, I do split_search on hhsearch output
python /projects/microprot/microprot/scripts/split_search.py 03-search_cm/NZ_GG666849.1_2_251-330.out 02-split_pdb/2/NZ_GG666849.1_2_251-330.fasta -o 04-split_cm/out -e 0.1 -l 40
produces:
cat 04-split_cm/out.non_match
>NZ_GG666849.1_2_251-330_1-80 # 798 # 2885 # -1 # ID=1_2;partial=00;start_type=TTG;rbs_motif=AGxAGG/AGGxGG;rbs_spacer=5-10bp;gc_cont=0.499
IGIQGDTYSEDEDYPELPRTANGRLSSYILVNHKEQVHVYNQIATKLGLQKESGEVVMLPSQFINRFSLRNEHGRGIPDQ
whereas I'd expect:
>NZ_GG666849.1_2_251-330 # 798 # 2885 # -1 # ID=1_2;partial=00;start_type=TTG;rbs_motif=AGxAGG/AGGxGG;rbs_spacer=5-10bp;gc_cont=0.499
IGIQGDTYSEDEDYPELPRTANGRLSSYILVNHKEQVHVYNQIATKLGLQKESGEVVMLPSQFINRFSLRNEHGRGIPDQ
example: input file is called
AB001_2_100-250.fasta
. Aftersplit_search
we find residues50-100
in this sequence matching, so outputs should be:AB001_2_150-200
inmatch
andAB001_2_100-149
andAB001_2_201-250
innon_match
.NOT:
AB001_2_100-250_50-100
inmatch
, etc.