WatsonLab / PULpy

Open prediction of Polysaccharide Utilisation Loci (PUL)
20 stars 9 forks source link

problem with GH prediction #1

Closed arielamadio closed 5 years ago

arielamadio commented 5 years ago

I was trying to run the pipeline and not worked properly. It worked for the prediction of sus genes, but not for GH results (dbcan) Trying to understand the pipeline, I found that hmmscan-parser.sh can't be run on the tabular output of hmmscan (-domtblout), and should be used with standard output, and the pipeline could not create files .dm.ps and dm.ps.filtered I modified the snakefile to work with the original results, and worked, but broked the next steps since the tabular out is not what predict_puls.R expect. I'm I missing something? Can you please check? Also, the website for DBCAN changed to http://cys.bios.niu.edu/dbCAN/download/ Or the version of hmmscan-parser.sh in that website is different? Thanks Ariel

mw55309 commented 5 years ago

Yes, so that is the correct version of dbCAN (version 1).

Checking my results, hmmscan-parser.sh does run on the .dm output:

$ ./dbcan_data/hmmscan-parser.sh dbcan/GCA_900177595.1_Emticicia_sp._MM.out.dm
CE1.hmm 227     FXVG01000001.1_110      277     5.3e-11 27      210     29      259     0.806167400881057
GT2.hmm 168     FXVG01000001.1_121      233     3.5e-19 1       130     4       127     0.767857142857143
GT2.hmm 168     FXVG01000001.1_1        215     1.4e-17 3       165     8       177     0.964285714285714
AA3.hmm 618     FXVG01000001.1_125      572     3.3e-50 86      401     9       566     0.509708737864078
CE6.hmm 99      FXVG01000001.1_151      982     5.9e-08 45      97      240     291     0.525252525252525
CE6.hmm 99      FXVG01000001.1_152      904     8.8e-07 35      97      224     287     0.626262626262626
GT81.hmm        293     FXVG01000001.1_16       458     8.7e-57 23      267     2       226     0.832764505119454
CE1.hmm 227     FXVG01000001.1_26       291     2.1e-06 83      190     129     257     0.47136563876652
CE1.hmm 227     FXVG01000001.1_32       262     9.2e-47 1       221     22      254     0.969162995594714
GH63.hmm        570     FXVG01000001.1_92       881     5.1e-20 355     564     423     873     0.366666666666667
GT28.hmm        157     FXVG01000002.1_4        375     6e-08   57      124     255     324     0.426751592356688
CE1.hmm 227     FXVG01000003.1_15       268     4.2e-41 1       220     24      257     0.964757709251101
GH63.hmm        570     FXVG01000003.1_20       443     1.5e-42 348     565     29      441     0.380701754385965
GH95.hmm        722     FXVG01000004.1_12       980     2.8e-230        91      719     322     962     0.869806094182826
CE6.hmm 99      FXVG01000004.1_12       980     7.3e-08 11      95      79      172     0.848484848484849
GH117.hmm       211     FXVG01000004.1_13       366     2.2e-17 42      116     105     190     0.350710900473934
GH117.hmm       211     FXVG01000004.1_13       366     7.5e-23 50      209     192     347     0.753554502369668
GH117.hmm       211     FXVG01000004.1_14       342     3.5e-18 40      113     96      169     0.345971563981043
GH117.hmm       211     FXVG01000004.1_14       342     5.2e-30 48      210     170     324     0.767772511848341
GH29.hmm        346     FXVG01000004.1_5        414     6.2e-75 35      340     46      383     0.88150289017341
GH95.hmm        722     FXVG01000004.1_7        761     8.3e-100        307     717     297     723     0.56786703601108
GH29.hmm        346     FXVG01000004.1_8        558     4.5e-58 33      335     19      325     0.872832369942196

Which version of HMMSCAN are you using? Ours is:

$ hmmscan -h
# hmmscan :: search sequence(s) against a profile database
# HMMER 3.2 (June 2018); http://hmmer.org/
# Copyright (C) 2018 Howard Hughes Medical Institute.
# Freely distributed under the BSD open source license.
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

I should alter the YAML to code the versions I think....

Mick

arielamadio commented 5 years ago

That's not what I get as result

When I run hmmscan-parser.sh on the .dm result I get no output

On the original hmmscan output I get a different table

hmmscan-parser.sh Bacteroide.out

NODE_10_length_225403_cov_25.8012_117 CBM40.hmm 0.0001 88 174 242 323 0.480446927374302 NODE_10_length_225403_cov_25.8012_118 GH18.hmm 1.7e-13 48 211 58 268 0.550675675675676 NODE_10_length_225403_cov_25.8012_123 GH3.hmm 8.7e-71 2 215 123 350 0.986111111111111 NODE_10_length_225403_cov_25.8012_124 CE3.hmm 3.4e-06 88 188 231 336 0.515463917525773 NODE_10_length_225403_cov_25.8012_128 CBM40.hmm 0.00012 87 174 235 322 0.486033519553073 NODE_10_length_225403_cov_25.8012_129 CBM32.hmm 5.2e-11 14 122 216 331 0.870967741935484 NODE_10_length_225403_cov_25.8012_130 GH18.hmm 2.4e-13 77 211 114 271 0.452702702702703 NODE_10_length_225403_cov_25.8012_15 CBM44.hmm 6e-06 18 57 236 273 0.609375 NODE_10_length_225403_cov_25.8012_156 GT4.hmm 5.3e-28 7 149 198 330 0.8875 NODE_10_length_225403_cov_25.8012_157 GT4.hmm 1.6e-32 8 153 169 314 0.90625 NODE_10_length_225403_cov_25.8012_159 GT4.hmm 1.1e-28 12 154 227 364 0.8875

So the table looks different. I will check if the problem is hmmscan-parser.sh or I also have version 3.1b1 of hmmer, I will update and see.

The hmmscan-parser.sh script is:

!/usr/bin/env sh

Yanbin Yin

08/18/2011

hmmscan output parser

Usage: sh hmmscan-parser.sh hmmscan-output-file

1. take hmmer3 output and generate the tabular output

2. sort on the 6th and 7th cols

3. remove overlapped/redundant hmm matches and keep the one with the lower e-values

4. calculate the covered fraction of hmm (make sure you have downloaded the "all.hmm.ps.len" file to the same directory of this perl script)

5. apply the E-value cutoff and the covered faction cutoff

cat $1 | perl -e 'while(>){if(/^\/\//){$x=join("",@a);($q)=($x=~/^Query:\s+(\S+)/m);while($x=~/^> (\S+.*?\n\n)/msg){$a=$&;@c=split(/\n/,$a);$c[0]=~s/>> //;for($i=3;$i<=$#c;$i++){@d=split(/\s+/,$c[$i]);print $q."\t".$c[0]."\t$d[6]\t$d[7]\t$d[8]\t$d[10]\t$d[11]\n" if $d[6]<1;}}@a=();}else{push(@a,$);}}' \ | sort -k 1,1 -k 6,6n -k 7,7n | uniq \ | perl -e 'while(<>){chomp;@a=split;next if $a[-1]==$a[-2];push(@{$b{$a[0]}},$);}foreach(sort keys %b){@a=@{$b{$}};for($i=0;$i$#a;$i++){@b=split(/\t/,$a[$i]);@c=split(/\t/,$a[$i+1]);$len1=$b[-1]-$b[-2];$len2=$c[-1]-$c[-2];$len3=$b[-1]-$c[-2];if($len30 and ($len3/$len1>0.5 or $len3/$len2>0.5)){if($b[2]<$c[2]){splice(@a,$i+1,1);}else{splice(@a,$i,1);}$i=$i-1;}}foreach(@a){print $."\n";}}' \ | uniq | perl -e 'open(IN,"all.hmm.ps.len");while(IN>){chomp;@a=split;$b{$a[0]}=$a[1];}while(<){chomp;@a=split;$r=($a[4]-$a[3])/$b{$a[1]};print $."\t".$r."\n";}' \ | perl -e 'while(<>){@a=split(/\t/,$);if(($a[-1]-$a[-2])>80){print $ if $a[2]<1e-5;}else{print $ if $a[2]<1e-3;}}' | awk '$NF>0.3'


De: Mick Watson notifications@github.com Enviado: miércoles, 17 de octubre de 2018 12:46 Para: WatsonLab/PULpy Cc: Ariel Amadio; Author Asunto: Re: [WatsonLab/PULpy] problem with GH prediction (#1)

Yes, so that is the correct version of dbCAN (version 1).

Checking my results, hmmscan-parser.sh does run on the .dm output:

$ ./dbcan_data/hmmscan-parser.sh dbcan/GCA_900177595.1_Emticicia_sp._MM.out.dm CE1.hmm 227 FXVG01000001.1_110 277 5.3e-11 27 210 29 259 0.806167400881057 GT2.hmm 168 FXVG01000001.1_121 233 3.5e-19 1 130 4 127 0.767857142857143 GT2.hmm 168 FXVG01000001.1_1 215 1.4e-17 3 165 8 177 0.964285714285714 AA3.hmm 618 FXVG01000001.1_125 572 3.3e-50 86 401 9 566 0.509708737864078 CE6.hmm 99 FXVG01000001.1_151 982 5.9e-08 45 97 240 291 0.525252525252525 CE6.hmm 99 FXVG01000001.1_152 904 8.8e-07 35 97 224 287 0.626262626262626 GT81.hmm 293 FXVG01000001.1_16 458 8.7e-57 23 267 2 226 0.832764505119454 CE1.hmm 227 FXVG01000001.1_26 291 2.1e-06 83 190 129 257 0.47136563876652 CE1.hmm 227 FXVG01000001.1_32 262 9.2e-47 1 221 22 254 0.969162995594714 GH63.hmm 570 FXVG01000001.1_92 881 5.1e-20 355 564 423 873 0.366666666666667 GT28.hmm 157 FXVG01000002.1_4 375 6e-08 57 124 255 324 0.426751592356688 CE1.hmm 227 FXVG01000003.1_15 268 4.2e-41 1 220 24 257 0.964757709251101 GH63.hmm 570 FXVG01000003.1_20 443 1.5e-42 348 565 29 441 0.380701754385965 GH95.hmm 722 FXVG01000004.1_12 980 2.8e-230 91 719 322 962 0.869806094182826 CE6.hmm 99 FXVG01000004.1_12 980 7.3e-08 11 95 79 172 0.848484848484849 GH117.hmm 211 FXVG01000004.1_13 366 2.2e-17 42 116 105 190 0.350710900473934 GH117.hmm 211 FXVG01000004.1_13 366 7.5e-23 50 209 192 347 0.753554502369668 GH117.hmm 211 FXVG01000004.1_14 342 3.5e-18 40 113 96 169 0.345971563981043 GH117.hmm 211 FXVG01000004.1_14 342 5.2e-30 48 210 170 324 0.767772511848341 GH29.hmm 346 FXVG01000004.1_5 414 6.2e-75 35 340 46 383 0.88150289017341 GH95.hmm 722 FXVG01000004.1_7 761 8.3e-100 307 717 297 723 0.56786703601108 GH29.hmm 346 FXVG01000004.1_8 558 4.5e-58 33 335 19 325 0.872832369942196

Which version of HMMSCAN are you using? Ours is:

$ hmmscan -h

hmmscan :: search sequence(s) against a profile database

HMMER 3.2 (June 2018); http://hmmer.org/

Copyright (C) 2018 Howard Hughes Medical Institute.

Freely distributed under the BSD open source license.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

I should alter the YAML to code the versions I think....

Mick

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://github.com/WatsonLab/PULpy/issues/1#issuecomment-430682114, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AghLUniS03zjVepit-EUCqEalh2whkTIks5ul1DbgaJpZM4XkNEz.

mw55309 commented 5 years ago

If you send me some example input files (mick.watson@roslin.ed.ac.uk) I can double check it works in your data on our system :)