dib-lab / dammit

just annotate it, dammit!
http://dib-lab.github.io/dammit/

0.2.9 hmmscan issue #60

Closed macmanes closed 8 years ago

macmanes commented 8 years ago

There is an easy-to-fix issue with the way you are handling the individual hmmscan jobs.

The command:

dammit annotate Mytilus.fasta \
--user-databases /mnt/data3/macmanes/dammit_databases/tcdb.fasta /mnt/data3/macmanes/dammit_databases/Crassostrea_gigas.GCA_000297895.1.31.ncrna.fa \
--busco-group metazoa --n_threads 35 --full --database-dir /mnt/data3/macmanes/dammit_databases/

The error:

...
          [ ] hmmscan:longest_orfs.pep.x.Pfam-A.hmm

          [ ] remap_hmmer:longest_orfs.pep.pfam.tbl

Some tasks failed![dammit.annotate:ERROR]
TaskError - taskid:remap_hmmer:longest_orfs.pep.pfam.tbl[dammit.annotate:ERROR]
PythonAction Error
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/doit/action.py", line 383, in execute
    returned_value = self.py_callable(*self.args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/dammit/tasks.py", line 472, in cmd
    hmmer_df = pd.concat(parsers.hmmscan_to_df_iter(hmmer_filename))
  File "/usr/local/lib/python2.7/dist-packages/pandas/tools/merge.py", line 812, in concat
    copy=copy)
  File "/usr/local/lib/python2.7/dist-packages/pandas/tools/merge.py", line 842, in __init__
    objs = list(objs)
  File "/usr/local/lib/python2.7/dist-packages/dammit/parsers.py", line 361, in hmmscan_to_df_iter
    yield build_df(data)
  File "/usr/local/lib/python2.7/dist-packages/dammit/parsers.py", line 339, in build_df
    convert_dtypes(df, dict(hmmscan_cols))
  File "/usr/local/lib/python2.7/dist-packages/dammit/parsers.py", line 115, in convert_dtypes
    df[c] = df[c].astype(dtypes[c])
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/generic.py", line 2632, in astype
    dtype=dtype, copy=copy, raise_on_error=raise_on_error, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/internals.py", line 2864, in astype
    return self.apply('astype', dtype=dtype, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/internals.py", line 2823, in apply
    applied = getattr(b, f)(**kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/internals.py", line 430, in astype
    values=values, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/internals.py", line 472, in _astype
    values = com._astype_nansafe(values.ravel(), dtype, copy=True)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/common.py", line 2463, in _astype_nansafe
    return lib.astype_intsafe(arr.ravel(), dtype).reshape(arr.shape)
  File "pandas/lib.pyx", line 935, in pandas.lib.astype_intsafe (pandas/lib.c:16612)
  File "pandas/src/util.pxd", line 60, in util.set_value_at (pandas/lib.c:67514)
ValueError: invalid literal for long() with base 10: '5.3e+02'
[dammit.annotate:ERROR]

The issue:

Looking at longest_orfs.pep.pfam.tbl, you get some hmmscan header info, then many empty lines, then the tabular output you actually want. This pattern repeats over and over again.

$ more longest_orfs.pep.pfam.tbl
#                                                                                --- full sequence --- -------------- this domain -------------   hmm coord   ali coord   env coord
# target name        accession   tlen query name               accession   qlen   E-value  score  bias   #  of  c-Evalue  i-Evalue  score  bias  from    to  from    to  from    to  acc description of target
#------------------- ---------- -----     -------------------- ---------- ----- --------- ------ ----- --- --- --------- --------- ------ ----- ----- ----- ----- ----- ----- ----- ---- ---------------------
#
# Program:         hmmscan
# Version:         3.1b1 (May 2013)
# Pipeline mode:   SCAN
# Query file:      -
# Target file:     /mnt/data3/macmanes/dammit_databases/Pfam-A.hmm
# Option settings: /usr/bin/hmmscan -o Mytilus.fasta.transdecoder_dir/longest_orfs.pep.pfam.tbl.out --domtblout Mytilus.fasta.transdecoder_dir/longest_orfs.pep.pfam.tbl -E 1e-05 --cpu 1 /mnt/data3/macmanes/dammit_databases/Pfam-A.hmm -
# Current dir:     /mouse/Mytilus/dammit/Mytilus.fasta.dammit
# Date:            Wed Apr 20 08:30:21 2016
# [ok]

...

Ig_3                 PF13927.2     69 Transcript_20063|m.13138 -            202   4.8e-09   36.6   0.9   1   3   4.3e-07    0.0014   19.1   0.1    12    67    10    91     2    94 0.78 Immunoglobulin domain
Ig_3                 PF13927.2     69 Transcript_20063|m.13138 -            202   4.8e-09   36.6   0.9   2   3   0.00023      0.76   10.3   0.0    11    28   120   141   111   167 0.76 Immunoglobulin domain
Ig_3                 PF13927.2     69 Transcript_20063|m.13138 -            202   4.8e-09   36.6   0.9   3   3     0.027        89    3.7   0.0    10    35   167   187   160   199 0.59 Immunoglobulin domain
...

Turns out, even with --domtblout you get some extra stuff in there that you probably don't want when merging multiple runs together.
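
For anyone hitting the same thing, here is a minimal sketch of a tolerant reader for a merged table like this. It is not dammit's parser; the function name iter_domtbl_rows and the constant N_FIXED_FIELDS are made up for illustration. It assumes the 22 whitespace-separated columns shown in the header above, followed by a free-text description, and it simply skips comment lines, blank lines, and rows that are too short to be complete.

```python
import io

# Assumption: a --domtblout data row has 22 whitespace-separated fields
# followed by a free-text description (as in the header shown above).
N_FIXED_FIELDS = 22


def iter_domtbl_rows(handle):
    """Yield a list of fields per data row, skipping '#' lines and blanks."""
    for line in handle:
        line = line.strip()
        if not line or line.startswith('#'):
            # Header/footer blocks repeat once per merged hmmscan run.
            continue
        # Split only the fixed columns; the description may contain spaces.
        fields = line.split(None, N_FIXED_FIELDS)
        if len(fields) < N_FIXED_FIELDS:
            # Truncated or interleaved row: skip it rather than mis-cast it.
            continue
        yield fields


# Tiny in-memory example: one comment line, one blank line, one short row.
demo = io.StringIO("# [ok]\n\nIg_3 PF13927.2 69\n")
demo_rows = list(iter_domtbl_rows(demo))  # nothing here is a complete row
```

Skipping short rows only hides the damage, of course. It does not recover results that were never written to the file.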

macmanes commented 8 years ago

Also, the longest_orfs.pep.pfam.tbl.out file is similarly messed up. I'm also concerned that the large chunks of empty lines are places where real results should appear. Only about 10% of my proteins are in the output. The other 90% seem to have gotten lost somewhere in the mix.

camillescott commented 8 years ago

Yeah, I plan on pushing a new release today -- this error is fixed and tested in the current master branch. The fix is basically to write to /dev/stdout instead of to a file.
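
To illustrate the idea (a rough sketch only, not the actual code in master): each job gets a command line whose table goes to /dev/stdout, so whatever launches the jobs can collect each job's output separately instead of having every job write to the same path. The helper name hmmscan_argv is hypothetical, and sending the human-readable report to /dev/null with -o is an assumption made here so it does not mix with the table. The flags themselves are the ones from the option settings in the report above.

```python
def hmmscan_argv(hmm_db, evalue=1e-05, cpu=1):
    """Build an hmmscan command that reads stdin and writes the table to stdout.

    Hypothetical helper for illustration; not part of dammit.
    """
    return [
        'hmmscan',
        '-o', '/dev/null',             # assumption: discard the text report
        '--domtblout', '/dev/stdout',  # table goes to stdout, not a shared file
        '-E', str(evalue),
        '--cpu', str(cpu),
        hmm_db,
        '-',                           # read query sequences from stdin
    ]
```

The resulting list could be handed to something like subprocess.run with stdout captured, one call per chunk of input sequences.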

camillescott commented 8 years ago

This is now fixed in 0.3. A couple things to note though: