edgraham / GhostKoalaParser

Parser for Ghost Koala

InterProScan results can't be incorporated if PANTHER analysis is run, gene_caller_id is missing #2

Closed brymerr921 closed 6 years ago

brymerr921 commented 6 years ago

Hi,

I downloaded InterProScan by following the instructions at https://github.com/ebi-pf-team/interproscan/wiki/HowToDownload and downloaded the PANTHER dataset as well (Step 2). I then used InterProScan to annotate several of my genomes. This parser works on GhostKOALA annotations alone, but it fails on my InterProScan annotations if the PANTHER files were available to InterProScan. The command I used to run InterProScan is:

./interproscan.sh -cpu 16 -f tsv --goterms --iprlookup --pathways -i protein-sequences.fa -o interproscan-results.txt

I also cloned this repository and am running the scripts inside a conda environment (python 2.7) with pandas 0.22.0 and Biopython 1.70 also installed. When I run KEGG-to-anvio, I get this error message:

KEGG-to-anvio --KeggDB KO_Orthology_ko00001.txt -i user_ko.txt -o KeggAnnotations-AnviImportable.txt --interproscan interproscan-results.txt
Traceback (most recent call last):
  File "/home/bmerrill/miniconda3/envs/ghostkoala/bin/KEGG-to-anvio", line 41, in <module>
    interpro = pd.read_table(arg_dict["interproscan"],header=None)
  File "/home/bmerrill/miniconda3/envs/ghostkoala/lib/python2.7/site-packages/pandas/io/parsers.py", line 709, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/bmerrill/miniconda3/envs/ghostkoala/lib/python2.7/site-packages/pandas/io/parsers.py", line 455, in _read
    data = parser.read(nrows)
  File "/home/bmerrill/miniconda3/envs/ghostkoala/lib/python2.7/site-packages/pandas/io/parsers.py", line 1069, in read
    ret = self._engine.read(nrows)
  File "/home/bmerrill/miniconda3/envs/ghostkoala/lib/python2.7/site-packages/pandas/io/parsers.py", line 1839, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 902, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 924, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 978, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 965, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 2208, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 11 fields in line 3, saw 15

I've attached the files I used for these commands (with PANTHER enabled) that resulted in this error message. protein-sequences.fa.txt interproscan-results.txt user_ko.txt
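
For readers hitting the same `ParserError`: the message says the reader settled on 11 fields and then met a line with 15, which suggests the TSV is ragged (different rows carry different numbers of tab-separated fields). A minimal sketch for confirming that on any TSV is below. It uses only the Python 3 standard library; the function name `field_count_histogram` and the sample rows are made up for illustration and are not part of this repository or of InterProScan.

```python
import csv
import io
from collections import Counter


def field_count_histogram(handle, delimiter="\t"):
    """Map number-of-fields -> number of lines for a delimited text file.

    Hypothetical helper for illustration only.
    """
    # QUOTE_NONE treats quote characters as ordinary text, which is
    # usually what you want for tab-separated annotation tables.
    reader = csv.reader(handle, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    # Blank lines come back as empty lists and are skipped.
    return Counter(len(row) for row in reader if row)


# Made-up sample: two 11-field rows followed by one 15-field row,
# mirroring "Expected 11 fields in line 3, saw 15".
row11 = "\t".join(["x"] * 11)
row15 = "\t".join(["x"] * 15)
sample = "\n".join([row11, row11, row15]) + "\n"

counts = field_count_histogram(io.StringIO(sample))
print(sorted(counts.items()))  # [(11, 2), (15, 1)]
```

On a real file, `field_count_histogram(open("interproscan-results.txt", newline=""))` would show whether more than one field count is present.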

When InterProScan does not have access to the PANTHER files, KEGG-to-anvio parses the InterProScan results without error and writes the output file (KeggAnnotations-AnviImportable-nopanther.txt):

KEGG-to-anvio --KeggDB KO_Orthology_ko00001.txt -i user_ko.txt -o KeggAnnotations-AnviImportable-nopanther.txt --interproscan interproscan-results-nopanther.txt

However, looking at KeggAnnotations-AnviImportable-nopanther.txt, the gene_caller_id column appears to be empty for every row whose source is "KeggGhostKoala".

InterProScan annotations with PANTHER disabled: interproscan-results-nopanther.txt

Output of KEGG-to-anvio (using above command): KeggAnnotations-AnviImportable-nopanther.txt
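
A quick way to quantify the empty-ID symptom is to count, per source, the rows whose ID field is blank. The sketch below is one way to do that with the Python 3 standard library. It assumes the output is a tab-separated file with a header row containing columns named `gene_caller_id` and `source`; those names are taken from the description above and may not match the real header exactly, so both are parameters. The helper name and the sample rows are hypothetical.

```python
import csv
import io
from collections import Counter


def count_missing_ids(handle, id_column="gene_caller_id", source_column="source"):
    """Count rows with a blank ID, grouped by annotation source.

    Column names are assumptions; adjust them to the actual header.
    """
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    missing = Counter()
    for row in reader:
        # Treat absent, empty, and whitespace-only IDs as missing.
        if not (row.get(id_column) or "").strip():
            missing[row.get(source_column) or "unknown"] += 1
    return missing


# Made-up sample with one complete row and one row lacking an ID.
demo = (
    "gene_caller_id\tsource\taccession\tfunction\te_value\n"
    "1\tPfam\tPF00001\tdemo\t0\n"
    "\tKeggGhostKoala\tK00001\tdemo\t0\n"
)

print(count_missing_ids(io.StringIO(demo)))  # Counter({'KeggGhostKoala': 1})
```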

Do you have any suggestions for how to fix this? Thanks for the great parser, I'm excited to use it!

Best, Bryan

edgraham commented 6 years ago

Hello Bryan,

I believe I see where the issue is. I just updated the GitHub version to account for this (it was a small issue that I hadn't run into since I wasn't using the PANTHER database!). Just pull the newest version and you should be good to go! If you have further issues, let me know!

-- Elaina
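
For anyone who cannot update right away, one possible workaround (not necessarily the change made in the repository) is to pad every row of the InterProScan TSV to a fixed width before parsing, so the table becomes rectangular. The sketch below assumes a maximum of 15 fields, taken from the "saw 15" in the error message. The name `pad_rows` and the sample data are hypothetical, and the code targets Python 3. With pandas, passing an explicit list of column names through the `names` argument of the reader is another option worth trying.

```python
import csv
import io


def pad_rows(handle, width=15, delimiter="\t"):
    """Yield each non-blank row padded with empty strings to `width` fields.

    Hypothetical helper; width=15 is an assumption based on the error text.
    """
    reader = csv.reader(handle, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    for line_number, row in enumerate(reader, start=1):
        if not row:
            continue  # skip blank lines
        if len(row) > width:
            raise ValueError(
                "line %d has %d fields, more than %d" % (line_number, len(row), width)
            )
        yield row + [""] * (width - len(row))


# Made-up sample: an 11-field row followed by a 15-field row.
ragged = "\t".join(["a"] * 11) + "\n" + "\t".join(["b"] * 15) + "\n"
padded = list(pad_rows(io.StringIO(ragged)))

print([len(row) for row in padded])  # [15, 15]
```

The padded rows can be written back out with `csv.writer(out, delimiter="\t", quoting=csv.QUOTE_NONE, quotechar=None)` and passed to the parser as usual. If a field ever contains a tab or newline the writer will raise an error, which is preferable to silently altering the data.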

brymerr921 commented 6 years ago

Thanks, this works great when PANTHER results are present!