edgraham / GhostKoalaParser

Parser for Ghost Koala

Errors when combining InterProScan and KEGG annotations #3

Closed brymerr921 closed 6 years ago

brymerr921 commented 6 years ago

Hi,

I'm using the following command (see below for command and output file excerpt) to merge KEGG and InterProScan (+ PANTHER) results and noticed the following:

  1. The gene_caller_id number is missing for all of the rows that have "KeggGhostKoala" as the source

  2. There's an extra row between the KEGG annotations and the InterProScan annotations. The row is: accession_id 0 KeggGhostKoala

  3. Several rows are identical (see entries for genes 1140, 1038). I don't know whether these will be imported as duplicates into Anvi'o. It might be nice (or potentially unnecessary) to output only unique rows in the output file.

  4. "gene_caller_id" needs to be "gene_callers_id" or Anvi'o won't accept it using the function anvi-import-functions.
    Error:

    Config Error: The file 'test_new_version_panther.txt' does not contain the right type of
              header. It was expected to have these: 'gene_callers_id, source, accession,
              function, e_value', however it had these: 'accession, e_value, function, source'
  5. Anvi'o seems to not like the order gene_callers_id accession e_value function source of the columns when combining InterProScan and KEGG annotations. With KEGG alone, the column order is:
    gene_caller_id source accession function e_value, as is recommended here. Anvi'o error:

    Config Error: Mapping funciton '<class 'float'>' did not like the value 'KeggGhostKoala' in
              column number 5 of the input matrix 'test_new_version_panther.txt' :/
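
Points 4 and 5 can be checked before running anvi-import-functions. The following is a minimal sketch, not part of GhostKoalaParser or Anvi'o: it compares the first row of a tab-separated file against the column names and order listed in the Config Error above. The function name `check_header` is hypothetical.

```python
import csv
import io

# Column names and order copied from the Anvi'o Config Error quoted above.
EXPECTED = ["gene_callers_id", "source", "accession", "function", "e_value"]


def check_header(handle, expected=EXPECTED):
    """Return (missing, unexpected, order_ok) for the first row of a TSV.

    Hypothetical helper for illustration; it only inspects the header row.
    """
    header = next(csv.reader(handle, delimiter="\t"), [])
    missing = [c for c in expected if c not in header]
    unexpected = [c for c in header if c not in expected]
    return missing, unexpected, header == list(expected)


# Header as it appears in the excerpt of the combined output file.
sample = io.StringIO("gene_caller_id\taccession\te_value\tfunction\tsource\n")
missing, unexpected, order_ok = check_header(sample)
print("missing:", missing)
print("unexpected:", unexpected)
print("order matches:", order_ok)
```

For a real file, pass an open handle instead of the in-memory sample, e.g. `open("test_new_version_panther.txt", newline="")`.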

Thanks for your help!

Best, Bryan

Command used: KEGG-to-anvio --KeggDB /home/bmerrill/Applications/GhostKoalaParser/samples/KO_Orthology_ko00001.txt -i user_ko.txt --interproscan interproscan-results-panther.txt -o test_new_version_panther.txt

Excerpt of test_new_version_panther.txt:

gene_caller_id  accession   e_value function    source
.......
        K21636  0        nrdD; ribonucleoside-triphosphate reductase (formate) [EC:1.1.98.6]    KeggGhostKoala
        K21636  0        nrdD; ribonucleoside-triphosphate reductase (formate) [EC:1.1.98.6]    KeggGhostKoala
        K21681  0        bcs1; ribitol-5-phosphate 2-dehydrogenase (NADP+) / D-ribitol-5-phosphate cytidylyltransferase [EC:1.1.1.405 2.7.7.40] KeggGhostKoala
        K22132  0               KeggGhostKoala
        accession_id    0               KeggGhostKoala
1140    PTHR35148       2.5E-13         PANTHER
1140    PTHR35148       2.5E-13         PANTHER
1140    PF14322 2.6E-12 Starch-binding associating with outer membrane  Pfam
1140    PF07980 1.8E-23 SusD family     Pfam
1140    G3DSA:1.25.40.10        2.2E-25         Gene3D
1140    SSF48452        3.29E-78                SUPERFAMILY
1140    SSF48452        3.29E-78                SUPERFAMILY
1140    SSF48452        3.29E-78                SUPERFAMILY
1038    PTHR22778       2.0E-35         PANTHER
1038    SSF53597        4.08E-50                SUPERFAMILY
1038    cd00209 7.05196E-65     DHFR    CDD
1038    PR00070 1.0E-12 Dihydrofolate reductase signature       PRINTS
1038    PR00070 1.0E-12 Dihydrofolate reductase signature       PRINTS
1038    PR00070 1.0E-12 Dihydrofolate reductase signature       PRINTS
1038    PS51330 52.876  Dihydrofolate reductase (DHFR) domain profile.  ProSiteProfiles
1038    PF00186 2.7E-57 Dihydrofolate reductase Pfam
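
As a stopgap for points 3 to 5, the combined file could be post-processed before import. The sketch below assumes the file is tab-separated, as the excerpt suggests. It renames `gene_caller_id` to `gene_callers_id`, reorders columns to the order named in the Config Error, drops exact duplicate rows, and skips rows with no gene ID. The function name `normalize` is hypothetical, and skipping ID-less rows discards those KEGG annotations, so this is a cleanup for inspection rather than a substitute for a parser fix.

```python
# Column order taken from the Anvi'o Config Error quoted in the issue.
ANVIO_COLUMNS = ["gene_callers_id", "source", "accession", "function", "e_value"]


def normalize(lines):
    """Yield rows (lists of strings) in Anvi'o column order.

    Hypothetical helper: expects tab-separated lines with a header row that
    contains every Anvi'o column (gene_caller_id is accepted as an alias).
    """
    lines = iter(lines)
    first = next(lines, None)
    if first is None:
        return  # empty input, nothing to yield
    header = first.rstrip("\r\n").split("\t")
    header = ["gene_callers_id" if c == "gene_caller_id" else c for c in header]
    index = {name: i for i, name in enumerate(header)}
    yield list(ANVIO_COLUMNS)
    seen = set()
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        fields += [""] * (len(header) - len(fields))  # pad short rows
        record = tuple(fields[index[c]].strip() for c in ANVIO_COLUMNS)
        if not record[0] or record in seen:
            continue  # no gene ID, or an exact duplicate of an earlier row
        seen.add(record)
        yield list(record)


# A few rows adapted from the excerpt above, with tabs written explicitly.
sample = [
    "gene_caller_id\taccession\te_value\tfunction\tsource\n",
    "\taccession_id\t0\t\tKeggGhostKoala\n",
    "1140\tPTHR35148\t2.5E-13\t\tPANTHER\n",
    "1140\tPTHR35148\t2.5E-13\t\tPANTHER\n",
    "1140\tPF07980\t1.8E-23\tSusD family\tPfam\n",
]
for row in normalize(sample):
    print("\t".join(row))
```

A set of already-seen tuples keeps the first occurrence of each row and preserves the original row order, which makes the result easy to diff against the parser's output.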
edgraham commented 6 years ago

Whoops, that was on me! I forgot to take out a portion of the script to account for the quick fix I made this morning. I just fixed my fix and it should produce the gene call ids now for KEGG (ping me again if it runs into any errors!)

brymerr921 commented 6 years ago

Thanks for the fix. The gene numbers have appeared!

However, anvi-import-functions is still giving errors because of # 2 (extra row) and # 4 (gene_caller_id) mentioned above.

edgraham commented 6 years ago

So the gene callers id should now be fixed, but it doesn't produce an extra row on my machine when I use the example files you provided in the previous comment.

brymerr921 commented 6 years ago

Thanks so much. I think the extra row has to do with the header "contig accession_id" at the beginning of my user_ko.txt file. When I remove this header line in my "user_ko.txt", the extra line in the combined KEGG+interproscan output goes away and Anvi'o is happy.

edgraham commented 6 years ago

Ahh yes the output directly from GhostKoala doesn't have a header row so I didn't account for that when I wrote the parser.

brymerr921 commented 6 years ago

The instructions for the tutorial (http://merenlab.org/2018/01/17/importing-ghostkoala-annotations/) mention that it is necessary to add a header line to the user_ko.txt file that isn't there by default. I think the parser works great on the default user_ko.txt file downloaded straight from GhostKoala. I forgot to mention earlier that I was adding this header line while following the tutorial instructions... sorry about that!

"Now run this command on your terminal to add the necessary header line to this file:"

echo -e "contig\taccession_id" > .temp && cat user_ko.txt >> .temp && mv .temp user_ko.txt

In my hands, when I add this header line I get an extra line in the output (containing the header) and Anvi'o can't import the functions. When I remove the header line from user_ko.txt (or never add it), all is fine. Thanks for the fixes!
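
For anyone who has already added the header, it can be removed again before running the parser. This is a minimal sketch that drops a leading `contig<TAB>accession_id` row if one is present and leaves the file contents untouched otherwise. The function name `strip_header` and the gene IDs in the sample are made up for illustration.

```python
def strip_header(lines, header=("contig", "accession_id")):
    """Return the lines without a leading header row, if one is present.

    Hypothetical helper; the default header matches the one added by the
    tutorial's echo command.
    """
    lines = list(lines)
    if lines and tuple(lines[0].rstrip("\r\n").split("\t")) == tuple(header):
        return lines[1:]
    return lines


# Illustrative input; the gene IDs are invented, the KO numbers come from the excerpt.
with_header = ["contig\taccession_id\n", "genecall_1\tK21636\n", "genecall_2\tK21681\n"]
print(strip_header(with_header))
```

To apply it to a real file, read the lines with `open("user_ko.txt").readlines()`, pass them through `strip_header`, and write the result back out.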

edgraham commented 6 years ago

I did notice that section of the tutorial after sending that. I’ll make a point to amend that so it isn’t an issue for others. Thanks for pointing these things out!

On Jan 30, 2018, at 12:43 PM, brymerr921 wrote: (quoted text identical to the previous comment)

brymerr921 commented 6 years ago

You're welcome! Thanks for making this and making it available. It's already been very helpful and I'm sure many will use it.

kevinxchan commented 6 years ago

I think this is still an issue, as the tutorial still has that section (see here). However, as Bryan mentioned, running KEGG-to-anvio without adding in the headers works fine for me as well.