AnantharamanLab / METABOLIC

A scalable high-throughput metabolic and biogeochemical functional trait profiler

Discrepancy between HMMHit and FunctionHit results (v4.0) #129

Closed by jampoa 1 year ago

jampoa commented 1 year ago

Hello, there are some inconsistencies between my HMMHit and FunctionHit results that I don't understand. Some functions that consist of only one gene/HMM file are absent from all of my bins according to the FunctionHit spreadsheet, while that same gene is present in several bins according to the HMMHit spreadsheet (e.g. "acs" is absent from all bins according to FunctionHit, while it is present in 70 bins according to HMMHit). I am new to bioinformatics. Am I missing something here? Thanks!

ChaoLab commented 1 year ago

Can you send me the METABOLIC output folder by email (zczhou2017@gmail.com)? I need to check the results manually to find out whether there is really something wrong with my code. I will keep the data confidential, or you can rename the proteins/genomes/strains as you like.

jampoa commented 1 year ago

Many thanks for the quick response! I have sent you the output folder via email. I also included the METABOLIC_template_and_database input, as we added some functions to hmm_table_template.txt and MW-score_reaction_table.txt before running METABOLIC-C. There are some other problems (nutrient cycling graphs missing, additional error messages in the log) which might be associated with the manually added functions.

ChaoLab commented 1 year ago

Adding functions to hmm_table_template.txt and MW-score_reaction_table.txt before running METABOLIC might change the results and introduce discrepancies or mis-annotations into the final METABOLIC results. I suggest using the original template and database files without any modification.

jampoa commented 1 year ago

I see, thank you very much for your help! I will re-run the analysis with the original files.

jampoa commented 1 year ago

Hello again,

I re-ran the analysis with the original template files as suggested, which resolved the original problem.

However, the nutrient cycling plots are still not created, and I receive errors during the calculation of the total R and MW-score community coverage with gn_cov_percentage.

In #41 it was suggested that this can happen when some MAGs don't have a GTDB classification, so I checked those results, but all of my MAGs have been classified. The gtdbtk install (v2.1.0) seems fine, and gtdbtk test runs without errors. Do you have any idea where this could be coming from?

[2023-02-25 02:55:27] INFO: Note that Tk classification mode is insufficient for publication of new taxonomic designations. New designations should be based on one or more de novo trees, an example of which can be produced by Tk in de novo mode.
[2023-02-25 02:55:27] INFO: Done.
[2023-02-25 02:55:27] INFO: Removing intermediate files.
[2023-02-25 02:55:27] INFO: Intermediate files removed.
[2023-02-25 02:55:27] INFO: Done.
Use of uninitialized value $cat in concatenation (.) or string at /DATA2/conda_envs/METABOLIC/METABOLIC-C.pl line 1514. (warning repeated multiple times)
Use of uninitialized value in concatenation (.) or string at /DATA2/conda_envs/METABOLIC/METABOLIC-C.pl line 1537. (warning repeated multiple times)
[2023-02-25 02:55:51] Drawing energy flow chart finished
[2023-02-25 02:55:51] Calculating MW-score ...
Use of uninitialized value $cat in concatenation (.) or string at /DATA2/conda_envs/METABOLIC/METABOLIC-C.pl line 1655. (warning repeated multiple times)
Use of uninitialized value $cat in concatenation (.) or string at /DATA2/conda_envs/METABOLIC/METABOLIC-C.pl line 1682. (warning repeated multiple times)
[2023-02-25 02:55:55] Calculating MW-score is done
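
A log like the one above is easier to read once the repeated Perl warnings are grouped by script line. The following is a minimal Python sketch for that. It assumes the warnings follow the exact "Use of uninitialized value ... line N." wording shown in this thread; the function name `summarize_warnings` is made up for illustration and is not part of METABOLIC.

```python
import re
from collections import Counter

# Matches Perl "Use of uninitialized value" warnings as they appear in the
# log above; the variable name (e.g. $cat) is optional.
WARNING_RE = re.compile(
    r"Use of uninitialized value(?: (?P<var>\$\w+))? in concatenation \(\.\) "
    r"or string at (?P<script>\S+) line (?P<line>\d+)\."
)


def summarize_warnings(log_text):
    """Count warnings per (script, line number, variable name)."""
    # Collapse all whitespace so warnings wrapped across lines still match.
    flat = " ".join(log_text.split())
    counts = Counter()
    for match in WARNING_RE.finditer(flat):
        key = (match.group("script"), int(match.group("line")), match.group("var"))
        counts[key] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical shortened log, only for demonstration.
    sample = (
        "Use of uninitialized value $cat in concatenation (.) or string at "
        "/DATA2/conda_envs/METABOLIC/METABOLIC-C.pl line 1514. "
        "Use of uninitialized value in concatenation (.) or string at "
        "/DATA2/conda_envs/METABOLIC/METABOLIC-C.pl line 1537."
    )
    for (script, line, var), n in sorted(summarize_warnings(sample).items()):
        print(f"{script}:{line} variable={var} count={n}")
```

The distinct line numbers (1514, 1537, 1655, 1682 in this log) point to the places in METABOLIC-C.pl where a value was looked up but not found, which narrows down which input table is incomplete.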

ChaoLab commented 1 year ago

It seems that either something went wrong with the GTDB-Tk run or the GTDB-Tk result was not parsed correctly.
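
One way to check the first possibility is to compare the genome files against the GTDB-Tk summary tables. The sketch below is a hedged Python example, not METABOLIC code. It assumes the summary files are named `gtdbtk.<marker set>.summary.tsv` and contain `user_genome` and `classification` columns, that genome files end in `.fasta`, and that unclassified genomes carry a classification starting with "Unclassified". As far as I know, GTDB-Tk 2.x names the archaeal table `gtdbtk.ar53.summary.tsv` while older releases used `ar122`, so the sketch globs instead of hard-coding the name. Both directory arguments are placeholders.

```python
import csv
from pathlib import Path


def read_classifications(gtdbtk_dir):
    """Collect {genome: classification} from every summary TSV under gtdbtk_dir."""
    found = {}
    # Globbing avoids hard-coding the marker-set name, which is assumed to
    # differ between GTDB-Tk releases (e.g. ar122 vs. ar53).
    for summary in sorted(Path(gtdbtk_dir).rglob("gtdbtk.*.summary.tsv")):
        with open(summary, newline="") as handle:
            for row in csv.DictReader(handle, delimiter="\t"):
                found[row["user_genome"]] = row.get("classification") or ""
    return found


def find_unclassified(genome_dir, gtdbtk_dir, suffix=".fasta"):
    """Return genome names that have no usable classification."""
    genomes = sorted(p.stem for p in Path(genome_dir).glob(f"*{suffix}"))
    found = read_classifications(gtdbtk_dir)
    return [
        name for name in genomes
        if not found.get(name) or found[name].startswith("Unclassified")
    ]
```

An empty list from `find_unclassified` suggests that GTDB-Tk itself did its job, which leaves the parsing step as the more likely culprit.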

jampoa commented 1 year ago

Thanks for the reply! I checked again, but the log file as well as the GTDB-Tk install and results seem fine... What's even weirder is that I can't reproduce this error with metabolic test, whereas I encounter the same problem with all of my other samples.

ChaoLab commented 1 year ago

Hmm, this is strange. Is there something unusual about your genome files? For example, the file names or the header lines?

Here is the requirement for the genome files: ensure that the FASTA headers within each .fasta or .faa file are unique, and that your file names do not contain spaces (I suggest using only alphanumeric characters and underscores in the file names).
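
The requirement above can be checked automatically before a run. This is a small Python sketch under stated assumptions: header uniqueness is tested on the first whitespace-delimited word of each header line (a stricter reading than comparing whole lines), and the function names are illustrative only.

```python
import re
from collections import Counter
from pathlib import Path

# Only letters, digits, and underscores, as suggested for file names.
SAFE_NAME = re.compile(r"^[A-Za-z0-9_]+$")


def duplicate_headers(fasta_path):
    """Return header IDs that occur more than once in one FASTA file."""
    ids = Counter()
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                parts = line[1:].split()
                ids[parts[0] if parts else ""] += 1
    return sorted(seq_id for seq_id, n in ids.items() if n > 1)


def check_genome_dir(genome_dir, suffixes=(".fasta", ".faa")):
    """List file-name and header problems for all genome files in a folder."""
    problems = []
    for path in sorted(Path(genome_dir).iterdir()):
        if path.suffix not in suffixes:
            continue
        if not SAFE_NAME.match(path.stem):
            problems.append(
                f"{path.name}: file name contains characters other than "
                "letters, digits, and underscores"
            )
        for seq_id in duplicate_headers(path):
            problems.append(f"{path.name}: duplicate header ID {seq_id!r}")
    return problems
```

Calling `check_genome_dir("path/to/genomes")` returns an empty list when every file passes both checks.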

jampoa commented 1 year ago

Yes, my file names contain only the suggested characters, and both the file names and the header lines are unique. The headers look like this:
>c_000000000448
CGCTATACGAAGCCAAGGAACTGGGTCGCAACCGGGTGCGAAGTTATCGTCATGGTGATG

ChaoLab commented 1 year ago

How about this trick:
(1) Use the test data MAGs with your reads.
(2) Use your MAGs with the test data reads.
This can help to find out which part is wrong first. My guess is that something is wrong with the input MAGs.
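
The swap test is a small two-by-two experiment: MAGs from the test data or your own, crossed with reads from the test data or your own. The Python sketch below shows one way to interpret the outcomes. The labels "test" and "own" and the function `diagnose` are illustrative assumptions; the runs themselves still have to be done with METABOLIC-C.

```python
from itertools import product

SOURCES = ("test", "own")


def diagnose(results):
    """Interpret swap-test outcomes.

    results maps (mags_source, reads_source) to True when the run finished
    without the error and False when the error appeared.
    """
    failed = {combo for combo in product(SOURCES, SOURCES)
              if results.get(combo) is False}
    if not failed:
        return "error not reproduced"
    if failed == {("own", "test"), ("own", "own")}:
        return "own MAGs are the likely cause"
    if failed == {("test", "own"), ("own", "own")}:
        return "own reads are the likely cause"
    if failed == {("own", "own")}:
        return "only the combination of own MAGs and own reads fails"
    return "inconclusive"


if __name__ == "__main__":
    # Hypothetical outcome: the error follows the MAGs, whatever the reads.
    outcome = {
        ("test", "test"): True,
        ("test", "own"): True,
        ("own", "test"): False,
        ("own", "own"): False,
    }
    print(diagnose(outcome))
```

If the error follows the MAGs regardless of the reads, the MAGs are suspect. If it follows the reads, the reads are suspect. If it appears only when both inputs are your own, neither input alone explains it.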

jampoa commented 1 year ago

That's a great idea, thanks!

ChaoLab commented 1 year ago

I think this will work even if you get 0 read coverage for some or all of your MAGs.

jampoa commented 1 year ago

I ran the test with both combinations and, strangely, I don't get the error anymore (the nutrient cycling diagrams are still not created, though). So something seems to go wrong when my MAGs are combined with my reads. I guess the problem is not related to the tool but to my data.

jampoa commented 1 year ago

Hello, here is a small update in case someone else has the same problem: I resolved the error by re-running the analysis with the updated Perl script, so it was indeed linked to the parsing of the archaeal GTDB-Tk results. Thanks again for the help!