AnantharamanLab / METABOLIC

A scalable high-throughput metabolic and biogeochemical functional trait profiler

Avoiding taxonomic inference - or providing your own #61

Open tenguzame opened 2 years ago

tenguzame commented 2 years ago

Hi everybody, I am trying to run the tool on my 38 MAGs, for which I have already carried out taxonomic inference independently. Is there a way to prevent METABOLIC-C from re-computing taxonomy for my genomes, or to feed it my own data? I don't need a taxonomic breakdown of the metabolic inferences; I'd rather check how my genomes interact within their community. Best regards

ChaoLab commented 2 years ago

Hi, You will need to provide the gtdb-tk result for the downstream visualization steps. Maybe you can suppress gtdb-tk and just make a folder named "gtdbtk_Genome_files" with self-made mocked gtdb-tk taxonomy files.

The gtdb-tk tax files that METABOLIC will use:

../../intermediate_files/gtdbtk_Genome_files/gtdbtk.bac120.summary.tsv
../../intermediate_files/gtdbtk_Genome_files/gtdbtk.ar122.summary.tsv
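A minimal sketch of how such mock files might be written. It assumes that METABOLIC only needs the `user_genome` and `classification` columns of a GTDB-Tk summary, which this thread does not confirm; a real GTDB-Tk summary has more columns, so check what the downstream scripts actually read. The helper name `write_mock_gtdbtk_summary` is hypothetical, and the genome names and lineages you pass in are your own.

```python
from pathlib import Path

# Assumption: only these two GTDB-Tk summary columns are read downstream.
HEADER = ["user_genome", "classification"]


def write_mock_gtdbtk_summary(taxonomy, out_dir):
    """Write mock bacterial and archaeal summary files.

    taxonomy maps a genome name to a GTDB-style lineage string that starts
    with "d__Bacteria" or "d__Archaea". Returns the two file paths.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    targets = {
        "d__Bacteria": out_dir / "gtdbtk.bac120.summary.tsv",
        "d__Archaea": out_dir / "gtdbtk.ar122.summary.tsv",
    }
    rows = {domain: [] for domain in targets}
    for genome, lineage in taxonomy.items():
        domain = lineage.split(";", 1)[0]
        if domain not in rows:
            raise ValueError(f"{genome}: unexpected domain {domain!r}")
        rows[domain].append([genome, lineage])
    for domain, path in targets.items():
        with open(path, "w") as handle:
            handle.write("\t".join(HEADER) + "\n")
            for row in rows[domain]:
                handle.write("\t".join(row) + "\n")
    return targets["d__Bacteria"], targets["d__Archaea"]
```

For example, `write_mock_gtdbtk_summary(my_taxonomy, "intermediate_files/gtdbtk_Genome_files")` would place both files in the folder named above, where `my_taxonomy` is a dict built from your existing classification results.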

tenguzame commented 2 years ago

Thank you for your quick response! So, I could do the following: 1) run METABOLIC-C with the genomes I need; 2) add my taxonomic inferences to the intermediate files; 3) re-run METABOLIC-C. Would that be correct?

ChaoLab commented 2 years ago

Yes, when you do the re-run (step 3 as you listed), you can use this script: https://github.com/AnantharamanLab/METABOLIC/blob/master/METABOLIC-C.2nd_run.pl. See the instructions here: https://github.com/AnantharamanLab/METABOLIC/wiki/METABOLIC-Usage#a-2nd-metabolic-c-run

tenguzame commented 2 years ago

Hey, thanks for the information. I fixed a couple of issues with the dependencies (which weren't fully installed) and managed to reproduce the summary files, so METABOLIC-C now produces most of its outputs. However, the first run didn't produce one specific PDF file from the energy analysis (a "network.plot.pdf"), and the second run appears to look for it. I'll therefore do a new first run (which should now get further) and then add the taxonomic output before the second run. Fingers crossed!

ChaoLab commented 2 years ago

Hi, you can use conda to install all the dependencies much more easily (everything will be fully installed): https://github.com/AnantharamanLab/METABOLIC/wiki/Installation#-quick-installation

tenguzame commented 2 years ago

I know, thank you, that's how I installed it. However, a couple of Perl dependencies were not installed, which led to errors. I have now fixed that by following the thread that the quick Conda installation came from.

tenguzame commented 2 years ago

Just an update: the first step in the pipeline completed but, as expected, it complained about the missing genome taxonomy information. The final metabolic network also appears to need the taxonomic information, so it was not produced. I'm therefore experimenting with the code (I have no coding knowledge, though): I'm commenting out the GTDBtk step and substituting the related input files with the ones I have already generated. If it works, I'm done. I guess this could be turned into a pair of flags, something like:

-taxinfo: how to carry out taxonomic inference; "gtdb" (default) to run the main GTDBtk annotation, "own" to provide your own
-taxinfo_path: when using -taxinfo "own", the /path/to/taxonomic/input, in GTDB-like format

patriciatran commented 2 years ago

Hi @tenguzame , Are you trying to draw the network diagram with the nodes and colors? (Just making sure I understand your issue) (Functional Network slide here):

If so, you can run the following R script on its own in the terminal, without having to run any METABOLIC-C steps:

Rscript draw_functional_network_diagram.R [path to energy flow input] [name of output folder]

Make sure you use the full path to draw_functional_network_diagram.R if needed, and that these R packages are installed: library(ggraph) library(igraph) library(tidyverse) library(tidygraph)

The energy flow input looks something like this. It's a five-column, tab-separated text file. Would changing the "Taxonomic" column to your own annotations help solve your issue?

Energy_flow_input.txt

Edit: If you want to use your own genome coverage values (say METABOLIC-C doesn't work, or you prefer a mapping method other than bowtie2), you'd need to change the values in the "Coverage value(average)" column accordingly as well.
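Editing the taxonomy or coverage column by hand is error-prone for many genomes, so here is a rough sketch of doing it with a script. The function `replace_column` is hypothetical and not part of METABOLIC. The thread does not show the column order of the energy flow input or whether it has a header row, so the zero-based column positions and the `has_header` switch are assumptions you would set after inspecting your own file.

```python
def replace_column(in_path, out_path, key_col, target_col, new_values,
                   has_header=False):
    """Copy a tab-separated file, rewriting one column from a lookup table.

    For each row whose key_col value is a key of new_values, target_col is
    replaced with the mapped value. Column positions are zero-based and are
    assumptions to check against the real file. Returns the number of rows
    that were changed.
    """
    changed = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for line_no, line in enumerate(src):
            fields = line.rstrip("\n").split("\t")
            is_header = has_header and line_no == 0
            if (
                not is_header
                and len(fields) > max(key_col, target_col)
                and fields[key_col] in new_values
            ):
                fields[target_col] = str(new_values[fields[key_col]])
                changed += 1
            dst.write("\t".join(fields) + "\n")
    return changed
```

Setting `key_col` and `target_col` to the same position relabels existing taxa in place. Pointing `target_col` at the coverage column, with `new_values` keyed on whichever column identifies each row, would swap in your own coverage values. The original file is left untouched, so the two can be compared before running the R script.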