Metabolic network diagrams

slaperriere commented 4 years ago

Hello,

Is it possible to construct the metabolic network diagrams at taxonomic levels other than phylum? Such as order, family etc?

Thank you, Sarha

patriciatran commented 4 years ago

Hi @slaperriere ! That is a great idea, thank you for the suggestion.

I'm thinking maybe the default script from METABOLIC-C would make the phylum level plot to make it simpler, but even as of now, all the R script can be run as standalone as well. There, the user could choose add an argument "--level" and choose one of from "phylum", "class", "order", "family", "genus", "specie". They could even run the script a bunch of times if they want plots for all those taxonomic levels.

When you are thinking of lower taxonomic level, would you be interested in for example if you select class :

the script would make a PDF for each phyla, and show all the chosen classes (or orders ,etc.) under that phyla. Therefore you would get multiple PDFs (one for each phyla) OR
the script would make only 1 PDF with all the different chosen classes (or orders, etc.) on one single plot. OR
something else?

Thanks!

Patricia

slaperriere commented 4 years ago

Hi Patricia,

Thank you for your quick response! I can play around with the R script. I would like to make one plot, like the phylum level plot, but with different taxonomic ranks as the columns. For example, if I have a bunch of different SAGs, plot the contribution for each class. Or if I am looking at SAGs all of SAR11, the contribution of the different SAGs.

Thanks! Sarah

patriciatran commented 4 years ago

Hi Sarah @slaperriere ,

Good idea! I will add that to the list of features. Here are some tips for you if you want to play around with the R script to make the plot in your reply:

The file Metabolic_Network_Input.txt that is created by METABOLIC is a data frame that has x rows, and 5 columns. Column 4 is the Taxonomic.Group which is pretty much the phyla. You could modify the table to include 10 columns overall by adding the additional levels of classifications for each MAG (each row) and filling in corresponding taxonomic information(shown in blue here).

The screenshot below are the types of modification you can make to Metabolic_Network_Input.txt to use as an input in R.

Don't save lines 1,2 and 4 on the screenshot though, they are just to make my point.

Then, in the R script, you can use the line 36 to 66 to generate your figure. Remove the # on line 36 and change the path to where your new table is. On line 49 and 55, instead of my_graph <- table[,c(2,3,4,5)], type my_graph <- table[,c(2,3,5,10)]. (Because column 5 is Class, and column 10 is Coverage). On line 59, instead of aes(width = Coverage.value.average., color=as.factor(Taxonomic.Group))), type aes(width = Coverage.value.average., color=as.factor(Class)))

Did I understand your need correctly??

Hope that helps!

Patricia

slaperriere commented 4 years ago

Thank you so much for the help! I will give it a try once it finishes running. Any idea how long it will take for about 600 SAGs and 30 fastq with 30 threads?

Also, does this script generate 'MN-score_result.txt'?

patriciatran commented 4 years ago

You're welcome! It depends on your computer/server specs, but 30 threads would be maybe ~24 hours, but I'm not certain? The rate limiting step would be the mapping of the reads to the SAGs. If you try it out, I'd be curious to hear how long it took.

Are your 30 fastq paired reads for 15 samples? or 30 individual samples? Depending on your research question, you might want to consider splitting your runs to simplify your interpretation on the plots and results.

For the MN-score_results.txt, that is done in METABOLIC-C.pl from Lines 1421 to 1541.

slaperriere commented 4 years ago

It seems to be progressing fairly quickly. Though I have gotten several errors so far along the way. When running:

perl /home/sarah/software/METABOLIC/METABOLIC-C.pl -in-gn /home/sarah/biogeography/GA03_SAGs_nocyano_derep/dereplicated_genomes \
                                                    -o METABOLIC_above \
                                                    -r read_path_GA03_tag_above.csv \
                                                   -t 30

Error 1:

rm: cannot remove 'temp': No such file or directory
Traceback (most recent call last):
  File "/home/sarah/software/METABOLIC/Accessory_scripts/hmmscan-parser-dbCANmeta.py", line 36, in <module>
    with open('temp') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'temp'

Error 2: For all of the reverse reads:

stat: Bad file descriptor
" for reading; skipping...ad file "GA03_above_reads/SRR5788006_fastp_rev.fastq
Error: No input read files were valid
(ERR): bowtie2-align exited with value 1
[E::hts_open_format] Failed to open file "METABOLIC_above/All_gene_collections_mapped.1.sorted.bam" : No such file or directory
samtools index: failed to open "METABOLIC_above/All_gene_collections_mapped.1.sorted.bam": No such file or directory
rm: cannot remove 'METABOLIC_above/All_gene_collections_mapped.1.sam': No such file or directory

my read files looks like

GA03_above_reads/SRR5788006_fastp_fwd.fastq,GA03_above_reads/SRR5788006_fastp_rev.fastq GA03_above_reads/SRR5788012_fastp_fwd.fastq,GA03_above_reads/SRR5788012_fastp_rev.fastq GA03_above_reads/SRR5788013_fastp_fwd.fastq,GA03_above_reads/SRR5788013_fastp_rev.fastq

Sorry for all the questions! The test runs finished with no errors.

ChaoLab commented 4 years ago

Hi Sara, @slaperriere For Error 1, I guess it might be because of the right to create a directory named 'temp' in your server. Can you check it? You could just check this python script solely. For Error 2, I think you need to give the full path of the reads in the read address txt file, like: root/directory1/directory2/GA03_above_reads/SRR5788006_fastp_fwd.fastq,root/directory1/directory2/GA03_above_reads/SRR5788006_fastp_rev.fastq

slaperriere commented 4 years ago

Hello,

Adding the full path for the reads produced the same error.

The error with hmmscan-parser-dbCANmeta.py, it looks like all the output files were generated? I have all the results worksheets. But the error occurs when processing many genomes ~600. The error did not occur when testing 3 genomes.

Thanks, Sarah

ChaoLab commented 4 years ago

1) I guess there are some minor problems when the Perl script wants to use the reads. It could be some trivial things of pointing the read place of reads 2) Is it because that the space where 'temp' was generated is limited?

ChaoLab commented 3 years ago

Hello,

Adding the full path for the reads produced the same error.

The error with hmmscan-parser-dbCANmeta.py, it looks like all the output files were generated? I have all the results worksheets. But the error occurs when processing many genomes ~600. The error did not occur when testing 3 genomes.

Thanks, Sarah

Hi Sarah, for the 'temp' problem, it is because the py script ("hmmscan-parser-dbCANmeta.py") is not changed to do the parallel parsing for multiple MAGs. Here, I attached a new one for replacing. And also, I have updated the GitHub deposition "Accessory_scripts.tgz". Hope this will solve the problem.

"hmmscan-parser-dbCANmeta.py.txt" is attached here (delete "txt" to use). hmmscan-parser-dbCANmeta.py.txt

slaperriere commented 3 years ago

Hello,

Thank you! I will give that a try.

Also, I noticed the metabolic_results.txt file is missing some hits. For example, I annotated some MAGs with emapper, which found K10944+K10945+K10946 for nitrification. However, the METABOLIC_results.txt is saying they are absent. Any idea what could be causing the discrepancies?

Thank you, Sarah

ChaoLab commented 3 years ago

Hi, amoABC/pmoABC are complicated ones. Here in METABOLIC, we used a step trying to distinguish amo and pmo genes. While to accurately classify them, it's better to build phylogenetic trees using both amo and pmo reference genes.

Emapper could have some differences comparing to KEGG Kofam. It is related to the HMMs and their searching settings.

If you are doing a large batch analysis, I think our result with an additional validation step is kind of good to balance the dataset size and annotation accuracy.

All the best! Chao

AnantharamanLab / METABOLIC

Metabolic network diagrams #26