ericcapo / marky-coco

Ready-to-use pipeline to detect, count and identify the hgcAB genes, involved in mercury methylation, from metagenomes
9 stars 1 forks source link

marky-coco

Marky-coco is a ready-to-use pipeline to detect and identify hgcAB genes from raw paired-end and single end fastq files. This pipeline is a collaborative project from researchers of the Meta-Hg working group and is paired to the hgcAB gene catalogue Hg-MATE database. The metagenomes are processed with a suite of sofware: fastp to trim and clean the raw reads, megahit for de-novo assembly, bowtie2 to map the cleaned reads to the de-novo assembly, prodigal to predict protein-coding genes, featureCounts to count the number of reads associated to each gene. Finally workflow/genesearch.sh is a custom bash script allowing to detect hgc gene homologs and extract their features (coverage values, taxonomy, amino acid sequences). In the current version of marky-coco, the script is also providing outputs with detected merA and merB gene homologs (but with no tips yet for manual inspection and taxonomic identification). A step-by-step tutorial is included in the folder tutorial => download the .html and open it on google chrome or firefox. ![Marky-coco_workflow](https://user-images.githubusercontent.com/10795529/213127826-77844383-3a59-41b3-80f6-b7e3ab6b2ae9.png)

INSTALL

git clone https://github.com/ericcapo/marky-coco.git
cd marky-coco
conda env create -f environment.yml

To install conda, read instructions here https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html


BASIC USAGE WITH PAIRED END METAGENOMES


BASIC USAGE WITH SINGLE END METAGENOMES


RUN MARKY-COCO WITH TEST METAGENOMES

Download test files MG01 (paired-end fastq files) and MG02 (single-end fastq file)

wget https://figshare.com/ndownloader/articles/19221213/versions/2
unzip 2


SLURM USAGE

Run the marky_to_slurm file. See how slurm work on your computer/servor here https://blog.ronin.cloud/slurm-intro/


ADVANCED USAGE WITH INTERMEDIATE FILES

Standards input files are fastq files but intermediate files can be used because the pipeline in based on a snakemake structure. For marky-coco to work this way, you would need to put your files in the sample_tmp folder as following:


OUTPUT FILES


RECOVER HGC FROM GENOMES (ISOLATED, SAGS AND MAGS)

"NEW SINCE 30 JUNE 2023To detect the presence of hgc genes in your genomes, you only need the script detect_hgc_from_fna.sh, the db folder of marky-coco and a folder with all your genomes in fna format.

The output files are in the folder outputs_hgcA and outputs_hgcB. Each output file provide you a list hgcB homologs found in each genome of interest. Columns includes the id of the genes, the amino acid sequences, genome_id and gene information (hgcA homolog or hgcB homolog). You still have to confirm if there are true hgcA and hgcB genes using the criteria described in the section below "IMPORTANT NOTES FOR DATA INTERPRETATION". To screen quickly for true genes, you go in the outputs_hgcA folder and then use the code "grep -rine NVWCAAGK" so it will show you which genome include hgcA genes. Try the 6 amino acid motifs : NVWCAAGK, NVWCASGK, NVWCAGGK, NIWCAAGK, NIWCAGGK or NVWCSAGK. Find paired hgcB genes by looking at the gene_id (the last number of your gene_id should be closed to hgcA gene e.g. k141_149_2 and k141_149_3). This script do not provide any information about the abundance/coverage of this gene in the genome, neither a specific taxonomy against Hg-MATE database. This is giving you merely the presence of this gene pair in your genome(s).


IMPORTANT NOTES FOR DATA INTERPRETATION

This software and insights into data interpretation are presented in the paper "A consensus protocol for the recovery of mercury methylation genes from metagenomes" Molecular Ecology Resources doi: 10.1111/1755-0998.13687.

R
> db <- read.table("db/db_txid_220220.txt",h=T)
> a <- read.table("{sample}_outputs/{sample}_hgcA_final.txt", h=T)
> b <- merge(db, a, by="txid", all.x=F, all.y=T)
> write.table(b, file="{sample}_outputs/{sample}_hgcA_final2.txt", sep="\t", row.names=F)
> quit()


METHODS

The detection, counting and taxonomic identification of hgcAB genes was done with marky-coco. The metagenomes were trimmed and cleaned using fastp (Chen et al. 2018) with following parameters: -q 30 -l 25 --detect_adapter_for_pe --trim_poly_g --trim_poly_x. A de novo single assembly approach was applied using the assembler megahit 1.1.2 (Li et al 2016) with default settings. The annotation of the contigs for prokaryotic protein-coding gene prediction was done with the software prodigal 2.6.3 (Hyatt et al 2010). The DNA reads were mapped against the contigs with bowtie2 (Langdmead and Salzberg 2012), and the resulting .sam files were converted to .bam files using samtools 1.9 (Li et al 2009). The .bam files and the prodigal output .gff file were used to estimate read counts by using featureCounts (Liao et al 2014). In order to detect hgc homologs, HMM profiles derived from the Hg-MATE.db.v1 were applied to the amino acid FASTA file generated from each assembly with the function hmmsearch from hmmer 3.2.1 (Finn et al 2011). The reference package ‘hgcA’ from Hg-MATE.db.v1 was used for phylogenetic analysis of the HgcA amino acid sequences. Briefly, amino acid sequences from gene identified as hgcA gene homolog were (i) compiled in a FASTA file, (ii) aligned to Stockholm formatted alignment of hgcA sequences from the reference package with the function hmmalign from hmmer 3.2.1 (iii) placed onto the HgcA reference tree with the function pplacer and (iv) classified using the functions rppr and guppy_classify from the program pplacer (Masten et al. 2010).