Here we present the pipeline for computing novel gene families from the proteomes of a collection of genomes, and how to calculate their genomic context conservation. Alternatively, you may also locate which FESNov gene families are present in your sequences by running eggNOG-mapper (http://eggnog-mapper.embl.de/) on any fasta file, selecting the Novel Families database.
The starting point of this pipeline consist a the concatenated fasta file with the gene predictions for the genomes of interest. The scripts presented here assume that the gene names are formatted as >genome_source_of_isolation@genome_name@gene_name@domain|phylum
.
Run MMseqs2 for calculating the gene families on the concatenated proteomes of the genomes of interest (we used the --min-seq-id 0.3 -c 0.5 --cov-mode 1 --cluster-mode 2 -e 0.001
parameter combination).
We then mapped the gene families against the following reference databases for isolating those exclusive on uncultivated taxa.
EggNOG with eggnog-mapper. We used the following parameter combination: emapper.py -m diamond --itype proteins --no_file_comments -i multifasta.faa -o mappings.tab
. We considered as significant any hit with E-value < 1e-3.
PfamA with the command hmmsearch --tblout mappings.tab /data/Pfam/Pfam-32-A/Pfam-A.hmm multifasta.faa
. We considered as significant any hit with E-value < 1e-5.
All the hmms within the PfamB collection with the command hmmsearch --tblout mappings.tab PfamB.hmm multifasta.faa
. We considered as significant any hit with E-value < 1e-5.
RefSeq with the command diamond blastx -d /data/RefSeq/refseq.dmnd -q multifasta.cds -o results.tab --outfmt 6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qcovhsp scovhsp qlen slen --sensitive
. Hits with an E-value < 1e-3 and query coverage > 50% were considered significant
We considered as gene families exclusive from uncultivated taxa those with no member with significant hits to any of these databases.
For dealineating the unknown gene family predictions of higher quality, we conducted a series of filtering steps, for which we needed to create individual fasta files with the i) CDS sequences and ii) Protein sequences of each gene family.
clustalo -i gene_family_protein_fasta.faa -o gene_family_protein_fasta.alg.faa
After having collected all the protein alignments, domain conservation can be calculated by running:
python calculation_conserved_domain.py paths_algs.txt
We discarded gene families with a conserved domain shorter than 20 residues.
hmmsearch --tblout mappings.tab PVOG.hmm multifasta.faa
We considered hits with E-value < 1e-5 and minimum coverage of 50% as significant, and discarded families with significant hits.
hmmsearch --cut_ga --tblout mappings_antifam.tab AntiFam.hmm multifasta.faa
We discarded families with any hit with E-value < 1e-5.
ete3 build -a gene_family_protein_fasta.faa -n gene_family_CDS_fasta.cds -o output_dir --nt-switch-threshold 0.0 --noimg -w clustalo_default-none-none-none
Later, we ran
-- hyphy BUSTED (http://vision.hyphy.org/) for calculating the dN/dS of each gene family:
hyphy busted --alignment gene_family_CDS_fasta.alg.cds --tree gene_family.nw
-- RNAcode (https://github.com/ViennaRNA/RNAcode) for calculating their coding probability. Before running RNAcode, alignments need to be changed to MAF format, which can be done in python by importing SeqIO from the Bio package (SeqIO.parse(sys.argv[1], "fasta"); SeqIO.write(records, sys.argv[1]+'.maf', "maf")
)
RNAcode gene_family_CDS_fasta.alg.cds.maf -t --stop-early -o RNAcode_out.tab
We only considered in our analysis novel gene families with a conserved domain of at least 20 residues, with no significant homolgy in the pVOGs and Antifam databases, , with dN/dS < 0.5 and with RNAcode p-value < 0.05.
For reconstructing the genomic context of the novel gene families, we followed the subsequent steps:
python genes_per_contig.py paths_gffs.txt > neighs_per_contig.tab
The paths_gffs.txt
file contains the gff files of all the genomes with the gene prediction information, for instance:
Chip-388_95C1R_METABAT_1 prokka gene 85 783 . + . ID=Chip-388_95C1R_METABAT_00001_gene;Name=tssA_1;gene=tssA_1;locus_tag=Chip-388_95C1R_METABAT_00001
Chip-388_95C1R_METABAT_1 prokka mRNA 85 783 . + . ID=Chip-388_95C1R_METABAT_00001_mRNA;Name=tssA_1;gene=tssA_1;locus_tag=Chip-388_95C1R_METABAT_00001
Chip-388_95C1R_METABAT_1 Prodigal:002006 CDS 85 783 . + 0 ID=Chip-388_95C1R_METABAT_00001;Parent=Chip-388_95C1R_METABAT_00001_gene,Chip-388_95C1R_METABAT_00001_mRNA;eC_number=2.8.1.1;Name=tssA_1;db_xref=COG:COG2897;gene=tssA_1;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:D4GYM0;locus_tag=Chip-388_95C1R_METABAT_00001;product=Putative thiosulfate sulfurtransferase;protein_id=gnl|DEEM|Chip-388_95C1R_METABAT_00001
Upload the neighs_per_contig.tab
data to a Mongo (https://www.mongodb.com/) collection with python neigh2json.py neighs_per_contig.tab | mongoimport --host HOST_NAME -d DATABASE_NAME -c neighs --drop
.
Create a Mongo collection with the functional annotation of the neighbors of the members of each gene family: python emapper2json.py eggnogmapper_out.tab | mongoimport --host HOST_NAME -d DATABASE_NAME -c emapper2 --drop (emapper-2.1.5 output)
.
It is worth noting that creating Mongo databases may not be needed if working with a relatively low number of genomes. Then, the genomic context and gene functional information may be directly loaded into memory.
Calculate genomic context conservation:
1) Precompute:
python neigh_cons_score.py gene_family_composition.tab > scores.tab
. The gene_family_composition.tab
file is a tab delimeted file with 3 columns: the gene family name, number of members and coma-separated list of members. Then run python genomic_context_conservation_table.py scores.tab > final_scores.tab
for generating a final report of genomic context conservation per gene family. The fields included in the final table are:
gebe_family_name
db (database from which the functional term comes from)
functional_term
relative_position (genomic position relative to the gene family members, e.g. -1 means gene immediate before the gene family members)
vertical score (% of neighboring genes in the position specified annnotated to the functional_term specified)
% of neighbors in the position & with the functional_term in contrary strand to the gene family members. For each position and functional term for which a score is provided, proportion of gene neighbors that are in contrary orientations than the gene family members. This informs of whether neighbors are in the same strand as the gene family members or not. 1 means that all neigbors in the position specified & showing the functional term are in contrary orientation.
number of neighbors in contrary strand in between the novel genes and the genes in the position & with the functional_term. In case the position is not adjaccent to the gene family members (positions +-1), it may be the case that there are genes in between the gene family members and the genes in the position reported that are in contrary strand and break the operon structure. This field informs about this.
% genes separated more than 100nts in between the novel gene family members and the genes in the position & with the functional_term, description of the functional term. This field represents the number of genes in between the target position and the gene family members which are more than 100nts away. Higher values indicate that separation between genes is generally higher than 100nts.
This is how the table looks like:
GTDBiso@GB_GCA_003141455@PLMA01000055.1_15@d__Bacteria|p__Chloroflexota NOVPT00U og COG0627 -3 0.14285714285714285 0.0 0.0 0.0 Serine hydrolase involved in the detoxification of formaldehyde
GTDBiso@GB_GCA_003141455@PLMA01000055.1_15@d__Bacteria|p__Chloroflexota NOVPT00U kpath 01130 2 0.7142857142857143 0.0 0.0 0.0
GTDBiso@GB_GCA_003141695@PLNE01000019.1_4@d__Bacteria|p__Dormibacterota NOV6H718 kpath 00020 -3 0.375 1.0 1.0 1.0 Citrate cycle (TCA cycle)
The genomic_context_confidence.py
code was used to estimate the confidence of the KEGG pathway functional assignations based on genomic context, by measuring how different genomic architectures could correctly predict pathways on known function genes. The script reads from the final_scores.tab
table, and also needs a table with confident KEGG pathway annotations per gene family (Fields: family_name,db from where the functional annotation come from,number of gene family members,funct_annot,n_genes_annot_as_funct_annotated_to_the_funct_annot_specicied,funct_annot_desc
). For each of these gene families, the script tests whether different genomic context conservation thresholds correcly predict the original KEGG pathway of the gene family.
For running this step, you need to compute the taxonomy of the genomes with GTDB-tk. For calculating the taxonomic coverage and specificity for each gene family on each taxonomic group, we used:
python tax_cov_sp.py genome_tax_annotation.tab gene_family_composition.tab > sp_cov_per_fam_per_lin.tab.
. The genome_tax_annotation.tab
file is a tab-delimeted file with 2 columns: genome name and their GTDB (https://gtdb.ecogenomic.org/) taxonomic annotations. sp_cov_per_fam_per_lin.tab
is a tab-delimted file containing the following columns: gene family name, GTDB lineage, specificity, coverage, number of genomes within the GTDB lineage in the collection.
For testing whether novel families could discriminate between conditions (i.e. control and colorectal cancer samples), we used the CRC prediction.R
, which reads an abundance table, with the abundance of each gene family in each sample, and calculates their prediction power using logistic models and machine learning algorithms.
For detecting novel gene families over or under represented in particular conditions, we ran the CRC_DA_analysis.r
code, which reads a gene family abundance table per sample, and tests for differential abundance by calculating p-values by Wilcoxon test adjusted by the FDR method for multiple hypothesis testing.