Refining bins using the coverage of individual genes across samples

evanroyrees commented 4 years ago

Tasks

all PCGs BLAST all-vs-all search - diamond.py
gene sets construction
read alignments to respective sample contigs specific to gene coordinates
module detection of coverage-correlated genes (hierarchical clustering)
bin refinement
species pangenomic gene inventory
read reassembly from pangenome set alignments
genome rearrangement detection
sequence variant detection
species haplotype detection (differential coverage analysis)
Exception handling if poor coverage correlations are prevalent
subset modules prior to clustering

Expectations and Approach

Avoid assumptions on the homogeneity and presence of species within the microbiome
Instead of co-expressed genes, look for genes with similar coverage across the sample set.
Identify pairs of genes with >95% identity.
- Collect the predicted protein-coding genes within both binned and unbinned contigs and do an all-vs-all nucleotide BLAST search. (This approach will identify groups of species-level gene homologs across samples that we will use in subsequent analysis.)
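A minimal sketch of the identity filter described above, assuming the search results are in the standard 12-column BLAST tabular format (query ID, subject ID, percent identity, ...). The function name `filter_hits` and the example ORF IDs are hypothetical.

```python
import csv
import io


def filter_hits(handle, min_identity=95.0):
    """Yield (query, subject, pident) for non-self hits above min_identity.

    Assumes tab-separated BLAST tabular output where the first three
    columns are qseqid, sseqid and pident.
    """
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        qseqid, sseqid, pident = row[0], row[1], float(row[2])
        if qseqid == sseqid:
            continue  # drop self-hits
        if pident > min_identity:
            yield qseqid, sseqid, pident


# Hypothetical example table: one self-hit, one hit at 98.5%, one at 91%.
example = io.StringIO(
    "s1_orf1\ts1_orf1\t100.0\t300\t0\t0\t1\t300\t1\t300\t0.0\t550\n"
    "s1_orf1\ts2_orf7\t98.5\t300\t4\t0\t1\t300\t1\t300\t1e-150\t520\n"
    "s1_orf2\ts3_orf4\t91.0\t280\t25\t0\t1\t280\t1\t280\t1e-90\t350\n"
)
hits = list(filter_hits(example))
```

Only the second row survives the filter here, since the first is a self-hit and the third falls below the 95% threshold.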
Calculate coverages of each of these gene homolog groups.
- Align reads to gene coordinates within the respective sample’s contigs, counting only reads >95% identity to the reference sequence.
- Normalize coverage for sequencing depth and gene length by calculating RPKM (reads per kilobase of gene per million mapped reads).
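The RPKM normalization can be sketched as follows, assuming per-gene read counts and the total number of mapped reads per sample are already available. The function name `rpkm` and its parameter names are hypothetical.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads."""
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    # Equivalent to read_count / (gene_length_bp / 1e3) / (total_mapped_reads / 1e6)
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# 500 reads on a 1 kb gene in a sample with 10 million mapped reads
value = rpkm(500, 1000, 10_000_000)
```

With these numbers the result is 50 RPKM: 500 reads per kilobase, divided by 10 million mapped reads expressed in millions.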
Identify modules of coverage-correlated genes
- Use a variation on weighted gene co-expression network analysis (WGCNA).
  - Pearson correlation of pairs of gene groups will be calculated.
  - Negative correlations will be assumed to be irrelevant in this context and discarded
  - Positive correlations will be kept
  - Correlations will only be calculated for samples where both genes occur (i.e. zero values are ignored)
- Calculate adjacency from Pearson correlations using the power adjacency function
- Determine topological overlap matrix dissimilarity using the calculated adjacency score
- Hierarchical clustering of the dissimilarity measures with average linkage (or Ward D or D2; we will need to contemplate regularization in this context)
- Calibrate module detection, i.e. setting the cutoff....
  - This step is the most important and perhaps most difficult
  - Could possibly use pvclust here since the statistically significant clustering is performed with bootstrapping already baked in.
  - Will require use of connectivity information (statistical metrics within the linkage array)
- The identified modules will then be compared to the bins in respective metagenomes, and split or merged as necessary.
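The correlation, adjacency, and topological overlap steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the planned implementation: the soft-threshold power `beta=6`, the minimum of three shared samples, and all function names are hypothetical choices. The resulting dissimilarity matrix would then be passed to a hierarchical clustering routine.

```python
from math import sqrt


def pearson_shared(x, y, min_shared=3):
    """Pearson r over samples where both genes have non-zero coverage."""
    pairs = [(a, b) for a, b in zip(x, y) if a > 0 and b > 0]
    n = len(pairs)
    if n < min_shared:
        return 0.0  # assumption: too few shared samples to estimate r
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    if sxx == 0 or syy == 0:
        return 0.0  # constant coverage carries no correlation signal
    return sxy / sqrt(sxx * syy)


def adjacency_matrix(coverage, beta=6):
    """Power adjacency a_ij = max(r_ij, 0) ** beta, with a zero diagonal."""
    n = len(coverage)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            r = max(pearson_shared(coverage[i], coverage[j]), 0.0)
            adj[i][j] = adj[j][i] = r ** beta
    return adj


def tom_dissimilarity(adj):
    """Return 1 - TOM, where TOM_ij = (l_ij + a_ij) / (min(k_i, k_j) + 1 - a_ij)."""
    n = len(adj)
    k = [sum(row) for row in adj]  # connectivity (diagonal is zero)
    diss = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            shared = sum(adj[i][u] * adj[u][j] for u in range(n) if u not in (i, j))
            tom = (shared + adj[i][j]) / (min(k[i], k[j]) + 1.0 - adj[i][j])
            diss[i][j] = diss[j][i] = 1.0 - tom
    return diss


# Hypothetical RPKM values: rows are gene groups, columns are samples.
coverage = [
    [10, 20, 30, 40],
    [11, 19, 31, 42],  # tracks gene group 0
    [40, 30, 20, 10],  # anti-correlated with gene group 0
    [0, 21, 29, 41],   # absent from the first sample
]
adj = adjacency_matrix(coverage)
diss = tom_dissimilarity(adj)
```

In this toy input, gene groups 0 and 1 end up close together, while the anti-correlated group 2 has zero adjacency to every other group and therefore maximal dissimilarity.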
Collect a pangenomic set of genes for each species present in the sample set
Construct an accurate picture of the pangenome in the sample set
- Yield a complete picture of the variance in gene inventories for particular species in the samples.
- Use genes not found in specific samples as references.
- (This will provide easy access to display and assess variable genome regions in microbial ecology).
Obtain better quality assemblies
- reassemble reads from single samples that align to specific pangenomes
Characterize strain heterogeneity
- Detect genome rearrangements
  - use our identified gene sets
  - use connectivity information in multiple metagenomic assemblies
- detect sequence variants within samples
  - quantification of heterogeneity as a function of environmental conditions under study
Reconstruct variant haplotypes
- Determine abundance variation across the datasets
Examine strain heterogeneity in single samples before coverage clustering, and then use a mixture model to correct coverage correlation and clustering
- Only necessary if poor coverage correlation is prevalent amongst results for test datasets.
We could also determine which subsets of gene groups need to be clustered.
- Build a network of all genes in a sample set that are related either by identity (intersample) or by binning/contig assembly (intrasample).
- Identify discrete subsets to cluster
  - A breadth first search could be used to identify all reachable nodes from all genes.
- If this results in bin fragments not being merged, we would identify likely similar bins to co-cluster (for example, on the basis of taxonomy or nucleotide composition similarity).
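The breadth-first search over the gene network can be sketched as below, assuming the network is given as a list of undirected edges (identity links between samples, plus bin or contig links within a sample). The function name `connected_subsets` and the gene IDs are hypothetical.

```python
from collections import defaultdict, deque


def connected_subsets(edges):
    """Group genes into discrete subsets using breadth-first search."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)

    seen = set()
    subsets = []
    for start in sorted(graph):
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = set()
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        subsets.append(component)
    return subsets


# Hypothetical edges: identity links across samples and contig links within a sample.
edges = [
    ("s1_orf1", "s2_orf7"),  # intersample: >95% identity
    ("s1_orf1", "s1_orf2"),  # intrasample: same contig/bin
    ("s3_orf4", "s3_orf5"),  # intrasample: same contig/bin
]
subsets = connected_subsets(edges)
```

Each returned set is a group of genes reachable from one another, so each could be clustered independently of the others.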

Evaluation Datasets:

Several new mixtures with defined molar ratios of DNA from the 51 bacteria that were separately extracted.
Real environmental metagenomes from the TARA oceans project that were used in the previous assembly of several Prochlorococcus strains
Real complex soil samples (see Aim 3)
Simulated data of rare species and genes for assessing module detection accuracy.
- devise criteria to decide whether to keep binned genes if coverage is incongruent

chasemc commented 3 years ago

In our last meeting we decided to work off of dev since it is so far ahead of the main branch.

I've created a branch nsf-2 for this project: https://github.com/KwanLab/Autometa/tree/nsf-2

We also decided that the end goal would be an implementation of this Aim as its own Nextflow module (with process logic contained in Autometa Python endpoints) that also interfaces within the larger Autometa Nextflow pipeline. Simple external software like all-v-all Diamond BLAST will be wrapped in Nextflow only.

evanroyrees commented 3 years ago

I think maybe we should keep this branch off of KwanLab and have it on our own forked repos. This way we do not confuse any end-users. Otherwise we can push this branch as a PR to the KwanLab when ready. We've been using the article by Vincent Driessen for reference.

git branching model

Upon revisiting, I think nsf-2 is appropriate, I'm just worried about confusing the end-users. Although maybe this is best?

chasemc commented 3 years ago

As a note, there has been some offline discussion about this: https://github.com/KwanLab/Autometa/issues/13#issuecomment-800423310

I would say work off KwanLab/Autometa@nsf-2. I'm not sure why a non-"main" branch would confuse end-users, especially since it follows the paradigm in the image?

chasemc commented 3 years ago

Notes from initial pseudo-code session: First test data -> MIX51-EQUAL

{nextflow pseudo-code}


Channel -> concatenate all orf fastas
    Keep track of which orf belongs to which contig and metagenome

Process create_blast_database {
    Input:
        all orfs from every sample
}

Process all-v-all-blast {
    Input:
        Orf to contig to metagenome database/table/dictionary
        all orfs from every sample
        Blast database
    Output:
        Filtered BLAST table
            Filter self-hit (is this a setting in diamond?)
            (should we limit to only results from different metagenome samples?)
}

Process identify_gene_homologs {
    Input:
        BLAST table
    Output:
        Clusters of orfs
            Intrasample hits -> orf to orf within the same sample
            Intersample hits -> sample x orf to hit to sample y orf
}

Process calculating_orf_coverage {
    Input:
        Filtered clusters of orfs
        Contigs containing those orfs
        Reads
    Output:
        Read alignments to orfs/contigs + counts
}

Process cluster_based_on_coverage {
    Input:
        Read alignments to orfs/contigs + counts
        Val metagenome_depth (for normalizing coverage)
    Output:
    Script: "step 2 in aim 2"
}
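One possible shape for the "orf to contig to metagenome" table and the intrasample/intersample split is sketched below. It assumes Prodigal-style ORF IDs of the form `<contig>_<orf number>` and a hypothetical `sample|orf` key; neither is confirmed by the notes above, and all function names are illustrative.

```python
def orf_to_contig(orf_id):
    """Assumes Prodigal-style IDs: '<contig>_<orf number>'."""
    return orf_id.rsplit("_", 1)[0]


def build_lookup(orfs_by_sample):
    """Map a 'sample|orf' key to its sample (metagenome) and contig."""
    lookup = {}
    for sample, orf_ids in orfs_by_sample.items():
        for orf_id in orf_ids:
            lookup[f"{sample}|{orf_id}"] = {
                "sample": sample,
                "contig": orf_to_contig(orf_id),
            }
    return lookup


def classify_hit(query, subject, lookup):
    """Label a hit as intrasample or intersample using the lookup table."""
    same_sample = lookup[query]["sample"] == lookup[subject]["sample"]
    return "intrasample" if same_sample else "intersample"


# Hypothetical ORF IDs from two metagenomes.
lookup = build_lookup({
    "mg1": ["NODE_1_1", "NODE_1_2"],
    "mg2": ["NODE_9_3"],
})
within = classify_hit("mg1|NODE_1_1", "mg1|NODE_1_2", lookup)
between = classify_hit("mg1|NODE_1_1", "mg2|NODE_9_3", lookup)
```

Prefixing each ORF ID with its sample keeps IDs unique after the ORF fastas are concatenated, which is one way to keep track of which ORF belongs to which metagenome.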

chasemc commented 3 years ago

@WiscEvan Is there a good way to get ORF-level coverage? If so are we going to if/else whether input is x/y; or require input file(s) = type x?

chasemc commented 3 years ago

Answered my own question #1, but we will have to decide on the second question
