KwanLab / Autometa

Autometa: Automated Extraction of Genomes from Shotgun Metagenomes
https://autometa.readthedocs.io

Refining bins using the coverage of individual genes across samples #13

Open evanroyrees opened 4 years ago

evanroyrees commented 4 years ago

Tasks

  1. all PCGs BLAST all-vs-all search - diamond.py
  2. gene sets construction
  3. read alignments to respective sample contigs specific to gene coordinates
  4. module detection of coverage-correlated genes (hierarchical clustering)
  5. bin refinement
  6. species pangenomic gene inventory
  7. read reassembly from pangenome set alignments
  8. genome rearrangement detection
  9. sequence variant detection
  10. species haplotype detection (differential coverage analysis)
  11. Exception handling: if poor coverage correlations are prevalent, subset modules prior to clustering
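
Task 4 (module detection of coverage-correlated genes) could be prototyped along the following lines. This is a minimal sketch, not Autometa's implementation: the function names, the `min_corr` threshold of 0.9, and the example coverage values are all assumptions. It groups genes whose cross-sample coverage profiles are linked by a Pearson correlation at or above the threshold, which matches cutting a single-linkage hierarchical clustering at distance `1 - min_corr`.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length coverage profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # flat profile: treat as uncorrelated
    return cov / sqrt(var_x * var_y)


def coverage_modules(profiles, min_corr=0.9):
    """Group genes linked by coverage correlations >= min_corr.

    profiles maps a gene ID to its coverage in each sample (same sample order).
    """
    parent = {gene: gene for gene in profiles}

    def find(gene):
        # Union-find lookup with path halving.
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]
            gene = parent[gene]
        return gene

    for a, b in combinations(profiles, 2):
        if pearson(profiles[a], profiles[b]) >= min_corr:
            parent[find(a)] = find(b)

    modules = {}
    for gene in profiles:
        modules.setdefault(find(gene), set()).add(gene)
    return sorted(modules.values(), key=lambda module: sorted(module))


# Hypothetical gene coverages across four samples (illustrative values only).
profiles = {
    "geneA": [10.0, 20.0, 30.0, 40.0],
    "geneB": [11.0, 19.0, 31.0, 42.0],
    "geneC": [40.0, 30.0, 20.0, 10.0],
    "geneD": [39.0, 31.0, 18.0, 11.0],
}
modules = coverage_modules(profiles)
```

Here `geneA`/`geneB` rise together across samples and `geneC`/`geneD` fall together, so they end up in two separate modules. The all-pairs loop is quadratic in the number of genes, so a real dataset would likely need a different strategy.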

Expectations and Approach

Evaluation Datasets:

  1. Several new mixtures with defined molar ratios of DNA from the 51 bacteria that were separately extracted.
  2. Real environmental metagenomes from the TARA oceans project that were used in the previous assembly of several Prochlorococcus strains
  3. Real complex soil samples (see Aim 3)
  4. Simulated data of rare species and genes for assessing module detection accuracy.
    • devise criteria to decide whether to keep binned genes if coverage is incongruent

chasemc commented 3 years ago

In our last meeting we decided to work off of dev since it is so far ahead of the main branch.

I've created a branch nsf-2 for this project: https://github.com/KwanLab/Autometa/tree/nsf-2

We also decided that the end goal would be an implementation of this Aim as its own Nextflow module (with process logic contained in Autometa Python endpoints) that also interfaces within the larger Autometa Nextflow pipeline. Simple external software like all-v-all Diamond BLAST will be wrapped in Nextflow only.

evanroyrees commented 3 years ago

I think maybe we should keep this branch off of KwanLab and have it on our own forked repos. This way we do not confuse any end-users. Otherwise we can push this branch as a PR to KwanLab when ready. We've been using the article by Vincent Driessen for reference.

git branching model

Upon revisiting, I think nsf-2 is appropriate; I'm just worried about confusing the end-users. Although maybe this is best?

chasemc commented 3 years ago

As a note, there has been some offline discussion about this. https://github.com/KwanLab/Autometa/issues/13#issuecomment-800423310

I would say work off KwanLab/Autometa@nsf-2. I'm not sure why a non-"main" branch would confuse end-users, especially since it follows the paradigm in the image?

chasemc commented 3 years ago

Notes from initial pseudo-code session: First test data -> MIX51-EQUAL

{nextflow pseudo-code}


Channel -> concatenate all orf fastas
    Keep track of which orf belongs to which contig and metagenome

Process create_blast_database {
    Input:
        all orfs from every sample
}

Process all-v-all-blast {
    Input:
        Orf to contig to metagenome database/table/dictionary
        all orfs from every sample
        Blast database
    Output:
        Filtered BLAST table
            Filter self-hit (is this a setting in diamond?)
            (should we limit to only results from different metagenome samples?)
}

Process identify_gene_homologs {
    Input:
        BLAST table
    Output:
        Clusters of orfs
            Intrasample hits -> orf to orf within the same sample
            Intersample hits -> sample x orf to hit to sample y orf
}

Process calculating_orf_coverage {
    Input:
        Filtered clusters of orfs
        Contigs containing those orfs
        Reads
    Output:
        Read alignments to orfs/contigs + counts
}

Process cluster_based_on_coverage {
    Input:
        Read alignments to orfs/contigs + counts
        Val metagenome_depth (for normalizing coverage)
    Output:
    Script: "step 2 in aim 2"
}
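
The filtering and homolog-identification steps above might look roughly like this as a Python endpoint. It is a sketch under several assumptions: the BLAST table uses the default 12-column tabular layout (query, subject, ..., e-value in column 11, bitscore in column 12), the `orf_to_sample` dictionary and the `row` helper are hypothetical, the `max_evalue` cutoff is arbitrary, and gene sets are taken to be connected components of the hit graph, which is only one of several possible definitions.

```python
def parse_hits(lines, max_evalue=1e-5):
    """Yield (query, subject) pairs, dropping self-hits and weak hits."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        if query != subject and evalue <= max_evalue:
            yield query, subject


def classify_hits(hits, orf_to_sample):
    """Split hits into intrasample and intersample lists."""
    intra, inter = [], []
    for query, subject in hits:
        same_sample = orf_to_sample[query] == orf_to_sample[subject]
        (intra if same_sample else inter).append((query, subject))
    return intra, inter


def gene_sets(hits):
    """Return connected components of the hit graph as candidate gene sets."""
    neighbors = {}
    for query, subject in hits:
        neighbors.setdefault(query, set()).add(subject)
        neighbors.setdefault(subject, set()).add(query)
    seen, components = set(), []
    for orf in sorted(neighbors):
        if orf in seen:
            continue
        stack, component = [orf], set()
        while stack:  # depth-first traversal from this ORF
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbors[node] - component)
        seen |= component
        components.append(component)
    return components


def row(query, subject, evalue):
    # Hypothetical helper: fills the remaining columns with placeholder values.
    return "\t".join([query, subject, "95.0", "300", "0", "0",
                      "1", "300", "1", "300", str(evalue), "500"])


table = [
    row("s1_orf1", "s1_orf1", 0.0),    # self-hit, removed
    row("s1_orf1", "s2_orf7", 1e-50),  # intersample hit
    row("s2_orf7", "s1_orf1", 1e-50),  # reciprocal intersample hit
    row("s1_orf2", "s1_orf3", 1e-30),  # intrasample hit
    row("s1_orf4", "s2_orf9", 0.5),    # weak hit, removed
]
orf_to_sample = {"s1_orf1": "s1", "s1_orf2": "s1", "s1_orf3": "s1",
                 "s1_orf4": "s1", "s2_orf7": "s2", "s2_orf9": "s2"}

hits = list(parse_hits(table))
intra, inter = classify_hits(hits, orf_to_sample)
sets_of_genes = gene_sets(hits)
```

Self-hits are dropped here during parsing. Whether DIAMOND can do this itself, and whether results should be restricted to intersample hits, are the open questions noted in the pseudo-code.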

chasemc commented 3 years ago

@WiscEvan Is there a good way to get ORF-level coverage? If so, are we going to branch (if/else) on whether the input is type x or y, or require input file(s) of type x?

chasemc commented 3 years ago

Answered my own first question, but we will still have to decide on the second question.
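
How ORF-level coverage was obtained is not recorded in this thread. One possible approach, sketched here purely as an assumption, is to average per-base read depth over each ORF's coordinates on its contig. The function name, the 1-based inclusive coordinate convention, and the depth values are hypothetical.

```python
def orf_coverage(depths, start, end):
    """Mean read depth over an ORF.

    depths: per-base read depth along the contig (index 0 is position 1).
    start, end: 1-based inclusive ORF coordinates on that contig.
    """
    if not 1 <= start <= end <= len(depths):
        raise ValueError("ORF coordinates fall outside the contig")
    window = depths[start - 1:end]
    return sum(window) / len(window)


# Hypothetical 10 bp contig with an ORF spanning positions 3-6.
contig_depths = [2, 2, 10, 12, 14, 12, 3, 3, 1, 1]
coverage = orf_coverage(contig_depths, 3, 6)
```

A per-ORF value like this could then be divided by `metagenome_depth` for the normalization mentioned in the pseudo-code, although the exact normalization is not specified there.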