dib-lab / charcoal

Remove contaminated contigs from genomes using k-mers and taxonomies.
Other
52 stars 1 forks source link

Add more specific narrative of what charcoal does #87

Open taylorreiter opened 4 years ago

taylorreiter commented 4 years ago

Still needs work, esp around talking about what sourmash is actually doing. But first pass!

Charcoal identifies and removes contamination in metagenome-assembled genomes using k-mer based methods.

  1. Identify plurality lineage a. Identify majority lineage using sourmash against database
    • Generate a signature for the entire genome with sourmash
    • run sourmash gather against a database. Collect signatures for all of the genomes from the database that are present in the signature. Requires only one hash overlap to be considered a match.
    • Read in all of the genome matches from gather. Assign taxonomy and identity to each of the signatures.
    • For each contig in the MAG, read it in, count its basepairs, and add it to the total count of contigs. Create a scaled minhash sketch of the genome and record how many hashes are in the sketch.
    • Determine lineage of the genome from plurality vote on the genome-specific LCA (get_majority_lca_at_rank).
      • Create a dictionary of hash:rank mapping (gather_assignments).
      • count number of times each rank is observed in the hashes, and find the plurality.
      • define the fraction of identifiable hashes (f_ident); e.g. the number of hashes from the entire genome that matched to any genome in the database.
      • define the fraction of identifiable hashes that are anchored to the plurality rank (f_major).
      • If fewer than 10% of hashes are identifiable, request that the user provide a lineage for the genome. If fewer than 20% of identifiable hashes anchor to the major lineage, request that the user provide a lineage for the genome. b. Try user-defined lineage.
      • If the user defines a lineage, report whether the provided lineage agrees with k-mer classification at or above the genus level.
  2. Remove contaminants that do not have the same rank as the majority lineage. a. for each contig, create a minhash from it. b. Check whether it contains greater than or equal to the minimum number of hashes required for a gather match (can be as low as 1). If it does not, report the contig as "missing" and assign it as clean. c. If >= minimum number of hashes required for a gather match check whether the contig is clean or dirty.
    • REASON 1: determine whether there are sequences in the contig that belong to a genome in a different lineage. Run sourmash gather against the LCA database created in step 1. Get the lineage for the match. If the match it inside of the specified rank, the contig is clean. If it is outside of the specified rank, it is dirty.
    • REASON 2: determine whether there are sequences in the contig that belong to multiple lineages, e.g. arising from lateral gene transfer (including phage and plasmid), mistakes in taxonomy, highly conserved sequences, or other events that lead to shared sequences across taxonomically distinct lineages.).
    • REASON 3: determine whether there are sequences in the contig that have majority LCA assignments outside of the match rank
  3. Report the gather breakdown of the clean contigs, and report the size and match lineage of the primary gather match to these clean contigs.
ctb commented 4 years ago

is this for the documentation, or for a paper, or "just" for understanding? I'm happy to help take it in any of those directions :)

ctb commented 4 years ago

I read through it ✅

taylorreiter commented 4 years ago

I went to start writing up what charcoal does, and realized that I didn't know exactly how it was doing what it does. So I read the code and wrote down what each step does. I don't think this necessary belongs in the documentation, but it did help me understand how charcoal works and so I thought it could be helpful to put it somewhere where others can see it, in case they don't want to take the time to read the code.

ctb commented 4 years ago

Got it! Random thoughts --