KwanLab / Autometa

Autometa: Automated Extraction of Genomes from Shotgun Metagenomes
https://autometa.readthedocs.io

Animations in methods documentation #172

Open evanroyrees opened 3 years ago

evanroyrees commented 3 years ago

Autometa Methods Documentation Video Series

This is a series of videos with voiceovers describing Autometa methods, with visual aids generated by manim animations.

📝 🎬 🔊 🎨 Scripts / Scenes / Animations / Voiceovers 📝 🎬 🔊 🎨

Video Overview

- [ ] Video 1 - Length filtering
- [ ] Video 2 - Coverage calculation
- [ ] Video 3 - ORF calling
- [ ] Video 4 - Marker annotation
- [ ] Video 5 - Taxon assignment
- [ ] Video 6 - K-mer counting
- [ ] Video 7 - K-mer embedding
- [ ] Video 8 - 3 dimensions of clustering features
- [ ] Video 9 - Binning with recursive DBSCAN
- [ ] Video 10 - Unclustered recruitment


NOTE: Manim has been forked and is being maintained in two separate repositories.

From the ManimCommunity/manim repository:

This fork is updated more frequently than his [Grant Sanderson's original repository], and it's recommended to use this fork if you'd like to use Manim for your own projects.

Example Scenes from Professor Jason Kwan's ASP presentation

jason-c-kwan commented 3 years ago

I've started a new animations repo at https://github.com/jason-c-kwan/Autometa_animations. Can we make the above list into a checklist?

So far I've made a sort of logo animation that can go at the beginning of each video.
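
For anyone following along, below is a minimal sketch of what such an intro scene could look like with the ManimCommunity fork. The title text, font size, and timings are placeholders, not the actual logo animation in jason-c-kwan/Autometa_animations:

```python
# logo_scene.py -- illustrative only; the real logo animation lives in
# jason-c-kwan/Autometa_animations. Render with: manim -pql logo_scene.py LogoScene
from manim import FadeOut, Scene, Text, Write


class LogoScene(Scene):
    def construct(self):
        # Placeholder title standing in for the Autometa logo
        title = Text("Autometa", font_size=96)
        self.play(Write(title))   # draw the title stroke by stroke
        self.wait(1)              # hold on screen for a second
        self.play(FadeOut(title)) # fade out before the video content begins
```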

jason-c-kwan commented 3 years ago

For the first video, I think I will show the BH-tSNE graph for the same dataset as we change the length cutoff, while coloring the points based on the ground truth. @WiscEvan can you write instructions here on how I would use the Autometa entrypoints to basically do K-mer counting on all contigs, then do normalization and BH-tSNE on different subsets? I am thinking I could programmatically chop up an internal Pandas table, but I just need a quick reminder of how to incorporate the BH-tSNE part into the script.

evanroyrees commented 3 years ago

k-mer counting

Subset k-mers, then normalize, embed, and write each embedding to a sample-size filepath:

```python
#!/usr/bin/env python
# Save to subset_and_embed_counts.py
import argparse
import os

import pandas as pd

from autometa.common import kmers


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", help="file path to kmer counts table", required=True)
    parser.add_argument("--output", help="directory path to store sample size embeddings", required=True)
    args = parser.parse_args()
    # Read in the k-mer counts table (i.e. counts.tsv) indexed by contig
    df = pd.read_csv(args.input, sep="\t", index_col="contig")
    # Make sure the output directory exists before writing embeddings
    os.makedirs(args.output, exist_ok=True)

    # Subsample by the number of contigs specified
    sample_sizes = [100, 200, 400, 800, 1000, 5000, 10000]
    for sample_size in sample_sizes:
        counts_subset = df.sample(n=sample_size)
        norm_df = kmers.normalize(counts_subset, method="am_clr")
        # Embed the normalized counts and write to the sample-size filepath
        sample_embed_filepath = os.path.join(args.output, f"kmers.sample_size_{sample_size}.embedded.tsv")
        embedded_df = kmers.embed(
            kmers=norm_df,
            out=sample_embed_filepath,
            pca_dimensions=50,
            method="bhsne",
            embed_dimensions=2,
        )
        print(f"Wrote sample size embedding to {sample_embed_filepath}")


if __name__ == "__main__":
    main()
```

Compute counts, then subset and write embeddings:

```bash
# Set filepaths and parameters
fasta="metagenome.fna"
kmers="counts.tsv"
outdir="path/to/store/embeddings"  # directory to write the sample-size embeddings
size=5
cpus=2

# Compute counts
autometa-kmers --fasta "$fasta" --kmers "$kmers" --size "$size" --cpus "$cpus"

# Subset and write embeddings
python subset_and_embed_counts.py --input "$kmers" --output "$outdir"
```
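
If the goal is to vary the length cutoff rather than the sample size (as proposed for video 1 above), a hypothetical variant of the script could subset the counts table to contigs at or above each cutoff before normalizing and embedding. The `get_lengths` helper, filepaths, and cutoff values below are illustrative, not part of Autometa:

```python
#!/usr/bin/env python
# Illustrative variant of subset_and_embed_counts.py: subset contigs by length
# cutoff (rather than by sample size) before normalizing and embedding.
import os

import pandas as pd
from Bio import SeqIO

from autometa.common import kmers


def get_lengths(fasta: str) -> pd.Series:
    """Return a contig -> length Series parsed from the metagenome assembly."""
    return pd.Series(
        {record.id: len(record.seq) for record in SeqIO.parse(fasta, "fasta")},
        name="length",
    )


# Placeholder filepaths matching the example above
counts = pd.read_csv("counts.tsv", sep="\t", index_col="contig")
lengths = get_lengths("metagenome.fna")
outdir = "length_cutoff_embeddings"
os.makedirs(outdir, exist_ok=True)

for cutoff in [1000, 3000, 5000, 10000]:
    # Keep only contigs at or above the current length cutoff
    contigs = lengths[lengths >= cutoff].index
    subset = counts.loc[counts.index.isin(contigs)]
    norm_df = kmers.normalize(subset, method="am_clr")
    out = os.path.join(outdir, f"kmers.cutoff_{cutoff}.embedded.tsv")
    kmers.embed(kmers=norm_df, out=out, pca_dimensions=50, method="bhsne", embed_dimensions=2)
    print(f"Wrote cutoff {cutoff} embedding to {out}")
```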
jason-c-kwan commented 3 years ago

I realized that the tasks should probably be made a bit more granular.

jason-c-kwan commented 3 years ago

@WiscEvan Could you check out the script of video 1 and let me know what you think?

evanroyrees commented 3 years ago

I've updated my comment so the checklist can be reviewed upon arriving at the page, and so that it is all in one place.

jason-c-kwan commented 3 years ago

The video 8 idea seems to be a bit redundant with video 2. In order to explain why we need to calculate the coverage, I will have to bring up how we use two BH-tSNE dimensions and one coverage dimension. I'm not sure what else would need to be said.
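
As a reference for that point, here is a minimal sketch of how the two BH-tSNE dimensions and one coverage dimension could be joined into a 3-column feature matrix and clustered. The filepaths and column names are assumed, and the single DBSCAN pass only illustrates the idea rather than Autometa's recursive implementation:

```python
#!/usr/bin/env python
# Illustrative only: combine two BH-tSNE dimensions with one coverage dimension
# and cluster the resulting 3-D feature matrix with a single DBSCAN pass.
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Placeholder filepaths; both tables are assumed to be indexed by contig
embedded = pd.read_csv("kmers.embedded.tsv", sep="\t", index_col="contig")  # assumed columns: x, y
coverages = pd.read_csv("coverages.tsv", sep="\t", index_col="contig")      # assumed column: coverage

# Join on the contig index to build the 3-column feature matrix
features = embedded[["x", "y"]].join(coverages["coverage"], how="inner")

# Scale features so the coverage axis is comparable to the embedding axes
X = StandardScaler().fit_transform(features)

# A single DBSCAN pass; -1 labels mark unclustered contigs
features["cluster"] = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)
print(features["cluster"].value_counts())
```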

evanroyrees commented 3 years ago

Yeah, video 8 could probably be grouped in with video 9.