dahak-metagenomics / dahak

Benchmarking and containerization of tools for analysis of complex non-clinical metagenomes.
https://dahak-metagenomics.github.io/dahak
BSD 3-Clause "New" or "Revised" License

Gene-level summaries of metagenomes – appropriate for dahak? #38

Open sminot opened 6 years ago

sminot commented 6 years ago

I've been working on gene-level summaries of metagenomes by exhaustively mapping reads with DIAMOND against a comprehensive non-redundant database of proteins (UniRef90). The result of this process is a list of which genes are present in a sample, whether or not they have a known function.
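
Concretely, the core of the pipeline boils down to something like the following (a rough sketch; the file names are placeholders and the exact DIAMOND sensitivity settings are a matter of taste):

```
# Build a DIAMOND database from the UniRef90 protein FASTA (path is a placeholder)
diamond makedb --in uniref90.fasta --db uniref90

# Translated search of the reads against that database; tabular output (outfmt 6)
# gives one line per read-to-protein hit, which then gets summarized per gene
diamond blastx --db uniref90 --query sample_reads.fastq.gz \
    --out sample_hits.tsv --outfmt 6 --threads 8
```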

All of my code thus far is in https://github.com/fhcrc-microbiome/docker-diamond

My question for you is: do you think this type of gene-level data is useful for the dahak project? Also, I'm thinking of making genome-level summaries from this data (based on which of the UniRef90 references map to which genomes in GenBank). Would that genome-level data be useful for dahak?

If so, I'm happy to contribute. Just wanted to run it by you all first.

brooksph commented 6 years ago

Hi Sam! Your plans sound really exciting; we'd be happy to have you contribute, and both the gene- and genome-level data would be useful for us. We're in the process of generating some additional datasets, but for now we have four: one complete dataset and three generated by subsampling the original data at 10, 25, and 50 percent of the reads. You can access these data on the Open Science Framework in project dm938. The commands are in the read filtering protocol. You just need an account, or I can get you the direct links. I can also get you the trimmed data if that's helpful. If you'd like, we can chat more about the specific goals of the project.

kternus commented 6 years ago

Related to this thread, @sminot posted a preprint for Functional Analysis of Metagenomes by Likelihood Inference (FAMLI) today: https://www.biorxiv.org/content/early/2018/04/05/295352

Software here: https://github.com/FredHutch/FAMLI Docker images here: https://quay.io/repository/fhcrc-microbiome/famli

Thanks for sharing your work online, @sminot! Any new thoughts you'd like to share since last year? It looks like you put a lot of thought into simulating datasets to evaluate how well FAMLI was capturing metagenome gene content. Are those datasets available too?

sminot commented 6 years ago

I'm happy to make those datasets available, just let me know where they can go.

The area where I hoped this would overlap is using the gene content of an isolate as a measure of strain-level identity. Is that something of interest to this group?

brooksph commented 6 years ago

Hey @sminot, how big are your datasets? We have been storing all of our datasets on the Open Science Framework. It's a really nice resource with free storage, but there is a 5 GB per-file limit.

I will take a deeper dive into your software and preprint before I come up with some solid ideas, but offhand this seems like something that would be really helpful for us. We would definitely be interested in generating gene-level summaries of some of our mock communities. I think one place where this could really come in handy is with our contig annotation benchmarking. We've generated some assemblies with SPAdes and MEGAHIT and then annotated them with Prokka. We noticed that there are large differences in the names and number of annotations generated from the two assemblers (https://github.com/dahak-metagenomics/dahak/blob/master/workflows/functional_inference/prokka_annotation_megahit/Contig_annotation_comparison.ipynb), but we don't have a truth gene set, so it's hard to draw any firm conclusions about what's going on there.
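
To give a sense of what that comparison looks like, it's essentially the following (paths are placeholders; the real commands are in the workflow linked above):

```
# Annotate each assembly with Prokka (contig FASTA paths are placeholders)
prokka --outdir prokka_spades --prefix spades spades_contigs.fasta
prokka --outdir prokka_megahit --prefix megahit megahit_contigs.fasta

# Crude comparison: count the CDS features each annotation reports
grep -c $'\tCDS\t' prokka_spades/spades.gff
grep -c $'\tCDS\t' prokka_megahit/megahit.gff
```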

sminot commented 6 years ago

The simulated communities are both < 1 GB, so I'd be happy to add them to your OSF project. When you think about annotating genes, it's also good to keep an eye on which reference database you're using for the annotation. In either case, I'm happy to upload the simulated datasets and the Prokka annotations of those reference genomes, if it would be helpful. Just let me know.

brooksph commented 6 years ago

Good point! Do you have an OSF account? I need your username to give you read/write privileges. @kternus and @stephenturner do you also want read/write access?

sminot commented 6 years ago

I just signed up for an OSF account with my email (sminot@fredhutch.org)

brooksph commented 6 years ago

@sminot @kternus and @stephenturner, you all have read and write access. @sminot, our project files are here: https://osf.io/dm938/. Can you stick them in the data directory?
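
If the web uploader is awkward for files that size, the osfclient command-line tool is one option (a sketch; see the osfclient docs for authentication setup, and the remote path here is just the data directory mentioned above):

```
# Sketch of an upload with osfclient; assumes OSF credentials are already configured
pip install osfclient
osf -p dm938 upload simulated_reads.fastq.gz data/simulated_reads.fastq.gz
```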

sminot commented 6 years ago

I just uploaded the data (fastq + abundance CSV) to that directory.

sminot commented 6 years ago

I'm also happy to contribute a Docker image that runs FAMLI (see the recent preprint from my group, https://www.biorxiv.org/content/early/2018/04/05/295352, and the blog post, https://www.minot.bio/home/2018/4/4/famli).

The Docker image is currently hosted publicly on Quay: https://quay.io/repository/fhcrc-microbiome/famli?tab=builds

I'm happy to add whatever code you need to make it compatible with your existing workflows, if you can point me to some documentation.

The output would be a list of protein-coding genes and their abundances in a sample of interest.
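
For anyone who wants to try it out, pulling the image is straightforward (a sketch; the trailing famli invocation is a placeholder, so check the FAMLI repo for the actual arguments):

```
# Pull the public image from Quay
docker pull quay.io/fhcrc-microbiome/famli

# Run it with the working directory mounted so the container can see local files
# (the trailing command is a placeholder; see the FAMLI README for real usage)
docker run --rm -v "$(pwd)":/data -w /data quay.io/fhcrc-microbiome/famli famli --help
```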

brooksph commented 6 years ago

Hey @sminot! Thanks for putting the data on the OSF.

For the most part, we've contributed to the BioContainers project by submitting recipes to Bioconda, which automates the container builds and stores the Docker images on quay.io or Docker Hub. Sometimes updating the recipes there is difficult, so occasionally we've built our own images and hosted them on quay.io. The only requirements we've set for images on this project are that they be public and versioned, so I think we're good there. Excited about using FAMLI!
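
For reference, the Bioconda route is roughly the following (a rough outline only; the recipe contents themselves need to follow the Bioconda guidelines, and the fork URL is a placeholder):

```
# Rough outline of contributing a recipe to Bioconda, which also produces a
# BioContainers image on quay.io automatically once the recipe is merged
git clone https://github.com/<your-fork>/bioconda-recipes
cd bioconda-recipes
mkdir -p recipes/famli
# write recipes/famli/meta.yaml (and build.sh if needed), then push a branch
# and open a pull request against bioconda/bioconda-recipes
```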