Add functionality to process amplicon resequencing data (for phylogenetics)

griffinp commented 9 years ago

[Suggestion to add amplicon resequencing data-processing functionality, brief description of new scripts to aid in this, and request for guidance on unit testing or code review]

Justification for adding functionality

Phylogenies are limited in power unless they contain multiple nuclear and organellar loci, and multiple individuals per species. Next-generation sequencing makes it affordable to obtain this kind of increased marker resolution, and also avoids problems associated with sequencing multiple marker copies and polyploids that have complicated phylogenetic inference in the past. However, general-purpose, easy-to-use software for processing this kind of data is lacking. The available approaches tend to focus on a single, specific genetic marker, or on detecting known genetic variants in a heterozygote (e.g. disease-associated variants). Other studies have built their own workflows from numerous independent software tools, an approach that is extremely challenging for the average molecular taxonomist to recreate.

I suggest that adding amplicon-resequencing functionality to QIIME would be valuable and that it would be widely used. Even current QIIME users who focus on metagenomics are likely to have targeted amplicon resequencing data to deal with from time to time, and there are many taxonomic researchers doing both small- and large-scale projects who would greatly benefit from this tool.

Description of my possible contribution

I have designed a workflow that uses existing QIIME scripts for early-stage data processing, very conservative read clustering with CD-HIT, and BLAST matching to a reference sequence list (using an alternative version of the standard blast_wrapper script that indicates read direction). The workflow then incorporates three new scripts (see below).
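As an illustration of what "indicating read direction" from a BLAST match can look like, here is a minimal sketch. It assumes standard 12-column tabular BLAST output (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore); the function name `read_direction` is hypothetical and this is not the modified blast_wrapper script from the fork.

```python
def read_direction(blast_line):
    """Infer hit orientation from one line of 12-column tabular BLAST output.

    Assumption: a minus-strand hit is reported with one pair of
    coordinates reversed (start > end) relative to the other pair.
    Returns (query_id, subject_id, strand) with strand "+" or "-".
    """
    fields = blast_line.rstrip("\n").split("\t")
    if len(fields) < 12:
        raise ValueError("expected 12 tab-separated columns")
    query_id, subject_id = fields[0], fields[1]
    qstart, qend = int(fields[6]), int(fields[7])
    sstart, send = int(fields[8]), int(fields[9])
    # Exactly one reversed coordinate pair means the read matches the
    # reverse complement of the reference.
    reversed_hit = (qstart > qend) != (sstart > send)
    return query_id, subject_id, "-" if reversed_hit else "+"


# Example hit lines (made-up identifiers and values, for illustration only)
forward = "read1\tampA\t99.0\t100\t1\t0\t1\t100\t151\t250\t1e-50\t180"
reverse = "read2\tampA\t99.0\t100\t1\t0\t1\t100\t250\t151\t1e-50\t180"
print(read_direction(forward))
print(read_direction(reverse))
```

Reads flagged "-" would need to be reverse-complemented before alignment so that all reads in a cluster share one orientation.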

I've used these scripts successfully on several datasets. They are available in my GitHub fork of the QIIME repository: https://github.com/griffinp/qiime-amp_reseq

I believe they adhere pretty well to the Coding Guidelines, but I have not yet fully implemented unit testing (yes, I know you're supposed to write unit tests as you go, but, not having a programming background, I have found it very difficult to work out exactly how this should be done).

So, I am not quite sure how to proceed now. I guess I first need to check that the QIIME developers think adding this functionality would be useful. My code is not ready for a pull request (since the tests are lacking), but if anybody can provide some code review, or point me to any step-by-step guidelines for setting up unit testing, that would be great.

Thanks very much in advance for your assistance.

Pip Griffin
pip.griffin@gmail.com

New scripts in 'amp_reseq' workflow

process_clusters: This script identifies the top 'n' clusters per amplicon per individual ('top' in terms of cluster membership size), obtains all the relevant reads per cluster, and outputs a new .fasta file for each cluster.

align_per_cluster: This script then uses MAFFT to align each cluster and output a consensus sequence. In practice this consensus sequence represents either an amplicon fragment (if amplicons were fragmented in the process of library construction), or an entire amplicon (if amplicons were not fragmented).

assemble_per_amplicon: For each amplicon, this script aligns all the cluster consensus fragments to a reference amplicon sequence using MAFFT, and outputs the resulting assembly (without the reference sequence). Running this script is only necessary if amplicons were fragmented during library construction. Users can then check the assembly manually and either create their own consensus sequence, or identify multiple amplicon copies/alleles within the assembly.
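
The two core ideas in the first two scripts (ranking clusters by membership size, then deriving a consensus from an alignment) can be sketched roughly as follows. This is an illustrative sketch only, not the code in the fork: the function names, the dict-of-lists cluster representation, and the simple majority-rule consensus are all assumptions, and the alignment itself (done with MAFFT in the workflow) is taken as already computed.

```python
from collections import Counter


def top_n_clusters(clusters, n):
    """Return the n largest clusters as (cluster_id, read_ids) pairs.

    clusters: dict mapping a cluster id to its list of read ids
    (a hypothetical in-memory form of the clustering output).
    Ties in size are broken by cluster id so the result is stable.
    """
    ranked = sorted(clusters.items(), key=lambda item: (-len(item[1]), item[0]))
    return ranked[:n]


def majority_consensus(aligned_seqs, gap="-"):
    """Majority-rule consensus of equal-length aligned sequences.

    Columns whose most common character is a gap are dropped.
    """
    if not aligned_seqs:
        raise ValueError("no sequences given")
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("aligned sequences must have equal length")
    consensus = []
    for column in zip(*aligned_seqs):
        base, _count = Counter(column).most_common(1)[0]
        if base != gap:
            consensus.append(base)
    return "".join(consensus)


# Made-up example data, for illustration only
clusters = {"c1": ["r1", "r2", "r3"], "c2": ["r4"], "c3": ["r5", "r6"]}
print(top_n_clusters(clusters, 2))
print(majority_consensus(["ACGT-A", "ACGTTA", "ACCT-A"]))
```

In the described workflow the ranking would be applied separately for each amplicon and each individual, so the input dict here stands in for one such group.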

ElDeveloper commented 9 years ago

@griffinp I am unfamiliar with these use-cases/technologies so I will let others (@gregcaporaso, @antgonza) provide answers to your questions. The following is a short summary on how to get started on unit testing: http://scikit-bio.org/docs/latest/development/coding_guidelines.html#how-should-i-test-my-code
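
For readers looking for a concrete starting point, a unit test written with Python's standard `unittest` module can be as small as the sketch below. The helper `reverse_complement` is a hypothetical stand-in for a function from one of the new scripts; it is an assumption for illustration and not part of QIIME or scikit-bio.

```python
import unittest


def reverse_complement(seq):
    """Hypothetical function under test: reverse-complement a DNA string."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq.upper()))


class ReverseComplementTests(unittest.TestCase):
    """One test method per behaviour, each with a known input and output."""

    def test_simple_sequence(self):
        self.assertEqual(reverse_complement("AACG"), "CGTT")

    def test_empty_sequence(self):
        self.assertEqual(reverse_complement(""), "")

    def test_invalid_character_raises(self):
        # Characters outside A/C/G/T/N are rejected by the dict lookup.
        with self.assertRaises(KeyError):
            reverse_complement("AXG")
```

Saved in a file whose name starts with `test_`, this can be run with `python -m unittest discover`. The general pattern is to keep each script's logic in small functions that take and return plain values, so that they can be tested without touching files or external programs.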

On (Nov-03-14|19:57), griffinp wrote:

[Suggestion to add amplicon resequencing data-processing functionality, brief description of new scripts to aid in this, and request for guidance on unit testing or code review]

Reply to this email directly or view it on GitHub: https://github.com/biocore/qiime/issues/1706

antgonza commented 9 years ago

Hi @griffinp, this is potentially pretty cool. Thanks for approaching us. The easiest route to code review is to open a PR, but before that let me ask you a few questions so we can give you better guidance:

Do you see this as a separate tool that has qiime/skbio/other as a dependency, or as part of qiime? Both options have pros and cons.
When do you expect to have the scripts/tests ready? This is important because we are approaching the end of the 1.9 release cycle, so if this is not ready in the next 10 days it will go into the 2.0 cycle, for which we do not have an actual release date.
Have you used these scripts for published work? If so, can you send us the PubMed IDs?

If you want to discuss this directly with us, please email me those answers.

griffinp commented 9 years ago

Thanks for the replies @ElDeveloper and @antgonza. I've emailed @antgonza in response to his questions and am working on the unit testing.
