galaxyproject / tools-iuc

Tool Shed repositories maintained by the Intergalactic Utilities Commission
https://galaxyproject.org/iuc
MIT License

Prototype HIV NGS pipeline #1020

Open spond opened 7 years ago

spond commented 7 years ago

Following our discussion with @nekrut, I am creating an issue to outline the essential components of a bare-bones genomic analysis pipeline for HIV (or another short RNA virus).

Key steps for amplicon data (a very preliminary outline; many steps need to be fleshed out, and the initial focus is on wrapping existing tools that we have worked with in the past):

  1. Initial QC on NGS data

    • [x] Standard FASTQ filtering tools (q-score, read length, N's), e.g. qfilt
    • [x] paired end merging
  2. (If present in run) random tag (PRIMER ID) processing

    • [ ] Binning and consensus construction
    • [ ] Filtering using an error model, like the one from the Swanstrom Lab
  3. Map to reference (amplicon data), FASTQ => BAM

    • [ ] Custom mappers (direct alignment, e.g. Smith Waterman)
    • [x] [bealign](https://github.com/veg/BioExt/)
    • [ ] Standard short read mappers
    • [ ] Coverage maps
  4. Filter errors (amplicon data). Need more options here

    • [ ] Model-based multinomial filtering. Currently aligned FASTA => filtered aligned FASTA; should be BAM => filtered BAM
    • [ ] See if multiple alignments can be replaced with VCFs or something along these lines
  5. Basic post processing based on TN93

    • [ ] Generate consensus sequences
    • [ ] Sliding window haplotypes (simple read merging, large overlap)
    • [ ] Report amino-acid frequencies by position (and DRAMs) -- need to write
  6. Basic phylogenetics (need to pull out some code from https://github.com/veg/HIV-NGS)

  7. HIV clustering

    • [ ] hivnetworkcsv
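
The filtering criteria named in step 1 (q-score, read length, N's) can be sketched in a few lines of Python. This is only an illustration of the idea, not how qfilt is implemented: the function names and the thresholds (`min_mean_q`, `min_length`, `max_n_fraction`) are assumptions made for the example, and the parser assumes plain 4-line FASTQ records with Phred+33 qualities.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    seq: str
    qual: str  # Phred+33 encoded quality string (assumed encoding)


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from 4-line FASTQ text (sequences must not be wrapped)."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # skip blank lines between records
        seq = next(it).rstrip("\n")
        next(it)  # '+' separator line, ignored
        qual = next(it).rstrip("\n")
        yield FastqRecord(header[1:], seq, qual)


def mean_phred(qual: str, offset: int = 33) -> float:
    """Mean Phred score of a quality string."""
    return sum(ord(c) - offset for c in qual) / len(qual)


def passes_qc(
    rec: FastqRecord,
    min_mean_q: float = 20.0,      # hypothetical threshold
    min_length: int = 50,          # hypothetical threshold
    max_n_fraction: float = 0.01,  # hypothetical threshold
) -> bool:
    """Apply the three step-1 criteria: read length, N content, mean q-score."""
    if not rec.seq or len(rec.seq) < min_length:
        return False
    if rec.seq.upper().count("N") / len(rec.seq) > max_n_fraction:
        return False
    return mean_phred(rec.qual) >= min_mean_q
```

Usage would look like `kept = [r for r in parse_fastq(open("reads.fastq")) if passes_qc(r)]`, where `reads.fastq` is a placeholder file name.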

A few references to existing (published) pipelines

  1. The VEG group pipeline
  2. Sanger WGS
  3. ShoRAH
  4. Richard Neher's intrahost analysis app
  5. An older comparative study of haplotype reconstruction

WRAP DATAMONKEY INTO GALAXY

stevenweaver commented 7 years ago

2016-11-16 update: qfilt, bealign, and tn93 have been added to the test toolshed hosted by the core Galaxy team at PSU. The tools currently reside in the veg fork of the tools-iuc repo, and they are installed on local Galaxy instances in the VEG group. All planemo tests pass, and BioExt's Python packaging bug (users installing through pip would hit "header not found" errors) has been resolved.

bgruening commented 7 years ago

@stevenweaver top!!!

stevenweaver commented 7 years ago

@bgruening this progress wouldn't have been possible without @davebx

bgruening commented 7 years ago

That's what makes Galaxy so great - community! @davebx thanks a bunch! And let me know if you need help with the pending conda package.

nekrut commented 6 years ago

2/20 Meeting Notes @nekrut @davebx

NEXT MEETING FRI @ 3 (March 2)

spond commented 6 years ago

@nekrut: here's the paper on PRIMER ID processing: http://jvi.asm.org/content/89/16/8540.full.pdf+html
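
For orientation, the "binning and consensus construction" item for PRIMER ID data could look roughly like the sketch below. It is not the method from the paper: it assumes the random tag has already been extracted from each read, that reads in a bin are the same length (already aligned), and it does no error-model filtering. The names `bin_by_primer_id`, `consensus`, and `primer_id_consensus`, and the `min_bin_size` and `min_agreement` thresholds, are made up for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def bin_by_primer_id(reads: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (tag, sequence) pairs by their random tag."""
    bins: Dict[str, List[str]] = defaultdict(list)
    for tag, seq in reads:
        bins[tag].append(seq)
    return dict(bins)


def consensus(seqs: List[str], min_agreement: float = 0.5) -> str:
    """Column-wise majority vote over equal-length sequences.

    A position becomes 'N' unless the most common base is found in
    more than `min_agreement` of the sequences.
    """
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences must be non-empty and of equal length")
    out = []
    for column in zip(*seqs):
        base, count = Counter(column).most_common(1)[0]
        out.append(base if count / len(seqs) > min_agreement else "N")
    return "".join(out)


def primer_id_consensus(
    reads: Iterable[Tuple[str, str]],
    min_bin_size: int = 3,  # hypothetical cutoff; small bins are dropped
) -> Dict[str, str]:
    """Build one consensus sequence per tag with enough supporting reads."""
    return {
        tag: consensus(seqs)
        for tag, seqs in bin_by_primer_id(reads).items()
        if len(seqs) >= min_bin_size
    }
```

A real implementation would choose the bin-size cutoff from an error model instead of a fixed constant.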

davebx commented 6 years ago

By my estimation, bealign is IUC-ready; TN93 and qfilt need a few best-practices tweaks.

nekrut commented 6 years ago

Examples of HIV processing pipeline

An example paper is Gianella et al. 2016

  1. Take individual patient level reads; there are multiple time points per individual
  2. Error correct them / assemble (these are short amplicons, so assembly is not even that important)
  3. Build phylogenies and interpret the results

Another example, which uses full-length genome data, is Zanini et al. 2015.

The basic workflow

  1. Assemble individual HIV-genomes
  2. Call variants relative to the genome
  3. Perform downstream analysis
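
Step 2 of this workflow (calling variants relative to the genome) can be illustrated with a toy frequency-based caller. This is a sketch under strong assumptions, not what either paper does: reads are taken to be already aligned to the reference as equal-length strings with `-` for gaps, only substitutions are reported, and the `min_frequency` and `min_depth` cutoffs are arbitrary placeholders.

```python
from collections import Counter
from typing import List, NamedTuple


class Variant(NamedTuple):
    position: int  # 1-based position in the reference
    ref: str
    alt: str
    frequency: float
    depth: int


def call_variants(
    reference: str,
    aligned_reads: List[str],
    min_frequency: float = 0.05,  # hypothetical cutoff
    min_depth: int = 10,          # hypothetical cutoff
) -> List[Variant]:
    """Report non-reference bases seen at or above `min_frequency`."""
    variants = []
    for i, ref_base in enumerate(reference):
        # Count informative bases covering this position (skip gaps and N's).
        counts = Counter(
            read[i] for read in aligned_reads
            if i < len(read) and read[i] not in "-N"
        )
        depth = sum(counts.values())
        if depth < min_depth:
            continue
        for base, count in sorted(counts.items()):
            freq = count / depth
            if base != ref_base and freq >= min_frequency:
                variants.append(Variant(i + 1, ref_base, base, freq, depth))
    return variants
```

In a Galaxy workflow this step would more likely start from a BAM file and emit VCF; the string-based input here only keeps the example self-contained.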

nekrut commented 6 years ago

Meeting 5/30

stevenweaver commented 6 years ago

Meeting 6/6

davebx commented 6 years ago

Meeting June 20:

davebx commented 6 years ago

Meeting July 18: