broadinstitute / adapt

A package for designing activity-informed nucleic acid diagnostics for viruses.
MIT License

Automatically download and prepare input for design #10

Closed haydenm closed 5 years ago

haydenm commented 5 years ago

This PR contains new features to automatically fetch data and prepare the input for crRNA design (e.g., sequence alignments and metadata).

It adds the following modules:

The main workflow, implemented in prepare_alignment, is as follows:

  1. Download a list of accessions to use as input
  2. Download sequences and metadata for these accessions
  3. Curate the sequences against a collection of reference sequences
  4. Cluster the unaligned sequences
  5. Generate one alignment per cluster
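As a rough illustration of how these five steps chain together, here is a minimal sketch. It is not the code in prepare_alignment: the function name `prepare_alignments` and all five callables are hypothetical stand-ins, and sequences are modeled simply as a dict from accession to sequence string.

```python
from typing import Callable, Dict, List

Seqs = Dict[str, str]  # accession -> sequence string (an assumed representation)


def prepare_alignments(
    fetch_accessions: Callable[[], List[str]],
    fetch_sequences: Callable[[List[str]], Seqs],
    curate: Callable[[Seqs], Seqs],
    cluster: Callable[[Seqs], List[List[str]]],
    align: Callable[[Seqs], Seqs],
) -> List[Seqs]:
    """Run the five steps in order; every callable is a hypothetical stand-in."""
    accessions = fetch_accessions()      # 1. list of accessions
    seqs = fetch_sequences(accessions)   # 2. sequences (metadata omitted here)
    seqs = curate(seqs)                  # 3. curate against references
    clusters = cluster(seqs)             # 4. cluster the unaligned sequences
    # 5. one alignment per cluster
    return [align({name: seqs[name] for name in names}) for names in clusters]
```

Passing the steps in as callables keeps the sketch runnable without network access or an external aligner; a real implementation would call out to those tools at steps 1, 2, and 5.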

This also renames the main executable from design_guides.py to design.py [2ed4840, 6146064].

The PR adds a subcommand to design.py that specifies the input type [8f422b3, b105c17, 7a1ebff]. The input type can be fasta (as previously required), auto-from-args, or auto-from-file. auto-from-args takes a taxonomic ID (along with a segment and reference sequences) from the command-line arguments, while auto-from-file reads these from a file that can list multiple taxonomies, which enables differential identification. Critically, design.py links preparation with design: after preparing the input, it automatically designs from it.
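For readers unfamiliar with subcommand-style interfaces, the sketch below shows one way such an input-type subcommand could be wired up with argparse. Only the three subcommand names come from this PR; the positional argument names (`in_fasta`, `tax_id`, `segment`, `ref_accs`, `in_tsv`) and the example values are assumptions for illustration and may not match design.py.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="design.py")
    # One subcommand per input type; the chosen name is stored in args.input_type.
    sub = parser.add_subparsers(dest="input_type", required=True)

    fasta = sub.add_parser("fasta", help="design from existing FASTA input")
    fasta.add_argument("in_fasta", nargs="+")  # hypothetical argument name

    from_args = sub.add_parser(
        "auto-from-args", help="fetch and prepare input for one taxonomy")
    from_args.add_argument("tax_id", type=int)  # hypothetical argument names
    from_args.add_argument("segment")
    from_args.add_argument("ref_accs")

    from_file = sub.add_parser(
        "auto-from-file", help="read taxonomies to prepare from a file")
    from_file.add_argument("in_tsv")  # hypothetical argument name
    return parser


# Placeholder values, not real identifiers.
args = build_parser().parse_args(["auto-from-args", "12345", "None", "REF_ACC"])
```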

This also adds a feature (--sample-seqs) to sample input data randomly with replacement [18cc35c, 9491b03]. This can be useful for obtaining a measure of dispersion on the output designs.
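Sampling with replacement is the same resampling idea used in bootstrapping. A minimal sketch follows, assuming sequences are held in a dict from name to sequence string; the function name and the renaming of repeated draws are illustrative choices, not necessarily what --sample-seqs does internally.

```python
import random
from typing import Dict, Optional


def sample_with_replacement(
    seqs: Dict[str, str], n: int, seed: Optional[int] = None
) -> Dict[str, str]:
    """Draw n sequences uniformly at random, allowing repeats."""
    rng = random.Random(seed)
    names = list(seqs)
    sampled = {}
    for i in range(n):
        name = rng.choice(names)
        # Suffix the draw index so that repeated draws keep distinct keys.
        sampled[f"{name}-{i}"] = seqs[name]
    return sampled
```

Running the design on several such samples (with different seeds) and comparing the outputs gives the measure of dispersion mentioned above.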

It adds the ability to memoize alignments and alignment statistics [d7709fc, 4ddddc7] via the --prep-memoize-dir argument. This is critical for runtime when designs are repeatedly generated on highly similar data.
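The general idea of memoizing to a directory can be sketched as follows. This is an assumption-laden illustration rather than the PR's implementation: the function name, the choice to key on a hash of the sorted accession list, and the JSON file layout are all hypothetical.

```python
import hashlib
import json
import os
from typing import Callable, Dict, List


def memoized_alignment(
    accessions: List[str],
    align_fn: Callable[[List[str]], Dict[str, str]],
    memo_dir: str,
) -> Dict[str, str]:
    """Return a cached alignment for this exact set of accessions, if any."""
    # Key on the sorted accessions so that input order does not matter.
    key = hashlib.sha256("\n".join(sorted(accessions)).encode()).hexdigest()
    path = os.path.join(memo_dir, key + ".json")
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    result = align_fn(accessions)  # the expensive step
    os.makedirs(memo_dir, exist_ok=True)
    with open(path, "w") as f:
        json.dump(result, f)
    return result
```

With a key like this, a cache hit requires the set of accessions to be identical between runs, so the savings come from clusters whose membership is unchanged.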

The PR also fetches and parses metadata for sequences. Namely, it uses collection years to implement the cover-by-year-decay feature on prepared input [e11bbb8, 0be27be, 70a5d95, 8a3c375].
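This description does not spell out how the decay is computed. One plausible reading, shown here purely as an assumption, is that the required coverage fraction for each collection year shrinks geometrically with that year's distance from the most recent one. The function name and parameters below are hypothetical.

```python
from typing import Dict, Iterable


def cover_frac_by_year(
    years: Iterable[int], base_frac: float, decay: float
) -> Dict[int, float]:
    """Map each collection year to an assumed required coverage fraction.

    Assumption: the latest year must be covered at base_frac, and a year
    n years earlier at base_frac * decay**n.
    """
    years = list(years)
    latest = max(years)
    return {y: base_frac * decay ** (latest - y) for y in sorted(set(years))}
```

Under this reading, older sequences still count toward the design but need not be covered as completely as recent ones.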