This PR contains new features to automatically fetch data and prepare the input for crRNA design (e.g., sequence alignments and metadata).
It adds the following modules:
ncbi_neighbors: for downloading genome neighbor lists, sequence data, and metadata from NCBI [eef6ff2, 62ac3c0], including influenza virus data from the flu database [5af92b7]
align: for curating sequence data by pairwise alignment to reference sequences, and generating multiple sequence alignments using mafft [d7709fc, 3645717]
cluster: for clustering sequences, with pairwise comparisons performed rapidly using MinHash signatures [c7c1dfe, 17e9104, 7809289, 042805c]
prepare_alignment: for combining the above to generate prepared input (e.g., alignments) from a taxonomic ID [3017a86, 72a858c]
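To illustrate why MinHash signatures make pairwise comparisons fast, here is a minimal sketch of a bottom-n MinHash estimate of Jaccard similarity over k-mers. The function names, the choice of k, and the signature size are assumptions for illustration and are not taken from the cluster module.

```python
import hashlib


def kmers(seq, k=4):
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(kmer):
    # Stable 64-bit hash (Python's built-in hash() is salted per process).
    return int.from_bytes(hashlib.sha1(kmer.encode()).digest()[:8], "big")


def minhash_signature(seq, k=4, n=50):
    """Bottom-n sketch: the n smallest hash values over the k-mers of seq."""
    return sorted({_hash(km) for km in kmers(seq, k)})[:n]


def estimate_jaccard(sig_a, sig_b, n=50):
    """Estimate the Jaccard similarity of two k-mer sets from their sketches."""
    # The n smallest hashes of the union act as a random sample of the union.
    union_bottom = sorted(set(sig_a) | set(sig_b))[:n]
    if not union_bottom:
        return 0.0
    shared = set(sig_a) & set(sig_b)
    # Fraction of that sample that appears in both sketches.
    return sum(1 for h in union_bottom if h in shared) / len(union_bottom)
```

Each sequence is reduced to a fixed-size signature once, so comparing two sequences costs time proportional to the signature size rather than to the sequence length.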
The main workflow, implemented in prepare_alignment, is as follows:
Download a list of accessions to use as input
Download sequences and metadata for these accessions
Curate the sequences against a collection of reference sequences
Cluster the unaligned sequences
Generate one alignment per cluster
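The five steps above can be sketched as a small pipeline. This is only an outline under stated assumptions: every callable passed in (fetch_accessions, fetch_sequences, curate, cluster, align) is a hypothetical stand-in, not the actual API of prepare_alignment or the modules it uses.

```python
def prepare_alignments(tax_id, fetch_accessions, fetch_sequences,
                       curate, cluster, align):
    """Run the five preparation steps; each step is an injected callable.

    All callables are hypothetical stand-ins for illustration.
    """
    # 1. Download a list of accessions to use as input.
    accessions = fetch_accessions(tax_id)
    # 2. Download sequences (name -> sequence) for these accessions.
    seqs = fetch_sequences(accessions)
    # 3. Curate the sequences against reference sequences.
    curated = curate(seqs)
    # 4. Cluster the unaligned sequences; each cluster is a list of names.
    clusters = cluster(curated)
    # 5. Generate one alignment per cluster.
    return [align({name: curated[name] for name in names})
            for names in clusters]
```

Passing the steps in as callables keeps the sketch testable without network access or an installed aligner.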
This also renames the main executable from design_guides.py to design.py [2ed4840, 6146064].
The PR adds a subcommand to design.py to specify the input type [8f422b3, b105c17, 7a1ebff]. The input type can be fasta (as previously required), auto-from-args, or auto-from-file. auto-from-args accepts a taxonomic ID (along with a segment and reference sequences) from the command-line arguments, while auto-from-file reads these from a file; the file can list multiple taxonomies, which enables differential identification. Critically, design.py links preparation with design: after preparing the input, it automatically designs from it.
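As a rough illustration of how input-type subcommands can be wired up, here is a sketch using argparse subparsers. The three subcommand names come from this description; the positional argument names (in_fasta, tax_id, segment, ref_accs, in_tsv) are assumptions and may not match the real interface of design.py.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="design.py")
    # The chosen subcommand is stored in args.input_type.
    subparsers = parser.add_subparsers(dest="input_type", required=True)

    # fasta: the previously required input type.
    fasta = subparsers.add_parser("fasta")
    fasta.add_argument("in_fasta", nargs="+")  # hypothetical name

    # auto-from-args: one taxonomy given on the command line.
    from_args = subparsers.add_parser("auto-from-args")
    from_args.add_argument("tax_id", type=int)  # hypothetical name
    from_args.add_argument("segment")           # hypothetical name
    from_args.add_argument("ref_accs")          # hypothetical name

    # auto-from-file: the same fields read from a file, one taxonomy per row.
    from_file = subparsers.add_parser("auto-from-file")
    from_file.add_argument("in_tsv")            # hypothetical name

    return parser
```

With this layout, each input type gets its own argument set while sharing a single entry point.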
This also adds a feature (--sample-seqs) to sample input data randomly with replacement [18cc35c, 9491b03]. This can be useful for obtaining a measure of dispersion on the output designs.
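Sampling with replacement is the basis of a bootstrap: repeating the design on several resampled inputs shows how much the output varies. A minimal sketch follows; the function name and the seed parameter are assumptions, not the implementation behind --sample-seqs.

```python
import random


def sample_with_replacement(seqs, n, seed=None):
    """Draw n (name, sequence) pairs from seqs, with replacement.

    seqs maps sequence names to sequences. A list of pairs is returned
    because the same sequence may be drawn more than once.
    """
    rng = random.Random(seed)
    names = sorted(seqs)  # fixed order so a given seed is reproducible
    return [(name, seqs[name]) for name in rng.choices(names, k=n)]
```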
It adds the ability to memoize alignments and alignment statistics [d7709fc, 4ddddc7], via the --prep-memoize-dir argument, which is critical for runtime if designs are to be repeatedly generated on highly similar data.
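The idea of memoizing to a directory can be sketched as follows. This version keys each result on a hash of the exact input sequences and stores it as JSON; how the PR actually keys and stores memoized alignments is not described here, so the function name, key scheme, and file format are all assumptions.

```python
import hashlib
import json
import os


def memoized_alignment(seqs, memo_dir, align_fn):
    """Return align_fn(seqs), reusing a result saved in memo_dir if present.

    seqs maps names to sequences; align_fn must return JSON-serializable data.
    """
    # Key on the sorted input so the same set of sequences hits the same file.
    key = hashlib.sha256(
        json.dumps(sorted(seqs.items())).encode()).hexdigest()
    path = os.path.join(memo_dir, key + ".json")
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    aligned = align_fn(seqs)
    os.makedirs(memo_dir, exist_ok=True)
    with open(path, "w") as f:
        json.dump(aligned, f)
    return aligned
```

On a repeated run with the same input, the aligner is skipped entirely and the stored result is read back from disk.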
The PR also fetches and parses metadata for sequences. Namely, it uses collection years to implement the cover-by-year-decay feature on prepared input [e11bbb8, 0be27be, 70a5d95, 8a3c375].
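The exact decay formula is not stated in this description. As one plausible reading, the sketch below assumes a geometric decay: the most recent collection year gets the full desired coverage, and each earlier year gets that value multiplied by a decay factor once per year of age. The function name and parameters are hypothetical.

```python
def cover_frac_by_year(years, base_frac, decay):
    """Map each collection year to a desired coverage fraction.

    Assumption (not confirmed by the source): the latest year gets
    base_frac, and a year that is n years older gets base_frac * decay**n.
    """
    latest = max(years)
    return {year: base_frac * decay ** (latest - year)
            for year in sorted(set(years))}
```

Under this assumption, older sequences still contribute to the design but count for less than recent ones.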