strainFlye is a pipeline for calling, analyzing, and phasing rare single-nucleotide mutations in metagenome-assembled genomes produced from long and accurate reads—for example, PacBio HiFi reads.
The main inputs to strainFlye are 1) reads and 2) contigs. However, most steps in the pipeline can be "jumped to" if you already have other files prepared (e.g. an alignment of reads to contigs, or existing single-nucleotide mutation calls within contigs). See below for a flow chart of strainFlye's pipeline (created with Mermaid); if you have trouble viewing this flow chart, here are some other ways to view it.
flowchart LR
classDef art fill:#acfeb1,stroke:#086408,color:black
classDef pgm fill:#88bbff,stroke:#3333aa,color:black
classDef ext fill:#ffbb88,stroke:#755237,color:black
0(("Contigs<br/>(FASTA)")):::art --> A:::pgm;
1(("Reads<br/>((gzipped) FASTA<br/>or FASTQ)")):::art ---> A
2(("Assembly graph<br/>(GFA 1)")):::art -.->|"(Optional)"| A
A[strainFlye align] --> 3(("Alignment of<br/>reads to contigs<br/>(Indexed BAM)")):::art
3 --> C["strainFlye call p-mutation<br/>(Call mutations<br/>using frequencies)"]:::pgm
0 --> C
3 --> D["strainFlye call r-mutation<br/>(Call mutations<br/>using read counts)"]:::pgm
0 --> D
C --> 4(("Called mutations<br/>(Indexed BCF)")):::art
C --> 5(("Diversity indices<br/>(TSV)")):::art
D --> 4
D --> 5
0 --> E[strainFlye fdr estimate]:::pgm
3 --> E
4 --> E
5 -..->|"(Optional)"| E
E --> 6(("FDR estimates<br/>(TSV)")):::art
E --> 7(("Number of<br/>mutations / Mb<br/>(TSV)")):::art
7 --> Z{"Plotting FDR curves<br/>(see tutorial)"}:::ext
6 --> Z
6 --> F[strainFlye fdr fix]:::pgm
4 --> F --> 11(("Mutations with<br/>fixed FDR<br/>(Indexed BCF)")):::art
11 --> G[strainFlye spot hot-features]:::pgm
8(("Contig #quot;features#quot;<br/>[probably genes]<br/>(GFF3)")):::art --> G
G --> 9(("Hotspot features<br/>(TSV)")):::art
11 --> H[strainFlye spot cold-gaps]:::pgm --> 10(("Coldspot gaps<br/>(TSV)")):::art
4 -.->|If desired, you can skip FDR estimation and fixing and use the unfiltered BCF file for downstream analyses| 11
11 --> I[strainFlye smooth create]:::pgm --> 12(("Smoothed and<br/>virtual reads<br/>(gzipped FASTA)")):::art
0 --> I
3 --> I
5 -.->|"(Optional)"| I
12 --> J[strainFlye smooth assemble]:::pgm --> 13(("LJA assemblies")):::art
11 --> K
0 --> K[strainFlye link nt]:::pgm --> 14(("Nucleotide<br/>(co-)occurrence<br/>information<br/>(Pickle)")):::art
3 --> K
14 --> L[strainFlye link graph]:::pgm --> 15(("Link graphs<br/>(DOT or Pickle)")):::art
0 --> M[strainFlye matrix count]:::pgm
3 --> M
8 ---> M
M --> 16(("3-mer count<br/>information<br/>(Pickle)")):::art
16 --> N[strainFlye matrix fill]:::pgm --> 17(("Mutation matrices<br/>(TSV or JSON)")):::art
0 --> O[strainFlye dynam covskew]:::pgm --> 18(("Coverage and<br/>skew statistics<br/>(TSV)")):::art
3 --> O
2 --> P[strainFlye utils gfa-to-fasta]:::pgm --> 0
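To make the idea of "jumping to" a step concrete, here is a minimal Python sketch that encodes the required inputs of a few commands, as read off the flow chart above, and reports which commands could be run from a given set of files. The artifact labels (`contigs`, `reads`, `bam`, `bcf`, `fdr_estimates`) and the function name are made up for this illustration; they are not part of strainFlye's interface, and optional inputs are left out.

```python
# Required inputs per command, transcribed from the flow chart above.
# The artifact labels are hypothetical names chosen for this sketch.
REQUIRED_INPUTS = {
    "align": {"contigs", "reads"},
    "call p-mutation": {"contigs", "bam"},
    "call r-mutation": {"contigs", "bam"},
    "fdr estimate": {"contigs", "bam", "bcf"},
    "fdr fix": {"bcf", "fdr_estimates"},
    "dynam covskew": {"contigs", "bam"},
}


def runnable_commands(available):
    """Return the commands whose required inputs are all in `available`."""
    have = set(available)
    return sorted(cmd for cmd, needed in REQUIRED_INPUTS.items() if needed <= have)


# With contigs and an existing alignment, the alignment step can be skipped.
print(runnable_commands({"contigs", "bam"}))
```

Only a subset of the commands is listed; the remaining ones could be added the same way by following the arrows in the chart.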
Long story short, strainFlye is an ordinary Python package (currently supporting Python 3.6 and 3.7). It also depends on a few non-Python tools (e.g. minimap2, Prodigal, SAMtools).
The simplest way to install strainFlye is by using conda (or mamba). strainFlye is available through the bioconda channel:
conda install -c bioconda strainflye
If you run into dependency conflicts when installing strainFlye into an existing conda environment, you may want to just create a new conda environment and install strainFlye into that. (See the "workaround solution" given in this issue for an example.)
If that still doesn't work, you can try installing from source -- see below.
And if that doesn't work, no worries -- installing this stuff can be finicky. Feel free to open an issue and I'll try to help out.
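When troubleshooting an installation, it can help to check whether the non-Python tools mentioned above are visible on your `PATH`. The following is a small sketch of such a check, not something shipped with strainFlye; the executable names listed are the usual ones for these tools and are an assumption here.

```python
import shutil


def find_tools(names):
    """Map each executable name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


# Executable names are assumptions; adjust them if your installation differs.
tools = find_tools(["minimap2", "samtools", "bcftools", "prodigal"])
for name, path in tools.items():
    print(f"{name}: {path or 'NOT FOUND'}")
```

Run this inside the conda environment you installed strainFlye into; a tool reported as not found there is a likely cause of later errors.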
Please see the development documentation (CONTRIBUTING.md) for instructions.
strainFlye smooth assemble
strainFlye's smooth module includes two commands. The first, strainFlye smooth create, creates smoothed and virtual reads for each contig; the second, strainFlye smooth assemble, assembles these reads using LJA. LJA is not installed using the conda installation instructions above, so in order to run the strainFlye smooth assemble command you will need to install the LJA software (in particular, the simple_ec branch of LJA).
Please see LJA's manual for the most up-to-date installation instructions. Assuming that you have all of LJA's requirements installed, something like the following should work:
git clone https://github.com/AntonBankevich/LJA.git
cd LJA
git checkout simple_ec
cmake .
make
... but this is subject to change as LJA is updated.
This Jupyter Notebook provides a tutorial demonstrating how to use strainFlye.
First, make sure you've activated the conda environment into which you installed strainFlye. From there, you can use strainFlye like any other command-line tool:
Usage: strainFlye [OPTIONS] COMMAND [ARGS]...
Pipeline for the analysis of rare mutations in metagenomes.
Please consult https://github.com/fedarko/strainFlye if you have any
questions, comments, etc. about strainFlye. Thank you for using this tool!
Options:
-v, --version Show the version and exit.
-h, --help Show this message and exit.
Commands:
align Align reads to contigs, and filter the resulting alignment.
call [+] Call mutations in contigs naïvely & compute diversity indices.
fdr [+] Estimate and fix FDRs for contigs' naïve mutation calls.
spot [+] Identify putative mutational hotspots or coldspots.
smooth [+] Create and assemble smoothed and virtual reads.
link [+] Create link graphs showing co-occurring alleles.
matrix [+] Create codon and amino acid mutation matrices.
dynam [+] Compute simple information about growth dynamics.
utils [+] Miscellaneous utility commands provided with strainFlye.
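If you would rather drive strainFlye from a Python script than from an interactive shell, the same commands can be launched with `subprocess`. The sketch below only queries the version via the `--version` option shown in the help text above; the function name is hypothetical, and the function returns `None` when the executable cannot be found.

```python
import shutil
import subprocess


def strainflye_version(executable="strainFlye"):
    """Return the output of `<executable> --version`, or None if it is not on PATH."""
    if shutil.which(executable) is None:
        return None
    result = subprocess.run(
        [executable, "--version"],  # a list of arguments, so no shell is involved
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return result.stdout.strip()


version = strainflye_version()
if version is None:
    print("strainFlye not found; is the conda environment activated?")
else:
    print(version)
```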
Click on the headers below (e.g. "Alignment") to show details about the corresponding command(s).
Input files / folders are listed in parentheses next to each command; optional inputs are given in [square brackets]. Parameters are omitted for the sake of brevity. For more details, try running these commands from the terminal (e.g. strainFlye align --help).
If you're interested in making changes to strainFlye's code, please see CONTRIBUTING.md for some tips on getting started.
An article describing strainFlye is available in Genome Research at this link. If you use strainFlye in your work, please cite it! The BibTeX for this article is:
@article{fedarko2022,
author={Fedarko, Marcus W and Kolmogorov, Mikhail and Pevzner, Pavel A},
title={Analyzing rare mutations in metagenomes assembled using long and accurate reads},
journal={Genome Research},
publisher={Cold Spring Harbor Laboratory Press},
volume={32},
number={11-12},
pages={2119--2133},
year={2022},
doi={10.1101/gr.276917.122},
url={http://genome.cshlp.org/content/32/11-12/2119.abstract},
eprint={http://genome.cshlp.org/content/32/11-12/2119.full.pdf+html}
}
If you use strainFlye align, please also cite minimap2 (Li 2018), which is used by this strainFlye command to perform alignment. (Depending on the version of minimap2 you use, you may also want to cite Li 2021.)
If you use strainFlye smooth assemble, please also cite LJA (Bankevich et al., 2022), which is used by this strainFlye command to perform assembly.
Some of strainFlye's commands use Python's subprocess module to run non-Python software: minimap2, samtools, bcftools, Prodigal, LJA, etc. Most of the time, we do this using subprocess.run() with shell=False: long story short, this helps prevent the problem of shell injection.
However, as of writing, there are two places where strainFlye uses subprocess.run() with shell=True: in strainFlye align (when running minimap2 and samtools), and in strainFlye smooth assemble (when running LJA). This is for convenience's sake, since we allow the user to pass in extra parameters to these commands (the --minimap2-params option for strainFlye align, and the --lja-params option for strainFlye smooth assemble).
Our use of shell=True in these two cases means that it's possible to make these commands do unexpected things (see Python's documentation here for details). This should not be a problem if you are running strainFlye directly. However, if you decide to host strainFlye on a server somewhere (and you allow users to upload files, specify parameters, etc.) then you should be careful about preventing shell injection in these cases. Feel free to open an issue if you have any questions about this.
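For anyone wrapping these commands in a hosted service, one common mitigation is to tokenize the user-supplied parameter string with `shlex.split()` and run the command with `shell=False`. The sketch below shows the idea. It is not how strainFlye itself invokes minimap2 or LJA, the function name is hypothetical, and a Python child process stands in for the external tool so that the example runs anywhere.

```python
import shlex
import subprocess
import sys


def run_with_extra_params(base_cmd, extra_params):
    """Run base_cmd plus a user-supplied parameter string without invoking a shell."""
    # shlex.split() tokenizes the string the way a POSIX shell would, but no
    # shell ever interprets it, so characters such as ';' remain literal text.
    cmd = list(base_cmd) + shlex.split(extra_params)
    return subprocess.run(
        cmd,
        shell=False,
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )


# Stand-in for an external tool: a child process that prints its arguments.
echo_args = [sys.executable, "-c", "import sys; print(sys.argv[1:])"]

# An illustrative "extra parameters" string containing an injection attempt.
result = run_with_extra_params(echo_args, "-t 4; echo injected")
print(result.stdout)
```

With `shell=True`, the text after the semicolon would have been executed as a second command; here it simply arrives at the child process as ordinary arguments. A real service would likely also want to validate the parameters against an allow-list.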
sample1.gfa (located in strainflye/tests/inputs/) was downloaded from the gfalint repository. The other GFA files in this folder beginning with sample1 are also based on this GFA file.
The strainFlye tutorial (linked above, and located in docs/) uses the "SheepGut" dataset shown in the strainFlye paper. This dataset was previously described in Kolmogorov et al., 2020 and Bickhart & Kolmogorov et al., 2022.
If you have any questions, suggestions, comments, etc. about strainFlye, please feel free to open an issue in this repository!