A Python library for creating and analyzing compacted colored de Bruijn Graphs powered by Bifrost with a NetworkX compatible API.
This library is still in an early development stage, and the API is still subject to change. Furthermore, not all functions from NetworkX are implemented yet.
Use the source tarball from Github to install the library:
pip install https://github.com/broadinstitute/pyfrost/releases/download/v0.1.0b1/pyfrost-0.1.0b1.tar.gz
PIP and Conda packages will be coming soon.
Build a graph from a list of references:
import pyfrost
g = pyfrost.build_from_refs(['data/ref1.fa', 'data/ref2.fa'])
Build a graph from reads:
g = pyfrost.build_from_samples([
'data/sample1.1.fq.gz', 'data/sample1.2.fq.gz',
'data/sample2.1.fq.gz', 'data/sample2.2.fq.gz'
])
Build a graph from both references and reads:
g = pyfrost.build(
['data/ref1.fa', 'data/ref2.fa'],
['data/sample1.1.fq.gz', 'data/sample1.2.fq.gz']
)
All calls above also accept optional parameters k
and g
for the k-mer and
minimizer size respectively.
Loading a colored graph created earlier with Bifrost:
g = pyfrost.load('graph.gfa')
Access some graph metadata:
>>> g.graph
{'k': 31, 'color_names': ['data/ref1.fa', 'data.ref2.fa']}
You can set custom metadata too:
>>> g.graph['custom_attr'] = "test"
Iterate over all unitigs (nodes) in the graph. Nodes are keyed by the first k-mer of the unitig.
>>> list(g.nodes)
[<Kmer 'TCGAA'>,
<Kmer 'CCACG'>,
<Kmer 'CGATG'>,
<Kmer 'ATGCG'>,
<Kmer 'GTGGC'>,
<Kmer 'ATCGA'>]
>>> len(g.nodes)
6
Access unitig sequence and node metadata
>>> str(g.nodes["TCGAA"])
TCGAAATCAGT
>>> g.nodes["TCGAA"]['unitig_sequence']
TCGAAATCAGT
>>> g.nodes["TCGAA"]['head']
<Kmer 'TCGAA'>
>>> g.nodes["TCGAA"]['tail']
<Kmer 'TCAGT'>
>>> g.nodes["TCGAA"]['strand']
Strand.FORWARD
>>> for k, v in g.nodes['TCGAA'].items():
...: print(k, "-", v)
...:
colors - <pyfrostcpp.UnitigColors object at 0x10d79d5b0>
unitig_sequence - TCGAAATCAGT
tail - TCAGT
length - 7
strand - Strand.FORWARD
mapped_sequence - TCGAAATCAGT
pos - 0
is_full_mapping - True
unitig_length - 7
head - TCGAA
length
and unitig_length
are in terms of number of k-mers, not nucleotides.
Mapped sequence and length will be discussed later.
Iterate over nodes and its neighbors:
>>> for n, neighbors in g.adj.items():
...: print("Current node:", str(g.nodes[n]))
...: for nbr in neighbors:
...: print("-", str(nbr))
...:
Current node: TCGAAATCAGT
Current node: CCACGGTGG
- GTGGC
Current node: CGATGC
- ATGCC
- ATGCG
Current node: ATGCGAT
- CGATG
Current node: GTGGCAT
- GCATC
Current node: ATCGA
- TCGAA
- TCGAT
Find a specific k-mer (not necessarily the head of a unitig):
>>> mapping = g.find('AAATC')
>>> for k, v in mapping.items():
...: print(k, "-", v)
...:
colors - <pyfrostcpp.UnitigColors object at 0x10d79d1b0>
unitig_sequence - TCGAAATCAGT
tail - TCAGT
length - 1
strand - Strand.FORWARD
mapped_sequence - AAATC
pos - 3
is_full_mapping - False
unitig_length - 7
head - TCGAA
As you can see, this k-mer is located on the same unitig used in an earlier
example. The metadata, however, now shows different values for
mapped_sequence
and length
, and is_full_mapping
is now False, because
this k-mer doesn't represent a whole unitig. head
and tail
still refer to
the head and tail k-mer of the whole unitig.
Bifrost builds colored compacted De Bruijn graphs, and keeps track which k-mers are observed in which references /samples. To obtain which colors are associated with a unitig or a k-mer, access 'colors' in the unitig metadata dict.
for n, data in g.nodes(data=True):
for c in data['colors']:
print("Node", n, "has color", c)
Note: By default data['colors']
iterates over all colors associated with any of the k-mers of that unitig.
However, it is possible that not all k-mers share the same colors. To access the colors of a specific k-mer of the
unitig use g.find()
or index:
for n, data in g.nodes(data=True):
print("Colors of first k-mer in the unitig:", set(data['colors'][0]))
>>> from pyfrost import Kmer, path_sequence, path_nucleotide_length
# Three nodes: ACTGATTTCGA, TCGAT, CGATGC
>>> path_sequence(g, [Kmer('ACTGA'), Kmer('TCGAT'), Kmer('CGATG')])
ACTGATTTCGATGC
>>> path_nucleotide_length(g, [Kmer('ACTGA'), Kmer('TCGAT'), Kmer('CGATG')])
14
Pyfrost includes a separate k-mer counter. It's still pretty unoptimzed and slow though, but it works.
>>> from pyfrost import KmerCounter
>>> counter = KmerCounter(31).count_kmers("ACTGCTAGCTAGCTACGTACGTACGATCGTACATGCATGC")
>>> counter["ACTGCTAGCTAGCTACGTACGTACGATCGT"]
1
# You can k-merize (gzipped) FASTA/FASTQ files directly
>>> counter = KmerCounter(31).count_kmers_files(["data/sample1.fq.gz", "data/sample.2.fq.gz"])
# Save counts for later use
>>> counter.save("sampe.counts")
# Load k-mer counts
>>> counter = KmerCounter.from_file("sample.counts")
Pyfrost is still under development and API is still subject to change.