@mrocklin wrote code using his excellent Toolz library to stream through a human genome dataset and build a Markov model (search for "human genome" in above notebook).
A lightly edited version:
from cytoolz import drop, pipe, sliding_window, merge_with, frequencies
from cytoolz.curried import map
from glob import glob
def genome(file_pattern):
"""Stream a genome from a list of FASTA filenames"""
return pipe(file_pattern, glob, sorted, # Filenames
map(open), # Open each file
map(drop(1)), # Drop header from each file
concat, # Concatenate all lines from all files together
map(str.upper), # Upper case each line
map(str.strip), # Strip off \n from each line
concat) # Concatenate all lines into one giant string sequence
def markov(seq):
"""Get a 2nd-order Markov model from a sequence"""
return pipe(seq, sliding_window(3), # Each successive triple{(A, A): {T: 10}}
frequencies, # Count occurrences of each triple
dict.items, map(markov_reshape), # Reshape counts so {(A, A, T): 10} -> {(A, A): {T: 10}}
merge_with(merge)) # Merge dicts from different pairs
def markov_reshape(item):
((a, b, c), count) = item
return {(a, b): {c: count}}
Code by @mrocklin Submitted by @mrocklin / @jni
@mrocklin wrote code using his excellent Toolz library to stream through a human genome dataset and build a Markov model (search for "human genome" in above notebook).
A lightly edited version:
Then:
Returns: