Derive a Markov model from the human genome in <10min with ~10 lines of code

Code by @mrocklin Submitted by @mrocklin / @jni

@mrocklin wrote code using his excellent Toolz library to stream through a human genome dataset and build a Markov model (search for "human genome" in above notebook).

A lightly edited version:

from cytoolz import drop, pipe, sliding_window, merge_with, frequencies
from cytoolz.curried import map
from glob import glob

def genome(file_pattern):
    """Stream a genome from a list of FASTA filenames"""
    return pipe(file_pattern, glob, sorted,        # Filenames
                                 map(open),        # Open each file
                                 map(drop(1)),     # Drop header from each file
                                 concat,           # Concatenate all lines from all files together
                                 map(str.upper),   # Upper case each line
                                 map(str.strip),   # Strip off \n from each line
                                 concat)           # Concatenate all lines into one giant string sequence

def markov(seq):
    """Get a 2nd-order Markov model from a sequence"""
    return pipe(seq, sliding_window(3),          # Each successive triple{(A, A): {T: 10}}
                     frequencies,                # Count occurrences of each triple
                     dict.items, map(markov_reshape),   # Reshape counts so {(A, A, T): 10} -> {(A, A): {T: 10}}
                     merge_with(merge))          # Merge dicts from different pairs

def markov_reshape(item):
    ((a, b, c), count) = item
    return {(a, b): {c: count}}

Then:

%time pipe('/home/mrocklin/data/human-genome/chr*.fa', genome, markov)

Returns:

CPU times: user 9min 30s, sys: 1.89 s, total: 9min 32s
Wall time: 9min 32s

elegant-scipy / elegant-scipy-submissions

Derive a Markov model from the human genome in <10min with ~10 lines of code #3