althonos / pymemesuite

Cython bindings and Python interface to the MEME suite, a collection of tools for the analysis of sequence motifs.
MIT License
10 stars 0 forks source link

`fasta-get-markov` implementation? #1

Open j-andrews7 opened 1 year ago

j-andrews7 commented 1 year ago

Hi there. Was pretty excited to stumble upon this project.

Do you have any plans to implement the fasta-get-markov functionality to generate a Markov model from sequences and pass to the corresponding FIMO call? I've got a hackathon project coming up in a few months for which this package would be incredibly useful.

I'd offer to implement it myself, but I am not familiar with cython at all. If this functionality is already possible, an example would be much appreciated!

althonos commented 1 year ago

Hi @j-andrews7,

To be fair the MEME suite code has been the most challenging for me to get working in the different projects I've worked on, so that's why I only made an alpha release to date. I'll try to have a look when I can, but I can't guarantee when (I'm on my final PhD year :sweat_smile: )!

j-andrews7 commented 1 year ago

I feel ya. I'll keep an eye on the repo, good luck in your final year!

althonos commented 1 year ago

Hi @j-andrews7,

In case the hackathon is not passed already (:yum:) I have improved the way background sequences are generated so that you can pass it to FIMO. This is not a complete implementation of fasta-get-markov, but since FIMO only expects zero-order Markov frequencies that's all I added.

Now FIMO methods expect a Background object instead of a raw frequencies array to pass for background frequencies when building the PSSM and/or computing p-values. A Background object can be created from various constructors:

from pymemesuite.common import Alphabet, Background, Sequence, Array
alphabet = Alphabet.dna()

# explicit constructor from an array of frequencies
bg = Background(dna, Array([0.27, 0.22, 0.22, 0.27]))  

# uniform background
bg_uniform = Background.from_uniform(dna)  

 # frequencies from the nr database
bg_nrdb = Background.from_nrdb(dna) 

# load from a background file, e.g. obtained with `fasta-get-markov`
bg_file = Background.from_file(dna, "background.txt") 

# compute from given sequences 
# (control reverse-complement and pseudocounts with the `rc` and `pseudo` keyword arguments)
bg_seq = Background.from_sequences(dna, Sequence("ATGC"), Sequence("ATTA"))

The last one is what you want, but you'll have to manage the loading from a FASTA file yourself (e.g. with Biopython). Afterwards you can use that background in Motif.build_pssm or in FIMO.score_motif.

j-andrews7 commented 1 year ago

Still a month off, plenty of time to spare! I will take a look. I also forgot that Biopython itself has motif scanning, which I may also give a try.