Motif finding in protein sequences.

Thernn88 commented 4 months ago

Hello! What a wonderful program you have here.

I am writing to ask whether lightmotif is capable of searching for motifs in protein sequences. It looks like the functionality is partially in place, but I couldn't get it to run with some simple code tweaks. It kept crashing as it tried to reverse complement a protein sequence. Maybe I'm doing something wrong?

I'm looking to use motif searching techniques to find short but conserved exons in genomes. Similarity approaches such as BLAST, DIAMOND, and HMMER aren't able to find these. We

When mining a genome, we know roughly where an exon is because it is nestled between or adjacent to longer exons. Thus we only have to search a minuscule fraction of the actual genome. We've done a basic proof of concept which shows the idea works. Unfortunately, all the reference sequences are only available as amino acids; the nucleotide format is not available for them. However, we are able to pass nucleotides or translations for the genome we are mining, if that would help.

We were hoping to use your software so we didn't have to reinvent the wheel, so to speak.

Is this protein motif searching possible in lightmotif? If not, is it something you can add?

althonos commented 4 months ago

Hi @Thernn88,

The Rust library is capable of running on protein sequences, although the implementation is slower than for DNA: DNA has only 5 symbols versus the 20 proteinogenic amino acids, and some optimizations don't work for larger alphabets.

If you tried using the Python library, I have not added protein support there yet because I need to hardcode the protein alphabets -- the alphabets must be defined at compile-time in LightMotif (this allows some loop unrolling based on the dimension of the alphabet, $K$). I can try adding support if that's something you need.

Thernn88 commented 4 months ago

Hi Althonos,

Yes! This would be very useful. We would greatly appreciate you adding this.

We aren't terribly worried about speed. The search space is so small for these (capped at 20,000 bp) that it shouldn't be noticeable. I/O is probably a larger concern.

althonos commented 4 months ago

In that case, to be honest, I think you can use the Bio.motifs module from Biopython -- it is functionally equivalent (same algorithm but implemented naively), and supports proteins for sure (with a little bit of configuration).
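
For readers wondering what "the same algorithm implemented naively" amounts to, here is a minimal sketch of position-specific scoring matrix (PSSM) scanning over a protein alphabet. It uses only the Python standard library and is neither Biopython's nor LightMotif's API: the function names (build_pssm, scan, best_hit), the uniform background, and the pseudocount of 0.1 are all assumptions made for illustration.

```python
import math

# The 20 proteinogenic amino acids. This alphabet string is an assumption
# of the sketch, not something taken from LightMotif or Biopython.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(instances, pseudocount=0.1):
    """Build a log-odds matrix (one dict per column) from aligned motif instances."""
    length = len(instances[0])
    if any(len(seq) != length for seq in instances):
        raise ValueError("all motif instances must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # uniform background (assumption)
    pssm = []
    for col in range(length):
        # Start every residue at the pseudocount so no frequency is zero.
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for seq in instances:
            counts[seq[col]] += 1.0
        total = sum(counts.values())
        pssm.append(
            {aa: math.log2(counts[aa] / total / background) for aa in AMINO_ACIDS}
        )
    return pssm


def scan(pssm, sequence):
    """Score every window of the sequence; unknown residues give -inf."""
    width = len(pssm)
    scores = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        scores.append(
            sum(column.get(aa, -math.inf) for column, aa in zip(pssm, window))
        )
    return scores


def best_hit(pssm, sequence):
    """Return (position, score) of the highest-scoring window."""
    scores = scan(pssm, sequence)
    if not scores:
        raise ValueError("sequence is shorter than the motif")
    position = max(range(len(scores)), key=scores.__getitem__)
    return position, scores[position]
```

Calling best_hit(build_pssm(["HEAGAW", "HEAGCW", "HDAGAW"]), translated_window) on a translated genomic window would return the offset and score of the best match. For a search space capped at a few thousand residues, a plain loop like this should be fast enough; there is no reverse-complement step because it has no meaning for proteins.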

althonos commented 2 months ago

Protein sequences are now supported in v0.9.0: just use lightmotif.create(..., protein=True) and lightmotif.stripe(..., protein=True) to treat the input as protein sequences.

Motif finding in protein sequences. #5