betteridiot / seqlogo

Python port of the R Bioconductor `seqLogo` package
BSD 3-Clause "New" or "Revised" License
31 stars 3 forks source link

All or some PPM columns do not add to 1 #16

Open Mitmischer opened 4 months ago

Mitmischer commented 4 months ago

Describe the bug The program will fail on imperfect input data.

To Reproduce Steps to reproduce the behavior: 1) Read the following probability matrices: consensus_pwms_stripped.txt For example, consider adapting the code here (the CLI should be easy to remove)

import cloup
import re
import seqlogo
import numpy as np
from Bio import motifs
import seqlogo
import re

def parse_meme(content):
    mn1 = None
    mn2 = None        

    matches = []

    i = -1
    weights = []
    for line in content.split("\n"):
        if i >= 0:
            i = i+1
        z = re.match("MOTIF (AC.*)\:(.+)\:(.+)\s", line)
        if z:
            i = 0
            mn0 = z.groups()[0]
            mn1 = z.groups()[1]
            mn2 = z.groups()[2]

            print(mn1)
            print(mn2)
        # read "number line"
        if i >=3:            
            weights.append([float(s.strip()) for s in line.split("\t") if s])
            # end of number block
            if not line:
                weights.pop()                
                weights = np.array(weights).transpose()
                print(weights.sum())
                print(weights)
                ppm = seqlogo.Ppm(weights)

                weights = []
                i = -1

@cloup.command()
@cloup.option("--input-jaspar", "-i", type=cloup.File("r"), required=True, help="input (jaspar format)")
def main(input_jaspar):
    parse_meme(input_jaspar.read())

if __name__ == "__main__":
    main()

The data itself is real(!) data, obtained from https://resources.altius.org/~jvierstra/projects/motif-clustering-v2.1beta/.

Expected behavior The tool should be more lenient. The problem is in core.py:71 Why would you ever check for 1e-9 with floating point numbers and the resulting imprecisions? 1e-2 or something similar is way more reasonable. As I understood, the check is just to make sure that the matrix is not in an completely wrong format (like transposed). This threshold still catches this case more than easily.

Desktop (please complete the following information):

betteridiot commented 4 months ago

Hello @Mitmischer,

Thank you for working with seqlogo.

I understand that some of the design choices can be frustrating for some users. Can you show me the workflow that you use that creates the error for you?

As an open-source project, I wholeheartedly welcome PRs so long as the code follows Google Python Style Guide. As stated on the README.md, this project does not support MEME format, but we welcome any additions.

That said, seqlogo was written to support BIOINF 529: Bioinformatics Concepts and Algorithms at the University of Michigan in the Department of Computational Medicine & Bioinformatics. It was intended as a purely educational tool for a very specific module of that course. That is why the rigid tolerance was set. We expected students to generate using the quickstart or similar means during their coursework. However, the tolerance could be set as an optional argument with a default value. That could allow unexpected end-users more flexibility.

If this does not necessarily fit your needs, I can recommend BioPython's Bio.motifs package for parsing motif data and Logomaker for a more robust plotting platform of sequence motifs. Both of these are better suited for production-level analysis. The tradeoff is "ease of use" for the end-user.