althonos / lightmotif

A lightweight platform-accelerated library for biological motif scanning using position weight matrices.
MIT License
41 stars 2 forks source link

Invalid background frequencies #7

Open hannes-brt opened 4 weeks ago

hannes-brt commented 4 weeks ago

Hi,

I'm trying to create PSSMs for proteins with background frequencies.

I make extra sure that my background distribution sums up to 1.0 with:

background_freq = {
    "A": 0.0825,
    "R": 0.0553,
    "N": 0.0406,
    "D": 0.0545,
    "C": 0.0137,
    "Q": 0.0393,
    "E": 0.0675,
    "G": 0.0707,
    "H": 0.0227,
    "I": 0.0596,
    "L": 0.0966,
    "K": 0.0584,
    "M": 0.0242,
    "F": 0.0386,
    "P": 0.0470,
    "S": 0.0656,
    "T": 0.0534,
    "W": 0.0108,
    "Y": 0.0292,
    "V": 0.0687,
    "X": 0.0
}
s = sum(background_freq.values())
background_freq = {aa: f / s for aa, f in background_freq.items()}
background_freq["A"] += 1. - sum(background_freq.values())
assert sum(background_freq.values()) == 1.0

Still, when I try to get a PSSM with this I get a ValueError every time:

>>> pssm = pcm.normalize(pseudocount=0.25).log_odds(background=background_freq)
ValueError: Invalid background frequencies

I've traversed the documentation and looked for usage examples in the unit tests, but I couldn't find anything to help me understand how to correctly pass the background distribution.

Many thanks for the help.

althonos commented 4 weeks ago

Hmm I am being a little bit too stringent with the background: if you check s you'll see it's equal to 0.9989 and not 1, that's why you're getting the error... I was thinking of removing this check eventually though :thinking:

hannes-brt commented 4 weeks ago

Update:

If I create the background as follows, it works:

background_freq = {
    "A": 0.0825,
    "R": 0.0553,
    "N": 0.0406,
    "D": 0.0545,
    "C": 0.0137,
    "Q": 0.0393,
    "E": 0.0675,
    "G": 0.0707,
    "H": 0.0227,
    "I": 0.0596,
    "L": 0.0966,
    "K": 0.0584,
    "M": 0.0242,
    "F": 0.0386,
    "P": 0.0470,
    "S": 0.0656,
    "T": 0.0534,
    "W": 0.0108,
    "Y": 0.0292,
    "V": 0.0687,
}
background_freq["X"] = 1. - sum(background_freq.values())

It looks like values are converted to float32 on the Rust side and then there are errors when using an equality check between two floats in https://github.com/althonos/lightmotif/blob/57aa2862070d64c73a13e3a85fa7e380451e1287/lightmotif/src/abc.rs#L338

hannes-brt commented 4 weeks ago

I think the issue here is that even when the Python values sum up to exactly 1.0 in float64, they no longer do in float32.