Open EdPym opened 11 months ago
Hi @EdPym
Yes, you are totally right. But this is not a bug. The Rel Score is calculated like this:
This means that, roughly, I sum the probabilities of the nucleotide found at each position. So yes, for the same score, some of the patterns found are less relevant than others, or even false. And this is a very good example.
With a simpler PWM:

A [ 1.0000 0.5000 0.5000 0.0000 0.0000 ]
T [ 0.0000 0.0000 0.0000 0.0000 0.0000 ]
G [ 0.0000 0.5000 0.5000 1.0000 0.0000 ]
C [ 0.0000 0.0000 0.0000 0.0000 1.0000 ]

For AGGAC: 1 + 0.5 + 0.5 + 0 + 1 = 3
For ATTGC: 1 + 0 + 0 + 1 + 1 = 3
In this example, let's assume that A at position 1 and G at position 4 are obligatory. For AGGAC, there is no G at position 4, yet the score is equal to 3. In ATTGC, both A and G are in place, but the score is also equal to 3.
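To make the arithmetic above easy to reproduce, here is a minimal Python sketch of the toy example. It is not TFinder's code: `raw_score` only sums the PWM entries as described, and `violated_obligatory` is a hypothetical helper that treats any column containing a probability of 1.0 as obligatory, which is an assumption made for illustration.

```python
# Toy PWM from the example: one row per nucleotide, one column per position.
PWM = {
    "A": [1.0, 0.5, 0.5, 0.0, 0.0],
    "T": [0.0, 0.0, 0.0, 0.0, 0.0],
    "G": [0.0, 0.5, 0.5, 1.0, 0.0],
    "C": [0.0, 0.0, 0.0, 0.0, 1.0],
}


def raw_score(site, pwm=PWM):
    """Sum the PWM probability of the nucleotide observed at each position."""
    return sum(pwm[nt][i] for i, nt in enumerate(site.upper()))


def violated_obligatory(site, pwm=PWM):
    """Return 1-based positions where the PWM requires one nucleotide
    (probability 1.0, an assumption of this sketch) and the site has another."""
    violated = []
    for i, nt in enumerate(site.upper()):
        required = [n for n, row in pwm.items() if row[i] == 1.0]
        if required and nt not in required:
            violated.append(i + 1)
    return violated


for site in ("AGGAC", "ATTGC"):
    print(site, raw_score(site), violated_obligatory(site))
# AGGAC 3.0 [4]
# ATTGC 3.0 []
```

Both sites get the same summed score, but only AGGAC breaks an obligatory position, which is the situation described above.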
I'm working on a way to discriminate between these cases more effectively. I have created an LCS option. It lets you look at the number of similar consecutive nucleotides between the pattern found and the PWM. It requires a lot of resources, so it may crash the software. I am also working on a standalone version, which will allow us to get rid of Streamlit and have good computing power. But for your example it works, and you will see that, ultimately, you may have other, more interesting targets. In your example:
| Position | Sequence | RelScore | LCS | LCS length | LCS RelScore | Strand | Direction |
| -- | -- | -- | -- | -- | -- | -- | -- |
| 463 | gtcAAACTAAAGGACcgg | 0.769912 | TAAAGG | 6 | 0.300885 | + | → |
| 1518 | aatAAATCAGAGCTAaag | 0.769912 | AAATCAGAGC | 10 | 0.747788 | + | → |

It is important to understand that the RelScore is a global score. The LCS option also calculates a RelScore, but only on the retained part, so the LCS gives a local score.
I just tried to access the alpha, and it says I don't have access.
Ed
From: Minniti Julien @.> Sent: Tuesday, November 7, 2023 2:02 PM To: Jumitti/TFinder @.> Cc: Pym, Edward Charles Garswood @.>; Mention @.> Subject: Re: [Jumitti/TFinder] PWM detection issue? [BUG] (Issue #41)
I'm working on a way to discriminate this more effectively. Here is the alpha version of TFinder which allows you to do this: https://jumitti-tfinder-tfinder-v1-alpha-8dr0yn.streamlit.app/
You must check the LCS option. It lets you look at the number of similar consecutive nucleotides between the pattern found and the PWM. It requires a lot of resources, so it may crash the software. I am also working on a standalone version, which will allow us to get rid of Streamlit and have good computing power. But for your example it works, and you will see that, ultimately, you may have other, more interesting targets. In your example:
| Position | Sequence | RelScore | LCS | LCS length | LCS RelScore | Strand | Direction |
| -- | -- | -- | -- | -- | -- | -- | -- |
| 463 | gtcAAACTAAAGGACcgg | 0.769912 | TAAAGG | 6 | 0.300885 | + | → |
| 1518 | aatAAATCAGAGCTAaag | 0.769912 | AAATCAGAGC | 10 | 0.747788 | + | → |
It is important to understand that the RelScore is a global score. The LCS also calculates a RelScore but only on the retained part. So the LCS does a local score.
Describe the bug
Using the individual motif finder, it appears to detect binding sites that don't match the PWM.
Here are two results from a search.

1518 | aatAAATCAGAGCTAaag | 0.769912 | + | → | n.d. | n.d | n.d
463 | gtcAAACTAAAGGACcgg | 0.769912 | + | → | n.d. | n.d | n.d
The G (7th position) and C (10th position) are absolutely required in the PWM, so I'm not sure why site 463 is found.
PWM = MA0451.1