Jumitti / TFinder

Python script to quickly extract promoter and terminator regions with the NCBI API. Search for the presence of individual pattern or transcription factor responsive elements with manual sequence (IUPAC) or JASPAR API.
https://tfinder-ipmc.streamlit.app/
MIT License

PWM detection issue? [HELP] [QUESTION] #41

Open EdPym opened 11 months ago

EdPym commented 11 months ago

Describe the bug: Using the Individual Motif Finder, TFinder appears to detect binding sites that don't match the PWM.

Here are two results from a search:

1518 | aatAAATCAGAGCTAaag | 0.769912 | + | → | n.d. | n.d. | n.d.
463 | gtcAAACTAAAGGACcgg | 0.769912 | + | → | n.d. | n.d. | n.d.

The G (7th position) and C (10th position) are absolutely required in the PWM, so I'm not sure why site 463 is found.
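To make the mismatch concrete, here is a minimal sketch that checks the two reported sites against the two positions described above as required. The `REQUIRED` mapping is taken from the sentence above, not read from the JASPAR matrix itself, and the helper names `core` and `violations` are hypothetical, not part of TFinder.

```python
# Hypothetical helper: check a candidate site against the positions the
# issue text describes as mandatory (1-based, counted on the uppercase core).
REQUIRED = {7: "G", 10: "C"}  # assumption: taken from the issue text only


def core(site: str) -> str:
    """Return only the uppercase core, dropping lowercase flanking bases."""
    return "".join(ch for ch in site if ch.isupper())


def violations(site: str, required: dict[int, str] = REQUIRED) -> dict[int, str]:
    """Map each violated 1-based position to the base actually found there."""
    motif = core(site)
    return {
        pos: motif[pos - 1]
        for pos, base in required.items()
        if motif[pos - 1] != base
    }


for position, site in [(1518, "aatAAATCAGAGCTAaag"), (463, "gtcAAACTAAAGGACcgg")]:
    print(position, core(site), violations(site))
```

Site 1518 has no violations, while site 463 carries an A at position 7 and a G at position 10, even though both sites report the same score.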

PWM = MA0451.1

Seq atatcccaaggccgcaaagtcaacaagtcggcagcaaatttccctttgtccggcgatgtgttttttttttagccataactcgctgcattgtttgggccaagtttttcttctgccaaattgcggagatgatgcggggattatgcgctgattgcgtgcaattatggacatcctgcgaggccccgaggaacttcctgctaaatcctttcatccgcctacagaacccctttgtgtcccgttcgccgggagtccttgacgggtccttcgactattcgcttacagcagcttgcgtaaaatttcataaccctacgagcggctcttccgcggaatccctggcattatcctttttacctcttgccaatccgttggctaaaaaacggcttcgacttccgcgtaactgctggacaacaaagacaaaaaacggcgaaaggacggcgatttccaggtagcattgcgaattccgtcaaactaaaggaccggttatataacgggtttatatggccagaatctctgcatctccacgaccgccagaagctgcgtaaaactgcaggctctgttttgatttctgcaacttcagttaattgcccgggatggccagcaattgccggcaattataaaacagcgcagatgtgactcagcttccatatctaactctatatctcatgccgaaaatcGagggtggggagcggaggggcggggtgcgtgggtgacttgcctgccagggaaagggggcgggggttcagcgggtgataaatgtgcgtgatttggaatgaatgcgcatcgattaaaaccgcagggcaatcaatttagcgccttttacgccaaattggctcgtacacaaccaattaatgtcagcgggtgaactgacaccatcgcccaccaccgcatcccccttCcccctgttggccatccacccccgaaaaacaattacaacaacgaagacaagcagagggactgctgcagattccgctcaataaacctccaataaagcgaatccagcgtgaggcgtcgacgtctaattgctgttaactcgtcaactaggagaacgctccatcctcgccgttgtgcggctccttggacgcctgattaaacggattggagatgcgaggtgtacagtcgagcctccgtaagggcaaccaaaagtaaaaaacatcgactatttgaaatacaaagttttatatgtacatataatttatcaggctccggatgtaacttaattaaaacatttccttttcataaaatattgctagctgatagctgctcaaaagaacaataaaggtaataaattatgtttgcttgcaaacaattttcaatcaaaaaagtatgcgttccatcttagttaataattaattacctggataaagacttttgaaacatatcatagcgtttctttgcatattcaatactaaccaattttttataaatgAagttacaccgtttgtcgtcttgtcaagtagtatcttcacaataagtataatacagaatcaagatagtaaaataaaacaaaaaaCcgtgtgaataaatcagagctaaagacgtcggac

Jumitti commented 11 months ago

Hi @EdPym

Yes, you are totally right. But this is not a bug. The Rel Score is calculated like this:

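$$\text{Relative Score} = \frac{\text{Score of the element found} - \text{Minimum score of the reference matrix}}{\text{Maximum score of the reference matrix} - \text{Minimum score of the reference matrix}}$$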

This means that, roughly, I sum the probabilities of the nucleotide observed at each position. So yes, for the same score, some of the patterns found are less relevant than others, or are even false positives. This is a very good example.

With a simpler PWM:

A [ 1.0000 0.5000 0.5000 0.0000 0.0000 ]
T [ 0.0000 0.0000 0.0000 0.0000 0.0000 ]
G [ 0.0000 0.5000 0.5000 1.0000 0.0000 ]
C [ 0.0000 0.0000 0.0000 0.0000 1.0000 ]

For AGGAC: 1 + 0.5 + 0.5 + 0 + 1 = 3
For ATTGC: 1 + 0 + 0 + 1 + 1 = 3

In this example, let's assume that A at position 1 and G at position 4 are obligatory. You can see that AGGAC has no G at position 4, yet its score is 3. In ATTGC, both A and G are in place, but the score is also 3.
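The arithmetic above can be reproduced with a short sketch. It assumes the relative score is (score - minimum) / (maximum - minimum), where the minimum and maximum are the sums of the per-position minima and maxima of the matrix, as the equation linked in this thread describes. The names `raw_score` and `rel_score` are illustrative and are not TFinder's actual functions.

```python
# Toy PWM from the comment above: one probability per position for each base.
PWM = {
    "A": [1.0, 0.5, 0.5, 0.0, 0.0],
    "T": [0.0, 0.0, 0.0, 0.0, 0.0],
    "G": [0.0, 0.5, 0.5, 1.0, 0.0],
    "C": [0.0, 0.0, 0.0, 0.0, 1.0],
}


def raw_score(site, pwm=PWM):
    """Sum the probability of the observed base at each position."""
    return sum(pwm[base][i] for i, base in enumerate(site.upper()))


def rel_score(site, pwm=PWM):
    """Rescale the raw score between the matrix minimum and maximum."""
    columns = list(zip(*pwm.values()))      # one tuple of 4 values per position
    lowest = sum(min(col) for col in columns)
    highest = sum(max(col) for col in columns)
    return (raw_score(site, pwm) - lowest) / (highest - lowest)


for site in ("AGGAC", "ATTGC"):
    print(site, raw_score(site), rel_score(site))
```

Both sites get a raw score of 3.0 and a relative score of 0.75, because a plain sum cannot tell which positions contributed to the total.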

I'm working on a way to discriminate this more effectively. I created an LCS option. It lets you look at the number of similar consecutive nucleotides between the pattern found and the PWM. It requires a lot of resources, so it may crash the software. I am also working on a standalone version, which will allow us to get rid of Streamlit and have good computing power. But for your example it works, and you will see that, ultimately, you may have other, more interesting targets. In your example:

Position | Sequence | RelScore | LCS | LCS length | LCS RelScore | Strand | Direction
-- | -- | -- | -- | -- | -- | -- | --
463 | gtcAAACTAAAGGACcgg | 0.769912 | TAAAGG | 6 | 0.300885 | + | →
1518 | aatAAATCAGAGCTAaag | 0.769912 | AAATCAGAGC | 10 | 0.747788 | + | →

It is important to understand that the RelScore is a global score. The LCS also calculates a RelScore, but only on the retained part, so the LCS gives a local score.
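As a rough illustration of a local measure, the sketch below finds the longest run of consecutive positions where a site carries one of the top-scoring bases of the toy PWM. This is only one possible reading of "similar consecutive nucleotides"; it is an assumption made for illustration and is not TFinder's LCS implementation, whose details are not given in this thread. The names `best_bases` and `longest_matching_run` are hypothetical.

```python
# Toy PWM from earlier in the thread (assumption: probabilities per position).
PWM = {
    "A": [1.0, 0.5, 0.5, 0.0, 0.0],
    "T": [0.0, 0.0, 0.0, 0.0, 0.0],
    "G": [0.0, 0.5, 0.5, 1.0, 0.0],
    "C": [0.0, 0.0, 0.0, 0.0, 1.0],
}


def best_bases(pwm, i):
    """Return the set of bases that share the highest value at position i."""
    top = max(row[i] for row in pwm.values())
    return {base for base, row in pwm.items() if row[i] == top}


def longest_matching_run(site, pwm=PWM):
    """Return (start, length) of the longest run of top-scoring positions."""
    best_start, best_len, run_start = 0, 0, None
    for i, base in enumerate(site.upper()):
        if base in best_bases(pwm, i):
            if run_start is None:
                run_start = i               # a new run begins here
            if i - run_start + 1 > best_len:
                best_start, best_len = run_start, i - run_start + 1
        else:
            run_start = None                # mismatch ends the current run
    return best_start, best_len


for site in ("AGGAC", "ATTGC"):
    start, length = longest_matching_run(site)
    print(site, site[start:start + length], length)
```

Under this reading, the two toy sites that tie on the global score get different run lengths (3 for AGGAC, 2 for ATTGC), which shows how a local measure can separate sites that a summed score cannot.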

EdPym commented 11 months ago

I just tried to access the alpha, and it says I don't have access.


Ed


From: Minniti Julien
Sent: Tuesday, November 7, 2023 2:02 PM
To: Jumitti/TFinder
Cc: Pym, Edward Charles Garswood; Mention
Subject: Re: [Jumitti/TFinder] PWM detection issue? [BUG] (Issue #41)


I'm working on a way to discriminate this more effectively. Here is the alpha version of TFinder which allows you to do this: https://jumitti-tfinder-tfinder-v1-alpha-8dr0yn.streamlit.app/

You must check the LCS option.
