Interpreting the score for a sequence scan with searchSeq

SalvatoreRa commented 6 years ago

Hi,

I scanned a nucleotide sequence with a PWM pattern and I obtained the score and the p-values.

I copied this example, changing only the transcription factor and the DNA sequence:

example

library(TFBSTools)
library(Biostrings)
data(MA0004.1)
pwm <- toPWM(MA0004.1)
subject <- DNAString("GAATTCTCTCTTGTTGTAGTCTCTTGACAAAATG")
siteset <- searchSeq(pwm, subject, seqname="seq1", min.score="60%", strand="*")

I calculated the score

head(writeGFF3(siteset))
relScore(siteset)

I calculated the p-values

pvalues(siteset, type="TFMPvalue")
pvalues(siteset, type="sampling")

I obtained around 50 matches with scores ranging from -7 to 5. How should I interpret this score and choose the best matches? Is the best score the highest one or the most negative one? Is there a threshold to consider, or should I use the relative score instead? And which of the two p-value methods do you consider the most accurate?

Thank you for your help,

Salvo

ge11232002 commented 6 years ago

Hi Salvo,

The higher the absolute score in siteset, the better the match. The relative score is calculated as the absolute score divided by the maximal score from the profile. The min.score argument sets the minimal relative score for the results.
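To make the relationship between the two scores concrete, here is a minimal base-R sketch. It assumes a made-up log-odds matrix and follows the definition given in this reply (absolute score divided by the maximal score). The helper names `abs_score`, `max_score` and `scan_sequence` are hypothetical and not part of TFBSTools, and the package's own relScore() may normalize differently, so treat this only as an illustration of the idea.

```r
# Toy log-odds PWM: rows are bases, columns are motif positions.
# The numbers are invented for illustration only.
pwm <- matrix(
  c( 1.2, -1.5, -0.8,  0.9,
    -0.7,  1.1, -1.2, -0.5,
    -1.0, -0.6,  1.3, -0.9,
    -0.4, -0.3, -0.6,  0.2),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("A", "C", "G", "T"), NULL)
)

# Absolute score of one window: sum the matrix entry of each observed base.
abs_score <- function(pwm, site) {
  bases <- strsplit(site, "")[[1]]
  stopifnot(length(bases) == ncol(pwm))
  sum(pwm[cbind(match(bases, rownames(pwm)), seq_along(bases))])
}

# Best score the profile can give: the largest entry in every column.
max_score <- function(pwm) sum(apply(pwm, 2, max))

# Score every window of the sequence (forward strand only in this sketch).
scan_sequence <- function(pwm, sequence) {
  w <- ncol(pwm)
  starts <- seq_len(nchar(sequence) - w + 1)
  sites <- substring(sequence, starts, starts + w - 1)
  scores <- vapply(sites, function(s) abs_score(pwm, s), numeric(1),
                   USE.NAMES = FALSE)
  data.frame(start = starts, site = sites, abs_score = scores,
             rel_score = scores / max_score(pwm),
             stringsAsFactors = FALSE)
}

hits <- scan_sequence(pwm, "GAATTCACGATT")
# Keep windows at or above a 60% relative score, best match first.
kept <- hits[hits$rel_score >= 0.6, ]
ranked <- kept[order(-kept$abs_score), ]
print(ranked)
```

Sorting by decreasing absolute score puts the best match first, and a negative score simply means the window matches the profile worse than the background does.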

The p-value from TFMPvalue is more accurate but can be slower to compute in some cases.
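As a rough illustration of why a sampling-based p-value is only an estimate, the sketch below scores random background sites against a toy matrix and reports the fraction that score at least as high as the observed site. This is an assumption-laden sketch and not the implementation behind pvalues(): the matrix values, the uniform background, and the helper names `score_site` and `empirical_pvalue` are all hypothetical.

```r
set.seed(1)  # make the random draws reproducible

# Toy log-odds PWM with invented values (rows = bases, columns = positions).
pwm <- matrix(
  c( 1.2, -1.5, -0.8,  0.9,
    -0.7,  1.1, -1.2, -0.5,
    -1.0, -0.6,  1.3, -0.9,
    -0.4, -0.3, -0.6,  0.2),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("A", "C", "G", "T"), NULL)
)

# Score a site given as a character vector of bases.
score_site <- function(pwm, bases) {
  sum(pwm[cbind(match(bases, rownames(pwm)), seq_along(bases))])
}

# Empirical p-value: share of random background sites scoring >= observed.
# The +1 terms keep the estimate away from exactly zero.
empirical_pvalue <- function(pwm, observed, n = 10000,
                             bg = c(A = 0.25, C = 0.25, G = 0.25, T = 0.25)) {
  w <- ncol(pwm)
  draws <- matrix(sample(names(bg), n * w, replace = TRUE, prob = bg),
                  nrow = n)
  null_scores <- apply(draws, 1, function(b) score_site(pwm, b))
  (sum(null_scores >= observed) + 1) / (n + 1)
}

best <- score_site(pwm, c("A", "C", "G", "A"))  # highest-scoring site
weak <- score_site(pwm, c("T", "T", "T", "T"))  # a poorly matching site
p_best <- empirical_pvalue(pwm, best)
p_weak <- empirical_pvalue(pwm, weak)
print(c(best = p_best, weak = p_weak))
```

The estimate changes with the random seed and the number of draws, which is the trade-off against an exact method: more draws give a steadier value at the cost of more computation.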

For more information, please refer to

Wasserman, W. W., & Sandelin, A. (2004). Applied bioinformatics
for the identification of regulatory elements. Nature Reviews
Genetics, 5(4), 276-287. doi:10.1038/nrg1315

Ge

SalvatoreRa commented 6 years ago

Thank you very much

ge11232002 / TFBSTools