Simon-Coetzee / motifBreakR

A Package For Predicting The Disruptiveness Of Single Nucleotide Polymorphisms On Transcription Factor Binding Sites.
27 stars 12 forks source link

Odd Motif Positions #49

Open lotard opened 1 year ago

lotard commented 1 year ago

tl;dr: I find many motifPos that imply TF binds completely outside the variant range (start AND end are negative), but also it starts after it ends (end < start)? Also, motif length != end-start.

I got this result from running motifBreakeR on a list of variants that are 12bp substitution (a bit odd, maybe why this happens):

chr8    48582443    48582454    12  +   mutant  CTGTTGTTATTC    TCACCACCGCCT    Other   c(-6, -9)   MSANTD3 jaspar2022  MA1523.1    MA1523.1    gctctccacTCACCACCGCCTtggtcccaa  0.6356006   0.9892791   4.870919    7.465608    NA  NA  1:12    2.5946893   0.34392895  strong

Note motifPos: c(-6,9)

I think start is correct (in this and similar cases), but end should simply be start+motifLen.

It seems the bug lies here:

  if ((mresult$varType == "Insertion" & score < 0) |
      (mresult$varType == "Deletion" & score > 0)) {
    motif.end <- motif.start + len
  } else {
    if (motif.start > 0) {
      motif.end <- len - length(motif.start:length(alt_loc[1]:alt_loc[2]))
    } else {
      motif.end <- motif.start + len - length(alt_loc[1]:alt_loc[2])
    }
  }

I'm not completely sure what is calculated under else, but it results in this weird bug. First case (motif.start+len) would work well in here. I can't think of a case when, after establishing where motif starts in reference, the end wouldn't be motif.start+len. In any case may be worth throwing a warning if start>=end.

lotard commented 1 year ago

Now I'm not sure start is calculated correctly either.

E.g. Zfp410 motif below allegedly starts on the base before the variant and ends on position 15 (-1,15). This would make alt seq aTGGCCGAGATGGgat. But 1) motif length is 17bp, not 16bp and 2) best match for Zfp410 to alt is (5,21) CGAGATGGgatgatagg, which perfectly matches UniPROBE top k-mer ATGGGATG. So something is right, but something is wrong.

 seqnames    start      end width strand  SNP_id          REF          ALT varType motifPos geneSymbol
     chr8 48582575 48582586    12      + LC14t17 CAATTAGAGCAA TGGCCGAGATGG   Other   -1, 15     Zfp410
  dataSource                 providerName providerId
   UniPROBE SCI09/Zfp410_pwm_primary.txt    UP00033
                                                                    seqMatch    pctRef    pctAlt scoreRef
                   gaatctggaaaTGGCCGAGATGGgatgataggc                       0.5466476 0.9525102 6.257263
  scoreAlt Refpvalue Altpvalue altPos alleleDiff alleleEffectSize effect
 10.66372        NA        NA      6   4.406453        0.3941613 strong

image

Simon-Coetzee commented 1 year ago

Firstly, yes you have exposed a bug, particularly on the Zfp410 example.

Your first example, however, I believe is working as intended.

I think I have it confusingly labeled, however because of the conditions that I am trying to describe, I'm not sure if I could do it differently.

The first number is the number of bases upstream (negative) or downstream (positive) of the start of the variant describing where the motif starts. The second number is the number of bases upstream or downstream of the end of the variant, describing where the motif ends.

image

In this figure, the motif is represented by the pink rectangle, and the insertion by the green rectangle, and the motif position is recorded to the right.

In your first example c(-6, -9) the motif is six bp upstream from the start of the 12mer, and nine bp upstream from the end of the 12mer.

In your second example the two 12mers have a 3bp area of similarity and when I try to align the sequences to find out exactly where the motif is breaking the reference sequence, the alignment is produced like this:

Global PairwiseAlignmentsSingleSubject (1 of 1)
pattern: TGGCC-GAGATGG
subject: CAATTAGAG-CAA
score: -69.24901 

a deletion at 6bp and an insertion at 10bp, rather than failing as intended and marking all 12bp a substitution.

For this reason, its marking the position relative to this false 1bp deletion, rather than the 12bp substitution.

Sorry for the difficulty, I'll try get this sorted out ASAP.

lotard commented 1 year ago

Thanks for clarifying and sorry for bundling two issues in one! I see how this notation makes sense wrt indels. The figure you've attached does a great job of explaining that, could be useful having it in the vignette.

lotard commented 1 year ago

So it seems this is still not working as of motifbreakR_2.13.8. Note: same motif, different mutations (adjacent), same ref motif score and pctRef detected, but first one is 12bp long (7,6) and second is 23bp? (-10,1)

image

I've made a reprex, attached.

reprex_klf4.zip