evoldoers / machineboss

Bioinformatics Open Source Sequence machine
BSD 3-Clause "New" or "Revised" License
33 stars 7 forks source link

Counts returns half of expected value #108

Closed yuji3w closed 4 years ago

yuji3w commented 4 years ago

When I run this command:

boss --counts -v6 \ --generate-wild ACGT \ --concat \ --begin \ --generate-chars TTTT \ --concat --generate-wild ACGT \ --count-copies n \ --end \ --recognize-fasta test.fasta

on the following string in test.fasta:

AGCTTTTCATTCTGTTTTACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC

the variable n is half of what I expect. Is this expected behavior?

ihh commented 4 years ago

OK, thanks for submitting this - and for bearing with me while I got around to investigating! Short answer: you need to use --generate-uniform (which puts a uniform probability distribution over the flanking symbols) instead of --generate-wild (which does not). I find that it works correctly if you make this switch.

Longer answer: What --counts and --count-copies actually do is give you the expected number of times (in the statistical sense, considering the posterior probability distribution over paths through the machine, conditioned on the observed inputs & outputs) that the machine used the given sub-machine (i.e. the motif you're trying to count). This answer only makes sense if paths through the machine are probabilistically normalized, so that the motif being generated by the sub-machine (at essentially zero information cost, since it's hard-coded into the machine) is a more plausible/likely explanation of the motif than it being generated by the flanking states (where each nucleotide pays a 2-bit cost). With --generate-wild you are not penalizing alignment of sequence the flanking states  -- there is no 2-bit cost for each nucleotide that aligns to the flanking states. So in principle the entire sequence can be aligned to the flanking states and this is as entirely "plausible" an explanation of the sequence as any other. Statistically, then, it's only a 50/50 chance that the motif was generated by the sub-machine, vs being generated by the flanking states. That is what is going on here, I think. (It also explains why for very short motifs like TTTT the count is still not exactly 1, it's 1/(1 + 4^(-K)) where K is the number of nucleotides in the motif, e.g. for TTTT we have K=4 and the answer given by the following command is 0.996109 instead of 1

boss --counts -v6 \
 --generate-uniform ACGT \
 --concat \
  --begin \
   --generate-chars TTTT \
   --concat --generate-uniform ACGT \
   --count-copies n \
  --end \
  --recognize-chars TTTT