Closed yuji3w closed 4 years ago
OK, thanks for submitting this - and for bearing with me while I got around to investigating!
Short answer: you need to use --generate-uniform
(which puts a uniform probability distribution over the flanking symbols) instead of --generate-wild
(which does not). I find that it works correctly if you make this switch.
Longer answer: What --counts
and --count-copies
actually do is give you the expected number of times (in the statistical sense, considering the posterior probability distribution over paths through the machine, conditioned on the observed inputs & outputs) that the machine used the given sub-machine (i.e. the motif you're trying to count). This answer only makes sense if paths through the machine are probabilistically normalized, so that the motif being generated by the sub-machine (at essentially zero information cost, since it's hard-coded into the machine) is a more plausible/likely explanation of the motif than it being generated by the flanking states (where each nucleotide pays a 2-bit cost). With --generate-wild
you are not penalizing alignment of sequence the flanking states -- there is no 2-bit cost for each nucleotide that aligns to the flanking states. So in principle the entire sequence can be aligned to the flanking states and this is as entirely "plausible" an explanation of the sequence as any other. Statistically, then, it's only a 50/50 chance that the motif was generated by the sub-machine, vs being generated by the flanking states. That is what is going on here, I think. (It also explains why for very short motifs like TTTT
the count is still not exactly 1, it's 1/(1 + 4^(-K)) where K is the number of nucleotides in the motif, e.g. for TTTT
we have K=4 and the answer given by the following command is 0.996109 instead of 1
boss --counts -v6 \ --generate-uniform ACGT \ --concat \ --begin \ --generate-chars TTTT \ --concat --generate-uniform ACGT \ --count-copies n \ --end \ --recognize-chars TTTT
When I run this command:
boss --counts -v6 \ --generate-wild ACGT \ --concat \ --begin \ --generate-chars TTTT \ --concat --generate-wild ACGT \ --count-copies n \ --end \ --recognize-fasta test.fasta
on the following string in test.fasta:
AGCTTTTCATTCTGTTTTACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC
the variable n is half of what I expect. Is this expected behavior?