Open ppgardne opened 2 years ago
Hmm, it's definitely a real issue. The markov method has more issues than just that though, since this was just a playing grounds for the final working implementation that ended up in my more serious project (bjmt/universalmotif). I would recommend just not using the method frankly, and I've been considering just deleting it (I don't think it's worth trying to patch it up).
But I suppose I could instead get around to simply transplanting the functioning version into shuffler
. If I get some free time this weekend I'll try and do it.
Thanks for letting me know!
I thought your Markov method was doing reasonably well -- at least for low values of k
. Below are relative entropies for k-mer frequencies for k=1, 2, & 3 (length in the first column). The (near) zeros are from k-mer counts at or near the expected frequency from the native sequence. Things go a bit skew-if for k=6, which admittedly is likely to be a challenge for any method.
AB370334.1.shuffler.eulerian1
shuffler1 0:len:1815/1815:0.00000 1:0.00000 2:0.02125 3:0.06891
shuffler2 0:len:1815/1815:0.00000 1:0.00000 2:0.02496 3:0.07848
shuffler3 0:len:1815/1815:0.00000 1:0.00000 2:0.02436 3:0.07976
shuffler4 0:len:1815/1815:0.00000 1:0.00000 2:0.03701 3:0.11123
shuffler5 0:len:1815/1815:0.00000 1:0.00000 2:0.02810 3:0.08294
shuffler6 0:len:1815/1815:0.00000 1:0.00000 2:0.01717 3:0.07270
shuffler7 0:len:1815/1815:0.00000 1:0.00000 2:0.02054 3:0.07847
shuffler8 0:len:1815/1815:0.00000 1:0.00000 2:0.02065 3:0.08219
shuffler9 0:len:1815/1815:0.00000 1:0.00000 2:0.01972 3:0.06643
shuffler10 0:len:1815/1815:0.00000 1:0.00000 2:0.03382 3:0.10349
shuffler11 0:len:1815/1815:0.00000 1:0.00000 2:0.02294 3:0.08499
AB370334.1.shuffler.eulerian2
shuffler1 0:len:1815/1815:0.00000 1:0.00000 2:0.00000 3:0.03314
shuffler2 0:len:1815/1815:0.00000 1:0.00000 2:0.00000 3:0.05480
shuffler3 0:len:1815/1815:0.00000 1:0.00000 2:0.00000 3:0.03681
shuffler4 0:len:1815/1815:0.00000 1:0.00000 2:0.00000 3:0.03181
shuffler5 0:len:1815/1815:0.00000 1:0.00000 2:0.00000 3:0.03926
shuffler6 0:len:1815/1815:0.00000 1:0.00000 2:0.00000 3:0.02567
shuffler7 0:len:1815/1815:0.00000 1:0.00000 2:0.00000 3:0.03918
shuffler8 0:len:1815/1815:0.00000 1:0.00000 2:0.00000 3:0.02174
shuffler9 0:len:1815/1815:0.00000 1:0.00000 2:0.00000 3:0.02583
shuffler10 0:len:1815/1815:0.00000 1:0.00000 2:0.00000 3:0.03272
shuffler11 0:len:1815/1815:0.00000 1:0.00000 2:0.00000 3:0.02771
AB370334.1.shuffler.eulerian6
shuffler1 0:len:1815/1815:0.00000 1:0.00000 2:0.00000 3:0.00000
shuffler2 0:len:1815/1815:0.00000 1:0.00000 2:0.00000 3:0.00000
shuffler3 0:len:1815/1815:0.00000 1:0.00000 2:0.00000 3:0.00000
shuffler4 0:len:1815/1815:0.00000 1:0.00000 2:0.00000 3:0.00000
shuffler5 0:len:1815/1815:0.00000 1:0.00000 2:0.00000 3:0.00000
shuffler6 0:len:1815/1815:0.00000 1:0.00000 2:0.00000 3:0.00000
shuffler7 0:len:1815/1815:0.00000 1:0.00000 2:0.00000 3:0.00000
shuffler8 0:len:1815/1815:0.00000 1:0.00000 2:0.00000 3:0.00000
shuffler9 0:len:1815/1815:0.00000 1:0.00000 2:0.00000 3:0.00000
shuffler10 0:len:1815/1815:0.00000 1:0.00000 2:0.00000 3:0.00000
shuffler11 0:len:1815/1815:0.00000 1:0.00000 2:0.00000 3:0.00000
AB370334.1.shuffler.linear1
shuffler1 0:len:1815/1815:0.00000 1:0.00000 2:0.02125 3:0.06891
shuffler2 0:len:1815/1815:0.00000 1:0.00000 2:0.02496 3:0.07848
shuffler3 0:len:1815/1815:0.00000 1:0.00000 2:0.02436 3:0.07976
shuffler4 0:len:1815/1815:0.00000 1:0.00000 2:0.03701 3:0.11123
shuffler5 0:len:1815/1815:0.00000 1:0.00000 2:0.02810 3:0.08294
shuffler6 0:len:1815/1815:0.00000 1:0.00000 2:0.01717 3:0.07270
shuffler7 0:len:1815/1815:0.00000 1:0.00000 2:0.02054 3:0.07847
shuffler8 0:len:1815/1815:0.00000 1:0.00000 2:0.02065 3:0.08219
shuffler9 0:len:1815/1815:0.00000 1:0.00000 2:0.01972 3:0.06643
shuffler10 0:len:1815/1815:0.00000 1:0.00000 2:0.03382 3:0.10349
shuffler11 0:len:1815/1815:0.00000 1:0.00000 2:0.02294 3:0.08499
AB370334.1.shuffler.linear2
shuffler1 0:len:1815/1815:0.00000 1:0.00000 2:0.00574 3:0.03757
shuffler2 0:len:1815/1815:0.00000 1:0.00000 2:0.00524 3:0.04548
shuffler3 0:len:1815/1815:0.00000 1:0.00000 2:0.00857 3:0.05018
shuffler4 0:len:1815/1815:0.00000 1:0.00000 2:0.00857 3:0.06539
shuffler5 0:len:1815/1815:0.00000 1:0.00000 2:0.01004 3:0.04764
shuffler6 0:len:1815/1815:0.00000 1:0.00000 2:0.00658 3:0.04822
shuffler7 0:len:1815/1815:0.00000 1:0.00000 2:0.00709 3:0.04325
shuffler8 0:len:1815/1815:0.00000 1:0.00000 2:0.00584 3:0.05383
shuffler9 0:len:1815/1815:0.00000 1:0.00000 2:0.00389 3:0.05175
shuffler10 0:len:1815/1815:0.00000 1:0.00000 2:0.00432 3:0.03693
shuffler11 0:len:1815/1815:0.00000 1:0.00000 2:0.00566 3:0.05194
AB370334.1.shuffler.linear6
shuffler1 0:len:1815/1815:0.00000 1:0.00000 2:0.00129 3:0.01620
shuffler2 0:len:1815/1815:0.00000 1:0.00000 2:0.00133 3:0.01503
shuffler3 0:len:1815/1815:0.00000 1:0.00000 2:0.00093 3:0.01174
shuffler4 0:len:1815/1815:0.00000 1:0.00000 2:0.00204 3:0.01839
shuffler5 0:len:1815/1815:0.00000 1:0.00000 2:0.00242 3:0.01646
shuffler6 0:len:1815/1815:0.00000 1:0.00000 2:0.00176 3:0.01557
shuffler7 0:len:1815/1815:0.00000 1:0.00000 2:0.00069 3:0.01191
shuffler8 0:len:1815/1815:0.00000 1:0.00000 2:0.00229 3:0.01927
shuffler9 0:len:1815/1815:0.00000 1:0.00000 2:0.00088 3:0.01575
shuffler10 0:len:1815/1815:0.00000 1:0.00000 2:0.00249 3:0.01862
shuffler11 0:len:1815/1815:0.00000 1:0.00000 2:0.00044 3:0.01573
AB370334.1.shuffler.markov1
shuffler1 0:len:1815/1815:0.00000 1:0.00000 2:0.02125 3:0.06891
shuffler2 0:len:1815/1815:0.00000 1:0.00000 2:0.02496 3:0.07848
shuffler3 0:len:1815/1815:0.00000 1:0.00000 2:0.02436 3:0.07976
shuffler4 0:len:1815/1815:0.00000 1:0.00000 2:0.03701 3:0.11123
shuffler5 0:len:1815/1815:0.00000 1:0.00000 2:0.02810 3:0.08294
shuffler6 0:len:1815/1815:0.00000 1:0.00000 2:0.01717 3:0.07270
shuffler7 0:len:1815/1815:0.00000 1:0.00000 2:0.02054 3:0.07847
shuffler8 0:len:1815/1815:0.00000 1:0.00000 2:0.02065 3:0.08219
shuffler9 0:len:1815/1815:0.00000 1:0.00000 2:0.01972 3:0.06643
shuffler10 0:len:1815/1815:0.00000 1:0.00000 2:0.03382 3:0.10349
shuffler11 0:len:1815/1815:0.00000 1:0.00000 2:0.02294 3:0.08499
AB370334.1.shuffler.markov2
shuffler1 0:len:1815/1814:-0.00596 1:0.00241 2:0.00858 3:0.05670
shuffler2 0:len:1815/1814:-0.00596 1:0.00030 2:0.00468 3:0.03325
shuffler3 0:len:1815/1814:-0.00596 1:0.00534 2:0.01430 3:0.06311
shuffler4 0:len:1815/1814:-0.00596 1:0.00110 2:0.00420 3:0.04075
shuffler5 0:len:1815/1814:-0.00596 1:0.00131 2:0.00670 3:0.04310
shuffler6 0:len:1815/1814:-0.00596 1:0.00020 2:0.00266 3:0.04376
shuffler7 0:len:1815/1814:-0.00596 1:0.00172 2:0.00550 3:0.04951
shuffler8 0:len:1815/1814:-0.00596 1:0.00064 2:0.00436 3:0.05107
shuffler9 0:len:1815/1814:-0.00596 1:0.00306 2:0.00809 3:0.04997
shuffler10 0:len:1815/1814:-0.00596 1:0.00050 2:0.00411 3:0.04215
shuffler11 0:len:1815/1814:-0.00596 1:0.00093 2:0.00510 3:0.03392
AB370334.1.shuffler.markov6
shuffler1 0:len:1815/1814:-0.00596 1:0.17365 2:0.37431 3:0.60786
shuffler2 0:len:1815/1814:-0.00596 1:0.22635 2:0.44393 3:0.68462
shuffler3 0:len:1815/1814:-0.00596 1:0.17183 2:0.39055 3:0.64508
shuffler4 0:len:1815/1814:-0.00596 1:0.19547 2:0.40989 3:0.68326
shuffler5 0:len:1815/1814:-0.00596 1:0.22546 2:0.47043 3:0.73079
shuffler6 0:len:1815/1814:-0.00596 1:0.16121 2:0.37768 3:0.63903
shuffler7 0:len:1815/1814:-0.00596 1:0.18665 2:0.40878 3:0.67257
shuffler8 0:len:1815/1814:-0.00596 1:0.15294 2:0.31107 3:0.52973
shuffler9 0:len:1815/1814:-0.00596 1:0.19034 2:0.40117 3:0.63769
shuffler10 0:len:1815/1814:-0.00596 1:0.20263 2:0.42391 3:0.67148
shuffler11 0:len:1815/1814:-0.00596 1:0.15191 2:0.33647 3:0.53626
Should be fixed! Give it a shot whenever you get a chance.
Sorry, took me a while to back to this project. The off-by-one seems to be fixed. Thanks. I did spot another slight issue -- which you may have alluded to above. When I use a high "k" I can get strings of "A"s in return. E.g.
~/inst/sequence-utils/bin/shuffler -n 10 -f -k6 -m -i ./data/sequences/LM608609.1.fasta -o blah && head -n 50 ./data/sequences/LM608609.1.fasta blah
==> ./data/sequences/LM608609.1.fasta <==
>ENA|LM608609|LM608609.1 TPA: Homo sapiens microRNA hsa-mir-1-1 precursor
TGGGAAACATACTTCTTTATATGCCCATATGGACCTGCTAAGCTATGGAATGTAAAGAAG
TATGTATCTCA
==> blah <==
>ENA|LM608609|LM608609.1 TPA: Homo sapiens microRNA hsa-mir-1-1 precursor
GTATCTTCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>ENA|LM608609|LM608609.1 TPA: Homo sapiens microRNA hsa-mir-1-1 precursor-1
ACATACCTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>ENA|LM608609|LM608609.1 TPA: Homo sapiens microRNA hsa-mir-1-1 precursor-2
GGACCTTGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>ENA|LM608609|LM608609.1 TPA: Homo sapiens microRNA hsa-mir-1-1 precursor-3
CATACTTTCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>ENA|LM608609|LM608609.1 TPA: Homo sapiens microRNA hsa-mir-1-1 precursor-4
GGAAACCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>ENA|LM608609|LM608609.1 TPA: Homo sapiens microRNA hsa-mir-1-1 precursor-5
GCTAAGGCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>ENA|LM608609|LM608609.1 TPA: Homo sapiens microRNA hsa-mir-1-1 precursor-6
CATACTTTCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>ENA|LM608609|LM608609.1 TPA: Homo sapiens microRNA hsa-mir-1-1 precursor-7
TATGGAAATCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>ENA|LM608609|LM608609.1 TPA: Homo sapiens microRNA hsa-mir-1-1 precursor-8
TGTATCCTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>ENA|LM608609|LM608609.1 TPA: Homo sapiens microRNA hsa-mir-1-1 precursor-9
ATATGGCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>ENA|LM608609|LM608609.1 TPA: Homo sapiens microRNA hsa-mir-1-1 precursor-10
CCCATAATAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
Seems to kick in at k=5. Which is certainly pushing the limits of what one should normally do with a Markov method and a short sequence.
Kia ora Ben,
I just noticed with the
shuffler -m
option the output sequences are all shorter than the input sequence by 1 nucleotide for-k
values>1
. Is that an expected feature or a genuine issue?Best wishes, Paul.