bjmt / sequence-utils

Sequence shuffler and other sequence-related utilities.
GNU General Public License v3.0
1 stars 0 forks source link

Markov lengths off-by-one... #3

Open ppgardne opened 2 years ago

ppgardne commented 2 years ago

Kia ora Ben,

I just noticed with the shuffler -m option the output sequences are all shorter than the input sequence by 1 nucleotide for -k values >1. Is that an expected feature or a genuine issue?

Best wishes, Paul.

bjmt commented 2 years ago

Hmm, it's definitely a real issue. The markov method has more issues than just that though, since this was just a playing grounds for the final working implementation that ended up in my more serious project (bjmt/universalmotif). I would recommend just not using the method frankly, and I've been considering just deleting it (I don't think it's worth trying to patch it up).

But I suppose I could instead get around to simply transplanting the functioning version into shuffler. If I get some free time this weekend I'll try and do it.

Thanks for letting me know!

ppgardne commented 2 years ago

I thought your Markov method was doing reasonably well -- at least for low values of k. Below are relative entropies for k-mer frequencies for k=1, 2, & 3 (length in the first column). The (near) zeros are from k-mer counts at or near the expected frequency from the native sequence. Things go a bit skew-if for k=6, which admittedly is likely to be a challenge for any method.

AB370334.1.shuffler.eulerian1
shuffler1   0:len:1815/1815:0.00000 1:0.00000   2:0.02125   3:0.06891   
shuffler2   0:len:1815/1815:0.00000 1:0.00000   2:0.02496   3:0.07848   
shuffler3   0:len:1815/1815:0.00000 1:0.00000   2:0.02436   3:0.07976   
shuffler4   0:len:1815/1815:0.00000 1:0.00000   2:0.03701   3:0.11123   
shuffler5   0:len:1815/1815:0.00000 1:0.00000   2:0.02810   3:0.08294   
shuffler6   0:len:1815/1815:0.00000 1:0.00000   2:0.01717   3:0.07270   
shuffler7   0:len:1815/1815:0.00000 1:0.00000   2:0.02054   3:0.07847   
shuffler8   0:len:1815/1815:0.00000 1:0.00000   2:0.02065   3:0.08219   
shuffler9   0:len:1815/1815:0.00000 1:0.00000   2:0.01972   3:0.06643   
shuffler10  0:len:1815/1815:0.00000 1:0.00000   2:0.03382   3:0.10349   
shuffler11  0:len:1815/1815:0.00000 1:0.00000   2:0.02294   3:0.08499   
AB370334.1.shuffler.eulerian2
shuffler1   0:len:1815/1815:0.00000 1:0.00000   2:0.00000   3:0.03314   
shuffler2   0:len:1815/1815:0.00000 1:0.00000   2:0.00000   3:0.05480   
shuffler3   0:len:1815/1815:0.00000 1:0.00000   2:0.00000   3:0.03681   
shuffler4   0:len:1815/1815:0.00000 1:0.00000   2:0.00000   3:0.03181   
shuffler5   0:len:1815/1815:0.00000 1:0.00000   2:0.00000   3:0.03926   
shuffler6   0:len:1815/1815:0.00000 1:0.00000   2:0.00000   3:0.02567   
shuffler7   0:len:1815/1815:0.00000 1:0.00000   2:0.00000   3:0.03918   
shuffler8   0:len:1815/1815:0.00000 1:0.00000   2:0.00000   3:0.02174   
shuffler9   0:len:1815/1815:0.00000 1:0.00000   2:0.00000   3:0.02583   
shuffler10  0:len:1815/1815:0.00000 1:0.00000   2:0.00000   3:0.03272   
shuffler11  0:len:1815/1815:0.00000 1:0.00000   2:0.00000   3:0.02771   
AB370334.1.shuffler.eulerian6
shuffler1   0:len:1815/1815:0.00000 1:0.00000   2:0.00000   3:0.00000   
shuffler2   0:len:1815/1815:0.00000 1:0.00000   2:0.00000   3:0.00000   
shuffler3   0:len:1815/1815:0.00000 1:0.00000   2:0.00000   3:0.00000   
shuffler4   0:len:1815/1815:0.00000 1:0.00000   2:0.00000   3:0.00000   
shuffler5   0:len:1815/1815:0.00000 1:0.00000   2:0.00000   3:0.00000   
shuffler6   0:len:1815/1815:0.00000 1:0.00000   2:0.00000   3:0.00000   
shuffler7   0:len:1815/1815:0.00000 1:0.00000   2:0.00000   3:0.00000   
shuffler8   0:len:1815/1815:0.00000 1:0.00000   2:0.00000   3:0.00000   
shuffler9   0:len:1815/1815:0.00000 1:0.00000   2:0.00000   3:0.00000   
shuffler10  0:len:1815/1815:0.00000 1:0.00000   2:0.00000   3:0.00000   
shuffler11  0:len:1815/1815:0.00000 1:0.00000   2:0.00000   3:0.00000   
AB370334.1.shuffler.linear1
shuffler1   0:len:1815/1815:0.00000 1:0.00000   2:0.02125   3:0.06891   
shuffler2   0:len:1815/1815:0.00000 1:0.00000   2:0.02496   3:0.07848   
shuffler3   0:len:1815/1815:0.00000 1:0.00000   2:0.02436   3:0.07976   
shuffler4   0:len:1815/1815:0.00000 1:0.00000   2:0.03701   3:0.11123   
shuffler5   0:len:1815/1815:0.00000 1:0.00000   2:0.02810   3:0.08294   
shuffler6   0:len:1815/1815:0.00000 1:0.00000   2:0.01717   3:0.07270   
shuffler7   0:len:1815/1815:0.00000 1:0.00000   2:0.02054   3:0.07847   
shuffler8   0:len:1815/1815:0.00000 1:0.00000   2:0.02065   3:0.08219   
shuffler9   0:len:1815/1815:0.00000 1:0.00000   2:0.01972   3:0.06643   
shuffler10  0:len:1815/1815:0.00000 1:0.00000   2:0.03382   3:0.10349   
shuffler11  0:len:1815/1815:0.00000 1:0.00000   2:0.02294   3:0.08499   
AB370334.1.shuffler.linear2
shuffler1   0:len:1815/1815:0.00000 1:0.00000   2:0.00574   3:0.03757   
shuffler2   0:len:1815/1815:0.00000 1:0.00000   2:0.00524   3:0.04548   
shuffler3   0:len:1815/1815:0.00000 1:0.00000   2:0.00857   3:0.05018   
shuffler4   0:len:1815/1815:0.00000 1:0.00000   2:0.00857   3:0.06539   
shuffler5   0:len:1815/1815:0.00000 1:0.00000   2:0.01004   3:0.04764   
shuffler6   0:len:1815/1815:0.00000 1:0.00000   2:0.00658   3:0.04822   
shuffler7   0:len:1815/1815:0.00000 1:0.00000   2:0.00709   3:0.04325   
shuffler8   0:len:1815/1815:0.00000 1:0.00000   2:0.00584   3:0.05383   
shuffler9   0:len:1815/1815:0.00000 1:0.00000   2:0.00389   3:0.05175   
shuffler10  0:len:1815/1815:0.00000 1:0.00000   2:0.00432   3:0.03693   
shuffler11  0:len:1815/1815:0.00000 1:0.00000   2:0.00566   3:0.05194   
AB370334.1.shuffler.linear6
shuffler1   0:len:1815/1815:0.00000 1:0.00000   2:0.00129   3:0.01620   
shuffler2   0:len:1815/1815:0.00000 1:0.00000   2:0.00133   3:0.01503   
shuffler3   0:len:1815/1815:0.00000 1:0.00000   2:0.00093   3:0.01174   
shuffler4   0:len:1815/1815:0.00000 1:0.00000   2:0.00204   3:0.01839   
shuffler5   0:len:1815/1815:0.00000 1:0.00000   2:0.00242   3:0.01646   
shuffler6   0:len:1815/1815:0.00000 1:0.00000   2:0.00176   3:0.01557   
shuffler7   0:len:1815/1815:0.00000 1:0.00000   2:0.00069   3:0.01191   
shuffler8   0:len:1815/1815:0.00000 1:0.00000   2:0.00229   3:0.01927   
shuffler9   0:len:1815/1815:0.00000 1:0.00000   2:0.00088   3:0.01575   
shuffler10  0:len:1815/1815:0.00000 1:0.00000   2:0.00249   3:0.01862   
shuffler11  0:len:1815/1815:0.00000 1:0.00000   2:0.00044   3:0.01573   
AB370334.1.shuffler.markov1
shuffler1   0:len:1815/1815:0.00000 1:0.00000   2:0.02125   3:0.06891   
shuffler2   0:len:1815/1815:0.00000 1:0.00000   2:0.02496   3:0.07848   
shuffler3   0:len:1815/1815:0.00000 1:0.00000   2:0.02436   3:0.07976   
shuffler4   0:len:1815/1815:0.00000 1:0.00000   2:0.03701   3:0.11123   
shuffler5   0:len:1815/1815:0.00000 1:0.00000   2:0.02810   3:0.08294   
shuffler6   0:len:1815/1815:0.00000 1:0.00000   2:0.01717   3:0.07270   
shuffler7   0:len:1815/1815:0.00000 1:0.00000   2:0.02054   3:0.07847   
shuffler8   0:len:1815/1815:0.00000 1:0.00000   2:0.02065   3:0.08219   
shuffler9   0:len:1815/1815:0.00000 1:0.00000   2:0.01972   3:0.06643   
shuffler10  0:len:1815/1815:0.00000 1:0.00000   2:0.03382   3:0.10349   
shuffler11  0:len:1815/1815:0.00000 1:0.00000   2:0.02294   3:0.08499   
AB370334.1.shuffler.markov2
shuffler1   0:len:1815/1814:-0.00596    1:0.00241   2:0.00858   3:0.05670   
shuffler2   0:len:1815/1814:-0.00596    1:0.00030   2:0.00468   3:0.03325   
shuffler3   0:len:1815/1814:-0.00596    1:0.00534   2:0.01430   3:0.06311   
shuffler4   0:len:1815/1814:-0.00596    1:0.00110   2:0.00420   3:0.04075   
shuffler5   0:len:1815/1814:-0.00596    1:0.00131   2:0.00670   3:0.04310   
shuffler6   0:len:1815/1814:-0.00596    1:0.00020   2:0.00266   3:0.04376   
shuffler7   0:len:1815/1814:-0.00596    1:0.00172   2:0.00550   3:0.04951   
shuffler8   0:len:1815/1814:-0.00596    1:0.00064   2:0.00436   3:0.05107   
shuffler9   0:len:1815/1814:-0.00596    1:0.00306   2:0.00809   3:0.04997   
shuffler10  0:len:1815/1814:-0.00596    1:0.00050   2:0.00411   3:0.04215   
shuffler11  0:len:1815/1814:-0.00596    1:0.00093   2:0.00510   3:0.03392   
AB370334.1.shuffler.markov6
shuffler1   0:len:1815/1814:-0.00596    1:0.17365   2:0.37431   3:0.60786   
shuffler2   0:len:1815/1814:-0.00596    1:0.22635   2:0.44393   3:0.68462   
shuffler3   0:len:1815/1814:-0.00596    1:0.17183   2:0.39055   3:0.64508   
shuffler4   0:len:1815/1814:-0.00596    1:0.19547   2:0.40989   3:0.68326   
shuffler5   0:len:1815/1814:-0.00596    1:0.22546   2:0.47043   3:0.73079   
shuffler6   0:len:1815/1814:-0.00596    1:0.16121   2:0.37768   3:0.63903   
shuffler7   0:len:1815/1814:-0.00596    1:0.18665   2:0.40878   3:0.67257   
shuffler8   0:len:1815/1814:-0.00596    1:0.15294   2:0.31107   3:0.52973   
shuffler9   0:len:1815/1814:-0.00596    1:0.19034   2:0.40117   3:0.63769   
shuffler10  0:len:1815/1814:-0.00596    1:0.20263   2:0.42391   3:0.67148   
shuffler11  0:len:1815/1814:-0.00596    1:0.15191   2:0.33647   3:0.53626   
bjmt commented 2 years ago

Should be fixed! Give it a shot whenever you get a chance.

ppgardne commented 2 years ago

Sorry, took me a while to back to this project. The off-by-one seems to be fixed. Thanks. I did spot another slight issue -- which you may have alluded to above. When I use a high "k" I can get strings of "A"s in return. E.g.

~/inst/sequence-utils/bin/shuffler -n 10 -f -k6 -m  -i  ./data/sequences/LM608609.1.fasta  -o blah && head -n 50 ./data/sequences/LM608609.1.fasta  blah
==> ./data/sequences/LM608609.1.fasta <==
>ENA|LM608609|LM608609.1 TPA: Homo sapiens microRNA hsa-mir-1-1 precursor
TGGGAAACATACTTCTTTATATGCCCATATGGACCTGCTAAGCTATGGAATGTAAAGAAG
TATGTATCTCA

==> blah <==
>ENA|LM608609|LM608609.1 TPA: Homo sapiens microRNA hsa-mir-1-1 precursor
GTATCTTCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>ENA|LM608609|LM608609.1 TPA: Homo sapiens microRNA hsa-mir-1-1 precursor-1
ACATACCTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>ENA|LM608609|LM608609.1 TPA: Homo sapiens microRNA hsa-mir-1-1 precursor-2
GGACCTTGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>ENA|LM608609|LM608609.1 TPA: Homo sapiens microRNA hsa-mir-1-1 precursor-3
CATACTTTCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>ENA|LM608609|LM608609.1 TPA: Homo sapiens microRNA hsa-mir-1-1 precursor-4
GGAAACCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>ENA|LM608609|LM608609.1 TPA: Homo sapiens microRNA hsa-mir-1-1 precursor-5
GCTAAGGCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>ENA|LM608609|LM608609.1 TPA: Homo sapiens microRNA hsa-mir-1-1 precursor-6
CATACTTTCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>ENA|LM608609|LM608609.1 TPA: Homo sapiens microRNA hsa-mir-1-1 precursor-7
TATGGAAATCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>ENA|LM608609|LM608609.1 TPA: Homo sapiens microRNA hsa-mir-1-1 precursor-8
TGTATCCTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>ENA|LM608609|LM608609.1 TPA: Homo sapiens microRNA hsa-mir-1-1 precursor-9
ATATGGCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>ENA|LM608609|LM608609.1 TPA: Homo sapiens microRNA hsa-mir-1-1 precursor-10
CCCATAATAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

Seems to kick in at k=5. Which is certainly pushing the limits of what one should normally do with a Markov method and a short sequence.