Tokenize strings of only N-types of characters?

ianderrington commented 1 year ago

Hi, I'm looking at this for tokenizing biological sequences: protein, DNA, RNA. These generally have between 4 and 22 letters. When I use the procedure, it only finds the base letters as tokens. The vocabulary that is produced consists of the base ASCII characters. Output of less vocab.txt:

^A
^B
^C
^D
^E
^F
^G
^H

^K
^L
^M
^N
^O
^P
^Q
^R
^S
^T
^U
^V
^W
^X
^Y
^Z
^[
^\
^]
^^
^_

!
"
#
$
%
&
'
(
)
*
+
,
-
.
/
0
1
2
3
4
5
6
7
8
9
:
;
<
=
>
?
@
A
B
C
D
...

None of those characters, except for A, C, G and T, were in the input text. The commands I ran were:

./getalltokens -charset utf8 -chunk-size 10000 -dataset my_data.txt -min-occur-chunk 2 -output vocab.vc -workers 2
./trainvocab -charset utf-8 -dataset my_data.txt -dir vocab_dir -dictionary vocab.vc -max-token-length 64 -vocab 4095
./exporttokens ./vocab_dir/794556_5737.zlib vocab -capcode -charset UTF-8 -txt

Can you advise?

alasdairforsythe commented 1 year ago

Yes, there is a parameter during training that adds all base characters, which you need to disable.

For your purpose, make sure you use charset "binary". getalltokens has an -include-single-bytes parameter; set that to false. Then on trainvocab, set the -no-reserve-256 flag.
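To confirm whether a retrained vocabulary still carries tokens that never occur in the data, a check along these lines may help. It is only a sketch: the two sample files are made up for illustration, it assumes the exported vocabulary is a text file with one token per line (as in the listing above), and it assumes the alphabet consists of plain letters, so that it can be dropped into a grep bracket expression unescaped.

```shell
#!/bin/sh
# Sketch: compare the characters found in a dataset with the tokens in an
# exported vocabulary. Both sample files are invented for this example.
tmp=$(mktemp -d)
printf 'ACGTTGCA\nGGCATCGA\n' > "$tmp/my_data.txt"
printf 'A\nC\nG\nT\nGC\n!\n#\n' > "$tmp/vocab.txt"

# Distinct characters in the dataset, with newlines stripped first.
alphabet=$(LC_ALL=C tr -d '\n' < "$tmp/my_data.txt" \
  | LC_ALL=C fold -w1 | LC_ALL=C sort -u | tr -d '\n')
echo "alphabet: $alphabet"

# Count vocabulary lines containing any character outside that alphabet.
# Assumes the alphabet has no characters special to bracket expressions.
stray=$(LC_ALL=C grep -c -v "^[$alphabet]*\$" "$tmp/vocab.txt")
echo "tokens outside alphabet: $stray"

rm -rf "$tmp"
```

With real data, point the two paths at the actual dataset and the exported text vocabulary; a non-zero count suggests the base characters are still being added.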

alasdairforsythe commented 1 year ago

Also, you need to think carefully about all the parameters for each app. Currently you have -capcode only on exporttokens, which is wrong: it must be on all of them or none of them. For your case I'd recommend not using capcode.

I see you set -max-token-length on trainvocab but not on getalltokens; it should be set on both.

min-occur is not set, so you'll be using the default value, which is intended for a 1GB dataset.

There's probably more; you need to go through each of the parameters and choose the appropriate values.
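Putting the points above together, the three steps might look roughly like the following. This is a sketch, not a verified invocation: it uses only flag names mentioned in this thread, but the "=false" syntax for -include-single-bytes, the placement of -min-occur on getalltokens, the MIN_OCCUR value, the use of -charset binary on exporttokens, and the VOCAB_FILE path are all assumptions. It also assumes capcode stays off when the flag is omitted. The script only prints the commands so the parameters can be reviewed side by side before anything is run.

```shell
#!/bin/sh
# Dry run: builds and prints the three commands without executing them.
DATASET=my_data.txt
MAX_LEN=64        # same value passed to getalltokens and trainvocab
MIN_OCCUR=10      # hypothetical; choose a value that suits the dataset size
VOCAB_FILE=./vocab_dir/RESULT.zlib   # placeholder for the file trainvocab writes

# Step 1: binary charset, single bytes excluded (boolean syntax is assumed).
step1="./getalltokens -charset binary -chunk-size 10000 -dataset $DATASET -min-occur $MIN_OCCUR -min-occur-chunk 2 -max-token-length $MAX_LEN -include-single-bytes=false -output vocab.vc -workers 2"

# Step 2: same charset and max token length, no reserved 256 byte tokens.
step2="./trainvocab -charset binary -dataset $DATASET -dir vocab_dir -dictionary vocab.vc -max-token-length $MAX_LEN -no-reserve-256 -vocab 4095"

# Step 3: export as text, leaving the capcode flag off as in steps 1 and 2.
step3="./exporttokens $VOCAB_FILE vocab -charset binary -txt"

printf '%s\n' "$step1" "$step2" "$step3"
```

Keeping the shared values in variables is one way to stop the charset and the maximum token length from drifting apart between steps.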

ianderrington commented 1 year ago

Thanks for the help! Would you have a buy-me-a-coffee? I always appreciate those dealing with my silly mistakes.

ianderrington commented 1 year ago

It worked very well. Thank you again!

alasdairforsythe / tokenmonster

Tokenize strings of only N-types of characters? #8