Hi, I'm looking at this for tokenizing biological sequences: protein, DNA, and RNA. Their alphabets generally have between 4 and 22 letters. When I run the training procedure, it only finds the base letters as tokens: the vocabulary it produces (viewed with `less vocab.txt`) consists of all the base ASCII characters, even though none of those characters, apart from ACGT, were in the input text. The code I entered was this:

Can you advise?
Yes, there is a parameter during training that adds all the base characters, which you need to disable. For your purpose you want to use the charset `binary`. `getalltokens` has an `-include-single-bytes` parameter; set that to `false`. Then on `trainvocab` you need to set the `-no-reserve-256` flag.
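Roughly like this (a sketch only; the `...` stands for each app's remaining arguments, such as the dataset and output paths, which I haven't spelled out here):

```sh
# "..." = the rest of each app's arguments (dataset, output, etc.)
./getalltokens ... -charset binary -include-single-bytes=false
./trainvocab   ... -charset binary -no-reserve-256
```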
Also, you need to think carefully about all the parameters for each app. Currently you have `-capcode` only on `exporttokens`, which is wrong: it must be on all of them or on none of them. For your case I'd recommend not using capcode.

I see you set `-max-token-length` on `trainvocab` but not on `getalltokens`; it should be set on both.

`-min-occur` is not set, so you'll be using the default value, which is intended for a dataset of around 1GB.

There's probably more; you need to go through each of the parameters and choose appropriate values. Putting the points above together, a consistent run would look something like the sketch below.
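(Again a sketch, not exact commands: the `...` stands for each app's input/output arguments, the numeric values are purely illustrative, and I've shown `-min-occur` on `getalltokens`; check each app's own help output for the full list of flags.)

```sh
# Sketch: "..." = each app's input/output arguments; numbers are illustrative.
# -capcode is deliberately omitted everywhere, so all three apps agree.
./getalltokens ... -charset binary -include-single-bytes=false \
                   -max-token-length 16 -min-occur 10   # scale -min-occur to your dataset size
./trainvocab   ... -charset binary -no-reserve-256 \
                   -max-token-length 16                 # same value as on getalltokens
./exporttokens ...                                      # still no -capcode
```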
Thanks for the help! Would you have a buy-me-a-coffee? I always appreciate people who deal with my silly mistakes.
It worked very well. Thank you again!