Closed by danaderp 3 years ago
The CodeSearchNet dataset has around 1,763,464 unique tokens. Antonis states that taking 20% of the unique tokens (roughly 352,700) is enough to train a simple BPE model.
See this notebook for training steps and usage.
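For readers without the notebook at hand, here is a minimal pure-Python sketch of the idea: sample 20% of the unique tokens, then learn BPE merges on that sample. It is not the notebook's code, which may well use a dedicated tokenizer library. The function names (`sample_unique_tokens`, `train_bpe`, `merge_pair`, `encode`), the `</w>` end-of-word marker, and the fixed random seed are all assumptions made for illustration.

```python
import random
from collections import Counter


def sample_unique_tokens(tokens, fraction=0.2, seed=0):
    """Return a random sample holding `fraction` of the unique tokens (hypothetical helper)."""
    unique = sorted(set(tokens))  # sort so the sample is reproducible for a given seed
    k = max(1, int(len(unique) * fraction))
    return random.Random(seed).sample(unique, k)


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the joined symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Each word starts as a tuple of characters plus an end-of-word marker (assumed convention).
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # nothing left to merge
        # Pick the most frequent pair; ties are broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Split `word` into subword symbols by applying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)
```

Typical usage would be `merges = train_bpe(sample_unique_tokens(all_tokens), num_merges=...)` followed by `encode(token, merges)`, where `all_tokens` stands for whatever token list is extracted from the dataset and the number of merges is a tuning choice not specified in this thread.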