Open-Deep-ML / DML-OpenProblem

Other
205 stars 55 forks source link

Byte Pair Encoding (NLP) #196

Open KushGrandhi opened 3 days ago

KushGrandhi commented 3 days ago

Byte Pair Encoding (BPE) for Subword Tokenization

Problem Statement: Design and implement an algorithm to tokenize a given corpus into subword units based on the frequency of adjacent character pairs using the Byte Pair Encoding (BPE) method. The algorithm should iteratively merge the most frequent character pairs in the corpus to create a compact and meaningful token vocabulary.

The task is to implement BPE by:

moe18 commented 3 days ago

Sounds like a good addition

KushGrandhi commented 2 days ago

sharing this just for reference https://github.com/Open-Deep-ML/DML-OpenProblem/pull/199