using System.Text.Json;
using Microsoft.ML.Tokenizers;

var vocabFilePath = @"vocab.json";
var mergeFilePath = @"merges.txt";
try
{
    var tokenizer = new Tokenizer(new Bpe(vocabFilePath, mergeFilePath, "<unk>"));
    var input = " Test String ";
    var tokenizerEncodedResult = tokenizer.Encode(input);
    Console.WriteLine(JsonSerializer.Serialize(tokenizerEncodedResult.Ids));
    var tokenizerDecodedResult = tokenizer.Decode(tokenizerEncodedResult.Ids);
    Console.WriteLine(tokenizerDecodedResult);
}
catch (Exception e)
{
    Console.WriteLine(e);
}
System.InvalidOperationException: Invalid merger file format at line: 3869
   at Microsoft.ML.Tokenizers.Bpe.ConvertMergesToHashmap(String mergesFile)
   at Microsoft.ML.Tokenizers.Bpe.ReadFile(String vocab, String merges)
   at Microsoft.ML.Tokenizers.Bpe..ctor(String vocabFile, String mergesFile, String unknownToken, String continuingSubwordPrefix, String endOfWordSuffix)
   at Program.<Main>$(String[] args) in D:\dev\personal\ai\Program.cs:line 8
Expected behavior
BPE should process the merges file without throwing an exception.
Simple solution
Read the merges from a JSON array of strings instead of a line-based text file:
// Reads the merges from a JSON array of "left right" strings instead of a line-based text file.
internal static Vec<(string, string)> ConvertMergesToHashmap(string? mergesFile)
{
    if (mergesFile is null)
    {
        return new Vec<(string, string)>();
    }

    Vec<(string, string)> merges = new(1000);
    int lineNumber = 0;

    List<string> mergesList;
    using (Stream stream = File.OpenRead(mergesFile))
    {
        // Deserialize returns null for a JSON "null" document; treat that as an empty merge list.
        mergesList = JsonSerializer.Deserialize<List<string>>(stream) ?? new List<string>();
    }

    foreach (string line in mergesList)
    {
        lineNumber++;
        if (line.StartsWith("#version", StringComparison.Ordinal) || line.Length == 0)
        {
            continue;
        }

        // Each entry must contain exactly one space, separating the two tokens of the merge.
        int index = line.IndexOf(' ');
        if (index < 0 || index == line.Length - 1 || line.IndexOf(' ', index + 1) >= 0)
        {
            throw new InvalidOperationException($"Invalid merger file format at line: {lineNumber}");
        }

        merges.Push((line.Substring(0, index), line.Substring(index + 1)));
    }

    return merges;
}
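To illustrate why a JSON array sidesteps the problem, here is a minimal, standalone sketch (not part of Microsoft.ML.Tokenizers). The merge entries are made-up sample data, and the tuple list stands in for the library's internal `Vec` type. It assumes that a merge token may contain a raw carriage return, as this report describes.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical sample merges; the second entry's left token ends with '\r'.
var sampleMerges = new List<string> { "a b", "c\r d", "e f" };

// JSON string literals escape control characters, so the serialized
// document contains no raw carriage return that a reader could split on.
string json = JsonSerializer.Serialize(sampleMerges);

// Reading the array back restores every entry unchanged.
List<string> restored =
    JsonSerializer.Deserialize<List<string>>(json) ?? new List<string>();

// Split each entry on its single space, as ConvertMergesToHashmap does.
var pairs = new List<(string Left, string Right)>();
foreach (string entry in restored)
{
    int index = entry.IndexOf(' ');
    pairs.Add((entry.Substring(0, index), entry.Substring(index + 1)));
}

Console.WriteLine(pairs.Count);
Console.WriteLine(pairs[1].Left.EndsWith("\r", StringComparison.Ordinal));
```

The trade-off is that existing merges.txt files would have to be converted to JSON before they can be loaded this way.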
Additional context
With the change above I was able to fix the exception. However, the BPE algorithm still generates different results from Google's SentencePiece, which is widely used in ML.
Describe the bug
The BPE tokenizer cannot read a merges file that contains the carriage return symbol.
https://github.com/dotnet/machinelearning/blob/077a6b81966dc2c514572568917f36cb94e08ac4/src/Microsoft.ML.Tokenizers/Model/BPE.cs#L297
The merges file is read with ReadLines, which treats a carriage return as a line terminator, so a merge entry that contains that symbol cannot be read as a single line.
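The line-splitting behavior can be reproduced in isolation. The sketch below is standalone and does not use the tokenizer library. The file content is made-up sample data, and splitting on '\n' only is shown purely as a contrast, not as the library's behavior or a proposed fix.

```csharp
using System;
using System.IO;
using System.Linq;

// Hypothetical merges content: the second entry's left token ends with '\r'.
string content = "a b\nc\r d\ne f\n";
string path = Path.GetTempFileName();
File.WriteAllText(path, content);

// ReadLines treats "\r", "\n" and "\r\n" as line terminators,
// so the entry "c\r d" comes back as two lines: "c" and " d".
string[] viaReadLines = File.ReadLines(path).ToArray();

// Splitting on '\n' only keeps the carriage return inside the entry.
string[] viaNewlineSplit = File.ReadAllText(path)
    .Split('\n', StringSplitOptions.RemoveEmptyEntries);

File.Delete(path);

Console.WriteLine(viaReadLines.Length);    // 4
Console.WriteLine(viaNewlineSplit.Length); // 3
```

The fragment "c" contains no space, so it fails the same format check that raises "Invalid merger file format" in ConvertMergesToHashmap.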
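To Reproduce
Steps to reproduce the behavior: run the program at the top of this report with a merges.txt that contains a carriage return symbol. The Bpe constructor throws the InvalidOperationException shown in the stack trace.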