dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

BPE Tokenizer doesn't allow reading carriage return symbol #6800

Open MaxGrekhov opened 11 months ago

MaxGrekhov commented 11 months ago

System Information (please complete the following information):

Describe the bug
The BPE tokenizer cannot read a carriage return symbol from the merges file. The file is read with ReadLines (https://github.com/dotnet/machinelearning/blob/077a6b81966dc2c514572568917f36cb94e08ac4/src/Microsoft.ML.Tokenizers/Model/BPE.cs#L297), which treats a carriage return as a line terminator, so a merge entry that contains one is split across lines and fails to parse.
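A minimal sketch of the underlying behavior, independent of the tokenizer. The file content and the entry `x\ry z` are hypothetical, chosen only to show how `File.ReadLines` breaks a single entry at the carriage return:

```csharp
using System;
using System.IO;
using System.Linq;

// Hypothetical merges content: two entries separated by '\n'.
// The second entry, "x\ry z", contains a carriage return inside its first token.
string path = Path.GetTempFileName();
File.WriteAllText(path, "a b\nx\ry z\n");

// File.ReadLines ends a line at "\n", "\r", or "\r\n",
// so the second entry comes back as two separate lines.
string[] lines = File.ReadLines(path).ToArray();
File.Delete(path);

Console.WriteLine(lines.Length); // 3
foreach (string line in lines)
{
    // The fragment "x" has no space separator, which a "left right" parser rejects.
    Console.WriteLine($"[{line}] has separator: {line.Contains(' ')}");
}
```

The fragment without a space is the kind of line that would trigger the "Invalid merger file format" check shown in the stack trace below.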

To Reproduce
Steps to reproduce the behavior:

using System.Text.Json;
using Microsoft.ML.Tokenizers;

var vocabFilePath = @"vocab.json";
var mergeFilePath = @"merges.txt";
try
{
    var tokenizer = new Tokenizer(new Bpe(vocabFilePath, mergeFilePath, "<unk>"));

    var input = " Test String ";

    var tokenizerEncodedResult = tokenizer.Encode(input);
    Console.WriteLine(JsonSerializer.Serialize(tokenizerEncodedResult.Ids));
    var tokenizerDecodedResult = tokenizer.Decode(tokenizerEncodedResult.Ids);
    Console.WriteLine(tokenizerDecodedResult);
}
catch (Exception e)
{
    Console.WriteLine(e);
}

This throws:

System.InvalidOperationException: Invalid merger file format at line: 3869
   at Microsoft.ML.Tokenizers.Bpe.ConvertMergesToHashmap(String mergesFile)
   at Microsoft.ML.Tokenizers.Bpe.ReadFile(String vocab, String merges)
   at Microsoft.ML.Tokenizers.Bpe..ctor(String vocabFile, String mergesFile, String unknownToken, String continuingSubwordPrefix, String endOfWordSuffix)
   at Program.<Main>$(String[] args) in D:\dev\personal\ai\Program.cs:line 8

Expected behavior
BPE should process the merges file without throwing an exception.

Simple solution
Store the merges as a JSON array of strings and deserialize it instead of reading the file line by line:


internal static Vec<(string, string)> ConvertMergesToHashmap(string? mergesFile)
{
    if (mergesFile is null)
    {
        return new Vec<(string, string)>();
    }

    Vec<(string, string)> merges = new(1000);

    int lineNumber = 0;
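    // Deserialize a JSON array of merge entries instead of calling ReadLines,
    // so an entry may contain '\r' without being split into two lines.
    List<string> mergesList;
    using (Stream stream = File.OpenRead(mergesFile))
    {
        // Deserialize returns null for a JSON "null" document; treat that as no merges.
        mergesList = JsonSerializer.Deserialize<List<string>>(stream) ?? new List<string>();
    }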
    foreach (string line in mergesList)
    {
        lineNumber++;
        if (line.StartsWith("#version", StringComparison.Ordinal) || line.Length == 0)
        {
            continue;
        }
        int index = line.IndexOf(' ');
        if (index < 0 || index == line.Length - 1 || line.IndexOf(' ', index + 1) >= 0)
        {
            throw new InvalidOperationException($"Invalid merger file format at line: {lineNumber}");
        }
        merges.Push((line.Substring(0, index), line.Substring(index + 1)));
    }

    return merges;
}
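To illustrate why the JSON approach above avoids the problem, here is a small round-trip sketch. It assumes the merges are stored as a JSON array of strings (the format the snippet above expects); the entries themselves are hypothetical:

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical merge entries; the last one contains a carriage return.
var merges = new List<string> { "#version: 0.2", "a b", "x\ry z" };

// Serialization escapes the carriage return inside the JSON string,
// so it is no longer a raw line terminator in the file.
string json = JsonSerializer.Serialize(merges);

// Deserialization restores the original entry, carriage return included.
List<string> roundTripped =
    JsonSerializer.Deserialize<List<string>>(json) ?? new List<string>();

Console.WriteLine(roundTripped.Count);            // 3
Console.WriteLine(roundTripped[2] == "x\ry z");   // True
```

The trade-off is that the merges file is no longer the plain-text `merges.txt` format, so existing files would need to be converted before loading.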

Additional context
I was able to fix the exception. However, the BPE implementation produces different results from Google's SentencePiece, which is widely used in ML.