dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
MIT License

BPE Tokenizer doesn't allow reading carriage return symbol #6800

Open MaxGrekhov opened 11 months ago

MaxGrekhov commented 11 months ago

Describe the bug
The BPE tokenizer cannot read a carriage return symbol from the merges file. The merges file is read with ReadLines, which treats a carriage return as a line terminator, so a merge entry that contains this symbol never reaches the parser intact.
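The line-splitting behavior can be checked in isolation. The sketch below is only an illustration, not the tokenizer's code: the file content is hypothetical, and it assumes a merge entry whose right-hand token is a lone carriage return. It relies on the documented behavior of `File.ReadLines`, which treats `\n`, `\r`, and `\r\n` as line terminators.

```csharp
using System;
using System.IO;
using System.Linq;

// Hypothetical merges content: the second entry is "x \r",
// i.e. the right-hand token is a lone carriage return.
string path = Path.GetTempFileName();
File.WriteAllText(path, "#version: 0.2\nx \r\nb c\n");

// "\r\n" is consumed as a single line terminator, so the '\r' token is lost.
string[] lines = File.ReadLines(path).ToArray();
File.Delete(path);

Console.WriteLine(lines.Length);     // 3
Console.WriteLine($"[{lines[1]}]");  // [x ]

// The entry now ends with its only space, which is exactly the
// "index == line.Length - 1" case that the merges parser rejects.
int index = lines[1].IndexOf(' ');
Console.WriteLine(index == lines[1].Length - 1); // True
```

Under that assumption, the truncated entry would produce the "Invalid merger file format" exception reported below.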

To Reproduce
Steps to reproduce the behavior:

using System.Text.Json;
using Microsoft.ML.Tokenizers;

var vocabFilePath = @"vocab.json";
var mergeFilePath = @"merges.txt";
try
{
    var tokenizer = new Tokenizer(new Bpe(vocabFilePath, mergeFilePath, "<unk>"));

    var input = " Test String ";

    var tokenizerEncodedResult = tokenizer.Encode(input);
    var tokenizerDecodedResult = tokenizer.Decode(tokenizerEncodedResult.Ids);
}
catch (Exception e)
{
    // Assumed handler body: print the exception, which gives the output below.
    Console.WriteLine(e);
}

System.InvalidOperationException: Invalid merger file format at line: 3869
   at Microsoft.ML.Tokenizers.Bpe.ConvertMergesToHashmap(String mergesFile)
   at Microsoft.ML.Tokenizers.Bpe.ReadFile(String vocab, String merges)
   at Microsoft.ML.Tokenizers.Bpe..ctor(String vocabFile, String mergesFile, String unknownToken, String continuingSubwordPrefix, String endOfWordSuffix)
   at Program.<Main>$(String[] args) in D:\dev\personal\ai\Program.cs:line 8

Expected behavior
BPE should process the merges file without throwing an exception.

Simple solution
Read the merges from a JSON file (an array of strings) instead of a line-based text file:

// Requires: using System.Collections.Generic; using System.IO; using System.Text.Json;
internal static Vec<(string, string)> ConvertMergesToHashmap(string? mergesFile)
{
    if (mergesFile is null)
        return new Vec<(string, string)>();

    Vec<(string, string)> merges = new(1000);

    int lineNumber = 0;
    List<string> mergesList;
    // Read the merges as a JSON array of strings instead of using File.ReadLines.
    using (Stream stream = File.OpenRead(mergesFile))
        mergesList = JsonSerializer.Deserialize<List<string>>(stream) ?? new List<string>();

    foreach (string line in mergesList)
    {
        lineNumber++;

        // Skip the version header and empty entries.
        if (line.StartsWith("#version", StringComparison.Ordinal) || line.Length == 0)
            continue;

        // Each entry must contain exactly one space separating the two tokens.
        int index = line.IndexOf(' ');
        if (index < 0 || index == line.Length - 1 || line.IndexOf(' ', index + 1) >= 0)
            throw new InvalidOperationException($"Invalid merger file format at line: {lineNumber}");

        merges.Push((line.Substring(0, index), line.Substring(index + 1)));
    }

    return merges;
}
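
As a quick check of why the JSON route avoids the problem, the following sketch round-trips a small, made-up merges list through `System.Text.Json`. The entries and the temporary file are hypothetical; the point is only that JSON escapes the carriage return as `\r` inside the string, so it comes back unchanged.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.Json;

// Hypothetical merges list; the second entry's right-hand token is '\r'.
var merges = new List<string> { "#version: 0.2", "x \r", "b c" };

string path = Path.GetTempFileName();
File.WriteAllText(path, JsonSerializer.Serialize(merges));

// Read the list back the same way the modified ConvertMergesToHashmap does.
List<string> roundTripped;
using (Stream stream = File.OpenRead(path))
    roundTripped = JsonSerializer.Deserialize<List<string>>(stream) ?? new List<string>();
File.Delete(path);

Console.WriteLine(roundTripped.Count);          // 3
Console.WriteLine(roundTripped[1] == "x \r");   // True
```

This approach requires the merges file to be converted to a JSON array first, so it is a workaround and not a drop-in fix for existing merges.txt files.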

Additional context
I was able to fix the exception. However, the BPE algorithm still produces different results from Google's SentencePiece, which is widely used in ML.