Closed enzomich closed 2 weeks ago
Are you sure the file encoding is utf8 ?
Yes, I am. In the example above the prompt came from a pipe, but here is one where the command line redirects main's stdin to a UTF-8 encoded text file, ..\Texts\TranslateGreek.txt, containing
Translate "Σήμερον ἐστὶν εὔδια ἡμέρα"
(meaning "Today is a fair day"). It was prepared with Notepad, which lets you specify the encoding, and Python agrees that the file is indeed UTF-8 encoded:
C:\Users\enzom\AI\LlamaFeeder>Python
Python 3.12.2 (tags/v3.12.2:6abddd9, Feb 6 2024, 21:26:36) [MSC v.1937 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> with open(r'..\Texts\TranslateGreek.txt', 'r', encoding='utf-8') as file:
...     print(file.read())
...
Translate "Σήμερον ἐστὶν εὔδια ἡμέρα"
>>>
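Printing the decoded text shows that the file decodes as UTF-8, but it does not show whether Notepad also wrote a byte order mark (BOM). A stricter check, as a minimal sketch, reads the raw bytes, decodes them strictly, and looks for the BOM. The helper name `check_utf8` and the temporary file are hypothetical stand-ins for the real file, not part of the original report.

```python
import codecs
import os
import tempfile


def check_utf8(path):
    """Return (is_valid_utf8, has_bom) for the file at `path`."""
    with open(path, "rb") as f:
        raw = f.read()
    has_bom = raw.startswith(codecs.BOM_UTF8)
    try:
        raw.decode("utf-8", errors="strict")
        is_valid = True
    except UnicodeDecodeError:
        is_valid = False
    return is_valid, has_bom


# Hypothetical stand-in for ..\Texts\TranslateGreek.txt
sample = 'Translate "Σήμερον ἐστὶν εὔδια ἡμέρα"\n'
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(sample.encode("utf-8"))
    tmp_path = tmp.name

result = check_utf8(tmp_path)
os.remove(tmp_path)
print(result)
```

Running the same helper on the real file would show whether a BOM precedes the prompt. This is one more thing worth ruling out when the reader on the other side is byte-oriented.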
However, when main's stdin is redirected to that file, the result is garbage:
C:\Users\enzom\AI\LlamaFeeder>\Users\enzom\AI\llama.cpp\llama-b2391-bin-win-cublas-cu12.2.0-x64\main -m \Users\enzom\AI\llama.cpp\Models\mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf --simple-io --instruct --temp 0.1 < ..\Texts\TranslateGreek.txt
Log start
main: build = 2391 (7ab7b733)
[...]
- If you want to submit another line, end your input with '\'.
> The English translation of "ΣήμεÏον á¼ÏÏὶν εá½Î´Î¹Î± ἡμÎÏα" is "The children are playing football."
>
>
> Trans
> late
> "
> Î
> £
[...] <-- Killed with Ctrl-C
llama_print_timings: load time = 2992.67 ms
llama_print_timings: sample time = 10.92 ms / 109 runs ( 0.10 ms per token, 9978.94 tokens per second)
llama_print_timings: prompt eval time = 11862.09 ms / 77 tokens ( 154.05 ms per token, 6.49 tokens per second)
llama_print_timings: eval time = 23289.80 ms / 109 runs ( 213.67 ms per token, 4.68 tokens per second)
llama_print_timings: total time = 35433.63 ms / 186 tokens
Instead of "Σήμερον ἐστὶν εὔδια ἡμέρα", main reads "ΣήμεÏον á¼ÏÏὶν εá½Î´Î¹Î± ἡμÎÏα". And that's what is read by Python opening the file as if it were ISO-8859-1 encoded:
C:\Users\enzom\AI\LlamaFeeder>Python
Python 3.12.2 (tags/v3.12.2:6abddd9, Feb 6 2024, 21:26:36) [MSC v.1937 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> with open(r'..\Texts\TranslateGreek.txt', 'r', encoding='iso-8859-1') as file:
...     print(file.read())
...
Translate "ΣήμεÏον á¼ÏÏὶν εá½Î´Î¹Î± ἡμÎÏα"
>>>
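The same effect can be reproduced without any file, which suggests that the bytes themselves are intact and only their interpretation is wrong. The sketch below encodes the prompt as UTF-8, decodes those bytes as ISO-8859-1 (one character per byte), and then reverses the mistake. The variable names are illustrative only.

```python
original = 'Translate "Σήμερον ἐστὶν εὔδια ἡμέρα"'

utf8_bytes = original.encode("utf-8")

# Misinterpretation: every byte is mapped to one ISO-8859-1 character.
garbled = utf8_bytes.decode("iso-8859-1")

# The mistake is lossless, so it can be undone by reversing the two steps.
recovered = garbled.encode("iso-8859-1").decode("utf-8")

print(len(original), len(utf8_bytes), len(garbled))
print(recovered == original)
```

Because the garbled string has exactly one character per UTF-8 byte, the ASCII part of the prompt survives unchanged while each Greek letter turns into two or three unrelated characters.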
@enzomich looks like we only set the output codepage to utf8, try adding
SetConsoleCP(CP_UTF8);
next to this line: https://github.com/ggerganov/llama.cpp/blob/076b08649ecc3b0e1c0709c2a086a63eddd1bf32/common/console.cpp#L89
I am not sure how it will affect other parts like special char inputs in non-piped scenarios.
@Green-Sky that fix didn't work, but this one did. I inserted the following after line 96 (just before the #else):
    if (simple_io) {
        _setmode(_fileno(stdin), _O_U8TEXT);
    }
From what I understand (but I may be wrong, not being very familiar with Windows), SetConsoleCP(...) affects only the console, whereas _setmode(_fileno(stdin), ...) affects the stdin file descriptor even when the input is redirected away from the console to a file or a pipe, as in this case.
So, as far as I'm concerned, this issue can be considered closed once the fix is merged into the code.
I can confirm that the bug is present when piping to main, and that the code presented by enzomich solves the issue.
@Green-Sky, as @misureaudio has confirmed both this issue and my fix, is there any chance of raising the status to "confirmed" and, eventually, having the fix accepted and merged?
This issue was closed because it has been inactive for 14 days since being marked as stale.
why is this still not fixed? @ggerganov
Please open a PR with the proposed fix and we'll merge it. I don't have a Windows environment to test this.
@enzomich can you do a PR?
I'm a bit busy these days, but I'll try.
Hi, Enzo Michelangeli proposed the following correction:
    if (simple_io) { _setmode(_fileno(stdin), _O_U8TEXT); }
Simply insert the snippet after line 96 in console.cpp.
It works.
The corrected console.cpp is attached here.
GMP
On Sun, 29 Sep 2024 at 16:08, Enzo Michelangeli <@.***> wrote:
@enzomich can you do a PR?
I'm a bit busy these days, but I'll try.
namespace console {
//
// Console state
//
static bool advanced_display = false;
static bool simple_io = true;
static display_t current_display = reset;
static FILE* out = stdout;
#if defined(_WIN32)
static void* hConsole;
#else
static FILE* tty = nullptr;
static termios initial_state;
#endif
//
// Init and cleanup
//
void init(bool use_simple_io, bool use_advanced_display) {
advanced_display = use_advanced_display;
simple_io = use_simple_io;
#if defined(_WIN32)
// Windows-specific console initialization
DWORD dwMode = 0;
hConsole = GetStdHandle(STD_OUTPUT_HANDLE);
if (hConsole == INVALID_HANDLE_VALUE || !GetConsoleMode(hConsole, &dwMode)) {
hConsole = GetStdHandle(STD_ERROR_HANDLE);
if (hConsole != INVALID_HANDLE_VALUE && (!GetConsoleMode(hConsole, &dwMode))) {
hConsole = nullptr;
simple_io = true;
}
}
if (hConsole) {
// Check conditions combined to reduce nesting
if (advanced_display && !(dwMode & ENABLE_VIRTUAL_TERMINAL_PROCESSING) &&
!SetConsoleMode(hConsole, dwMode | ENABLE_VIRTUAL_TERMINAL_PROCESSING)) {
advanced_display = false;
}
// Set console output codepage to UTF8
SetConsoleOutputCP(CP_UTF8);
}
HANDLE hConIn = GetStdHandle(STD_INPUT_HANDLE);
if (hConIn != INVALID_HANDLE_VALUE && GetConsoleMode(hConIn, &dwMode)) {
// Set console input codepage to UTF16
_setmode(_fileno(stdin), _O_WTEXT);
// Set ICANON (ENABLE_LINE_INPUT) and ECHO (ENABLE_ECHO_INPUT)
if (simple_io) {
dwMode |= ENABLE_LINE_INPUT | ENABLE_ECHO_INPUT;
} else {
dwMode &= ~(ENABLE_LINE_INPUT | ENABLE_ECHO_INPUT);
}
if (!SetConsoleMode(hConIn, dwMode)) {
simple_io = true;
}
}
// Proposed fix: read stdin as UTF-8 in simple-io mode, also when it is redirected to a file or a pipe
if (simple_io) {
_setmode(_fileno(stdin), _O_U8TEXT);
}
#else
// POSIX-specific console initialization
if (!simple_io) {
struct termios new_termios;
tcgetattr(STDIN_FILENO, &initial_state);
new_termios = initial_state;
new_termios.c_lflag &= ~(ICANON | ECHO);
new_termios.c_cc[VMIN] = 1;
new_termios.c_cc[VTIME] = 0;
tcsetattr(STDIN_FILENO, TCSANOW, &new_termios);
tty = fopen("/dev/tty", "w+");
if (tty != nullptr) {
out = tty;
}
}
setlocale(LC_ALL, "");
#endif
}
void cleanup() {
// Reset console display
set_display(reset);
#if !defined(_WIN32)
// Restore settings on POSIX systems
if (!simple_io) {
if (tty != nullptr) {
out = stdout;
fclose(tty);
tty = nullptr;
}
tcsetattr(STDIN_FILENO, TCSANOW, &initial_state);
}
#endif
}
//
// Display and IO
//
// Keep track of current display and only emit ANSI code if it changes
void set_display(display_t display) {
if (advanced_display && current_display != display) {
fflush(stdout);
switch(display) {
case reset:
fprintf(out, ANSI_COLOR_RESET);
break;
case prompt:
fprintf(out, ANSI_COLOR_YELLOW);
break;
case user_input:
fprintf(out, ANSI_BOLD ANSI_COLOR_GREEN);
break;
case error:
fprintf(out, ANSI_BOLD ANSI_COLOR_RED);
}
current_display = display;
fflush(out);
}
}
static char32_t getchar32() {
#if defined(_WIN32)
HANDLE hConsole = GetStdHandle(STD_INPUT_HANDLE);
wchar_t high_surrogate = 0;
while (true) {
INPUT_RECORD record;
DWORD count;
if (!ReadConsoleInputW(hConsole, &record, 1, &count) || count == 0) {
return WEOF;
}
if (record.EventType == KEY_EVENT && record.Event.KeyEvent.bKeyDown) {
wchar_t wc = record.Event.KeyEvent.uChar.UnicodeChar;
if (wc == 0) {
continue;
}
if ((wc >= 0xD800) && (wc <= 0xDBFF)) { // Check if wc is a high surrogate
high_surrogate = wc;
continue;
}
if ((wc >= 0xDC00) && (wc <= 0xDFFF)) { // Check if wc is a low surrogate
if (high_surrogate != 0) { // Check if we have a high surrogate
return ((high_surrogate - 0xD800) << 10) + (wc - 0xDC00) + 0x10000;
}
}
high_surrogate = 0; // Reset the high surrogate
return static_cast<char32_t>(wc);
}
}
#else
wchar_t wc = getwchar();
if (static_cast<wint_t>(wc) == WEOF) {
return WEOF;
}
if ((wc >= 0xD800) && (wc <= 0xDBFF)) { // Check if wc is a high surrogate
wchar_t low_surrogate = getwchar();
if ((low_surrogate >= 0xDC00) && (low_surrogate <= 0xDFFF)) { // Check if the next wchar is a low surrogate
return (static_cast<char32_t>(wc & 0x03FF) << 10) + (low_surrogate & 0x03FF) + 0x10000;
}
}
if ((wc >= 0xD800) && (wc <= 0xDFFF)) { // Invalid surrogate pair
return 0xFFFD; // Return the replacement character U+FFFD
}
return static_cast<char32_t>(wc);
#endif
}
static void pop_cursor() {
#if defined(_WIN32)
if (hConsole != NULL) {
CONSOLE_SCREEN_BUFFER_INFO bufferInfo;
GetConsoleScreenBufferInfo(hConsole, &bufferInfo);
COORD newCursorPosition = bufferInfo.dwCursorPosition;
if (newCursorPosition.X == 0) {
newCursorPosition.X = bufferInfo.dwSize.X - 1;
newCursorPosition.Y -= 1;
} else {
newCursorPosition.X -= 1;
}
SetConsoleCursorPosition(hConsole, newCursorPosition);
return;
}
#endif
putc('\b', out);
}
static int estimateWidth(char32_t codepoint) {
#if defined(_WIN32)
(void)codepoint;
return 1;
#else
return wcwidth(codepoint);
#endif
}
static int put_codepoint(const char* utf8_codepoint, size_t length, int expectedWidth) {
#if defined(_WIN32)
CONSOLE_SCREEN_BUFFER_INFO bufferInfo;
if (!GetConsoleScreenBufferInfo(hConsole, &bufferInfo)) {
// go with the default
return expectedWidth;
}
COORD initialPosition = bufferInfo.dwCursorPosition;
DWORD nNumberOfChars = length;
WriteConsole(hConsole, utf8_codepoint, nNumberOfChars, &nNumberOfChars, NULL);
CONSOLE_SCREEN_BUFFER_INFO newBufferInfo;
GetConsoleScreenBufferInfo(hConsole, &newBufferInfo);
// Figure out our real position if we're in the last column
if (utf8_codepoint[0] != 0x09 && initialPosition.X == newBufferInfo.dwSize.X - 1) {
DWORD nNumberOfChars;
WriteConsole(hConsole, &" \b", 2, &nNumberOfChars, NULL);
GetConsoleScreenBufferInfo(hConsole, &newBufferInfo);
}
int width = newBufferInfo.dwCursorPosition.X - initialPosition.X;
if (width < 0) {
width += newBufferInfo.dwSize.X;
}
return width;
#else
// We can trust expectedWidth if we've got one
if (expectedWidth >= 0 || tty == nullptr) {
fwrite(utf8_codepoint, length, 1, out);
return expectedWidth;
}
fputs("\033[6n", tty); // Query cursor position
int x1;
int y1;
int x2;
int y2;
int results = 0;
results = fscanf(tty, "\033[%d;%dR", &y1, &x1);
fwrite(utf8_codepoint, length, 1, tty);
fputs("\033[6n", tty); // Query cursor position
results += fscanf(tty, "\033[%d;%dR", &y2, &x2);
if (results != 4) {
return expectedWidth;
}
int width = x2 - x1;
if (width < 0) {
// Calculate the width considering text wrapping
struct winsize w;
ioctl(STDOUT_FILENO, TIOCGWINSZ, &w);
width += w.ws_col;
}
return width;
#endif
}
static void replace_last(char ch) {
#if defined(_WIN32)
pop_cursor();
put_codepoint(&ch, 1, 1);
#else
fprintf(out, "\b%c", ch);
#endif
}
static void append_utf8(char32_t ch, std::string & out) {
if (ch <= 0x7F) {
out.push_back(static_cast<unsigned char>(ch));
} else if (ch <= 0x7FF) {
out.push_back(static_cast<unsigned char>(0xC0 | ((ch >> 6) & 0x1F)));
out.push_back(static_cast<unsigned char>(0x80 | (ch & 0x3F)));
} else if (ch <= 0xFFFF) {
out.push_back(static_cast<unsigned char>(0xE0 | ((ch >> 12) & 0x0F)));
out.push_back(static_cast<unsigned char>(0x80 | ((ch >> 6) & 0x3F)));
out.push_back(static_cast<unsigned char>(0x80 | (ch & 0x3F)));
} else if (ch <= 0x10FFFF) {
out.push_back(static_cast<unsigned char>(0xF0 | ((ch >> 18) & 0x07)));
out.push_back(static_cast<unsigned char>(0x80 | ((ch >> 12) & 0x3F)));
out.push_back(static_cast<unsigned char>(0x80 | ((ch >> 6) & 0x3F)));
out.push_back(static_cast<unsigned char>(0x80 | (ch & 0x3F)));
} else {
// Invalid Unicode code point
}
}
// Helper function to remove the last UTF-8 character from a string
static void pop_back_utf8_char(std::string & line) {
if (line.empty()) {
return;
}
size_t pos = line.length() - 1;
// Find the start of the last UTF-8 character (checking up to 4 bytes back)
for (size_t i = 0; i < 3 && pos > 0; ++i, --pos) {
if ((line[pos] & 0xC0) != 0x80) {
break; // Found the start of the character
}
}
line.erase(pos);
}
static bool readline_advanced(std::string & line, bool multiline_input) {
if (out != stdout) {
fflush(stdout);
}
line.clear();
std::vector<int> widths;
bool is_special_char = false;
bool end_of_stream = false;
char32_t input_char;
while (true) {
fflush(out); // Ensure all output is displayed before waiting for input
input_char = getchar32();
if (input_char == '\r' || input_char == '\n') {
break;
}
if (input_char == (char32_t) WEOF || input_char == 0x04 /* Ctrl+D*/) {
end_of_stream = true;
break;
}
if (is_special_char) {
set_display(user_input);
replace_last(line.back());
is_special_char = false;
}
if (input_char == '\033') { // Escape sequence
char32_t code = getchar32();
if (code == '[' || code == 0x1B) {
// Discard the rest of the escape sequence
while ((code = getchar32()) != (char32_t) WEOF) {
if ((code >= 'A' && code <= 'Z') || (code >= 'a' && code <= 'z') || code == '~') {
break;
}
}
}
} else if (input_char == 0x08 || input_char == 0x7F) { // Backspace
if (!widths.empty()) {
int count;
do {
count = widths.back();
widths.pop_back();
// Move cursor back, print space, and move cursor back again
for (int i = 0; i < count; i++) {
replace_last(' ');
pop_cursor();
}
pop_back_utf8_char(line);
} while (count == 0 && !widths.empty());
}
} else {
int offset = line.length();
append_utf8(input_char, line);
int width = put_codepoint(line.c_str() + offset, line.length() - offset, estimateWidth(input_char));
if (width < 0) {
width = 0;
}
widths.push_back(width);
}
if (!line.empty() && (line.back() == '\\' || line.back() == '/')) {
set_display(prompt);
replace_last(line.back());
is_special_char = true;
}
}
bool has_more = multiline_input;
if (is_special_char) {
replace_last(' ');
pop_cursor();
char last = line.back();
line.pop_back();
if (last == '\\') {
line += '\n';
fputc('\n', out);
has_more = !has_more;
} else {
// llama will just eat the single space, it won't act as a space
if (line.length() == 1 && line.back() == ' ') {
line.clear();
pop_cursor();
}
has_more = false;
}
} else {
if (end_of_stream) {
has_more = false;
} else {
line += '\n';
fputc('\n', out);
}
}
fflush(out);
return has_more;
}
static bool readline_simple(std::string & line, bool multiline_input) {
#if defined(_WIN32)
std::wstring wline;
if (!std::getline(std::wcin, wline)) {
// Input stream is bad or EOF received
line.clear();
GenerateConsoleCtrlEvent(CTRL_C_EVENT, 0);
return false;
}
int size_needed = WideCharToMultiByte(CP_UTF8, 0, &wline[0], (int)wline.size(), NULL, 0, NULL, NULL);
line.resize(size_needed);
WideCharToMultiByte(CP_UTF8, 0, &wline[0], (int)wline.size(), &line[0], size_needed, NULL, NULL);
#else
if (!std::getline(std::cin, line)) {
// Input stream is bad or EOF received
line.clear();
return false;
}
#endif
if (!line.empty()) {
char last = line.back();
if (last == '/') { // Always return control on '/' symbol
line.pop_back();
return false;
}
if (last == '\\') { // '\\' changes the default action
line.pop_back();
multiline_input = !multiline_input;
}
}
line += '\n';
// By default, continue input if multiline_input is set
return multiline_input;
}
bool readline(std::string & line, bool multiline_input) {
set_display(user_input);
if (simple_io) {
return readline_simple(line, multiline_input);
}
return readline_advanced(line, multiline_input);
}
}
@ggerganov I opened PR #9690. Please review it.
For a small RAG application I have written a Python wrapper that launches llama.cpp's main as a subprocess using subprocess.Popen() and communicates with it through two pipes (yes, I'm using the --simple-io option). Everything works fine, with one exception: if the line sent to main's stdin contains non-ASCII characters (e.g., Greek, Cyrillic, or even just Latin with accents or other diacritical marks), those characters, and only those, are received as garbled text (and interpreted by the model with a lot of imagination). Initially I thought I was doing something wrong, but then I discovered that exactly the same thing happens without my Python wrapper, when launching main at the command line and redirecting its stdin with a "main < file.txt" or "echo input_line | main" command:
Please also note the garbage in the following lines until main is killed with Ctrl-C, as if it hadn't noticed that the pipe was closed at the other end.
On the other hand, if the instruction is entered at the console prompt everything works as expected:
Any idea about how to fix this?
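For reference, the wrapper setup described above can be sketched as follows. This is not the actual wrapper: a child Python process stands in for main (a hypothetical substitute) and simply reports, as hex, the raw bytes it received on stdin. The point is only to check that a pipe created by subprocess.Popen() delivers the UTF-8 bytes unchanged, which would suggest that the garbling happens in how the reader interprets them.

```python
import subprocess
import sys

# Hypothetical stand-in for main: a child Python process that reports
# the raw bytes it received on stdin as a hex string.
child_code = "import sys; print(sys.stdin.buffer.readline().hex())"

proc = subprocess.Popen(
    [sys.executable, "-c", child_code],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
)

prompt = 'Translate "Σήμερον"\n'

# Encode explicitly so the bytes on the pipe do not depend on the locale.
stdout_bytes, _ = proc.communicate(prompt.encode("utf-8"))
received = bytes.fromhex(stdout_bytes.decode("ascii").strip())

print(received == prompt.encode("utf-8"))
```

Working with bytes on both ends of the pipe keeps the parent's locale settings out of the picture, so any remaining corruption has to come from the program reading stdin.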