clasp-developers / clasp

clasp Common Lisp environment
https://clasp-developers.github.io/
2.54k stars 142 forks source link

`parse-namestring` doesn't accept unicode #1595

Closed fosskers closed 1 week ago

fosskers commented 2 weeks ago

Describe the bug

If we inspect the string "/foo/bar/ゆびわ", we see:

Dimensions: (12)
Element type: CHARACTER
Total size: 12
Adjustable: NIL
Fill pointer: NIL
Contents:
0: #\/
1: #\f
2: #\o
3: #\o
4: #\/
5: #\b
6: #\a
7: #\r
8: #\/
9: #\HIRAGANA_LETTER_YU
10: #\HIRAGANA_LETTER_BI
11: #\HIRAGANA_LETTER_WA

Wonderful. However, if we attempt to include a unicode character in a pathname, like (parse-namestring "/foo/bar/ゆびわ"), the debugger opens and we're told:

Cannot coerce string "/foo/bar/†s" to a base-string

Somewhat cryptic. However, base-string is a hint, and the Clasp docs also mention:

Clasp supports Unicode by default. code-char and char-code work with Unicode codepoints. ... Type base-char includes only single byte characters, i.e. Basic Latin and Latin-1 Supplement.

So perhaps somewhere in the depths of parsing the path, characters are assumed to be non-Unicode and a conversion to base-string (probably an array of base-char?) is attempted.

Expected behavior It should be possible to contain Unicode characters with pathnames, as people in non-English-speaking countries often have Unicode characters in filepaths on their computers.

Actual behavior (shown above)

Note also that this occurs for #p literals as well (probably powered by parse-namestring underneath).

Bike commented 2 weeks ago

Yes, a base-string is an array of base-chars, and base-chars are one byte, Latin-1 or something. The coercion is done here: https://github.com/clasp-developers/clasp/blob/9585462edb338550ecc59154031d48c25ecaa87f/src/core/pathname.cc#L1349-L1351 This code is pretty old, dating back to when Unicode support was new and Clasp was called BRCL.

The underlying parser in clasp_parseNamestring doesn't seem to actually be limited to byte characters on first glance, so maybe this will be easy to fix.

Bike commented 2 weeks ago

Nope, not that simple. namestring improperly treats components as base strings for reasons I don't follow, and the actual OS functions do not seem to work with your ring pathname. Let me try to sort all this out, I hate having secret English-centric assumptions littered around.

fosskers commented 2 weeks ago

Let me try to sort all this out, I hate having secret English-centric assumptions littered around.

Perhaps an innocent "ASCII will do for now" kind of thought, the lingering effects of which we're seeing here. Thanks for tackling this.

fosskers commented 1 week ago

I will test your fix locally as well.

Bike commented 1 week ago

Please do. I think I got it working, but I might have missed some of the more obscure filesystem access functions.

fosskers commented 1 week ago

As far as my tests are concerned, your fixes work. Thank you!