Some Unicode Characters not working since Netatalk 4.x

andylemin commented 3 weeks ago

Describe the bug

Since Netatalk 4.x, some Japanese characters are not handled correctly. Users cannot open documents whose filenames contain characters such as だ゙.

Initially discovered with Netatalk 4.0.2 on FreeBSD 14.1 with Ventura clients. After the upgrade to 4.0.2, existing files and folders on the AFP share produce permission errors. After removing the problematic characters (on the server console via SSH), the files become accessible to clients through the share again. Clients cannot rename the problematic filenames over the AFP share.

When reproducing with Netatalk 4.0.2 on Ubuntu and a Ventura 13.7 client, opening a file with one of the problematic Japanese characters in its name results in "The document “test だ゙.txt” could not be opened."

To Reproduce

Build Netatalk 4.0.2 on Linux or FreeBSD using the latest instructions at https://netatalk.io/4.0/htmldocs/compile

Use the default afp.conf and add a test share.

1) When trying to save a file to the AFP share from a macOS client with a character like だ゙ in the name, you get the error: The document “Untitled” could not be saved as “testだ゙.txt”. The file doesn’t exist.

2) Create a file with this problematic character in its name in the shared folder on the server over SSH. The file is created fine, but clients get an error when they try to open it.

user@ubuntu:/test$ touch test2だ゙.txt

The file is created fine on the server. When trying to open it on the client, you get the error: The document “test2だ゙.txt” could not be opened. The file doesn’t exist.

The FreeBSD install requires downloading UnicodeData.txt from https://www.unicode.org/Public/UNIDATA/UnicodeData.txt and the Ubuntu install requires installing the unicode-data package with apt. Both have the same problem. After downgrading to Netatalk 3.x, all inaccessible files and folders become accessible again.

NB: This does not affect all Unicode characters; only specific characters are impacted. It is unknown how many are affected since 4.x.

The example used for testing is だ゙. I can provide more problematic characters if required.

Expected behavior

In Netatalk 2.x and 3.x, all Japanese characters can be used in file and folder names, and all files and folders can be accessed and opened without issue. Netatalk 4.x should also support all Unicode characters.

Environment

Server OS: Ubuntu 24.04.1 LTS, FreeBSD 14.1
Client OS: macOS Ventura
Netatalk Version: 4.0.2

Logs

afp.log at maxdebug log level. The log shows the server starting, one test client connecting to the 'test' share and trying to open the test2だ゙.txt file, and the client failing with: The document “test2だ゙.txt” could not be opened. The server process is then stopped.

Additional context

Netatalk does not crash.

andylemin commented 3 weeks ago

Tried with both the default and MySQL CNID backends. No change in the fault. A filtered log is attached, produced with:

cat /var/log/afp.log | grep test2 > /var/log/afp.log.test2.log

The macOS Finder client tried to open test2だ゙.txt just once. The log seems to repeat the same messages over and over, with no obvious error pointing to the cause.

afp.log.test2.log

Thanks

andylemin commented 3 weeks ago

More examples of Unicode characters, from several languages, which do not work over Netatalk AFP: ï, ѓ, じ, ど, パ, ブ, プ
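
For reference, each of the characters listed above has a canonical decomposition into a base letter plus a combining mark. The sketch below prints both forms using Python's standard unicodedata module. It assumes the precomposed (NFC) code point for each character; the names in the affected share may be stored in either form, and that is not known from this thread.

```python
import unicodedata

# Precomposed (NFC) code points for the characters reported in this thread.
# Assumption: written as escapes because the exact form used in the affected
# filenames is not known.
REPORTED = ["\u00ef", "\u0453", "\u3058", "\u3069", "\u30d1", "\u30d6", "\u30d7"]


def code_points(text):
    """Return the code points of text as 'U+XXXX' strings."""
    return [f"U+{ord(ch):04X}" for ch in text]


def decomposition_report(chars):
    """Map each character to the code points of its NFC and NFD forms."""
    report = {}
    for ch in chars:
        nfc = unicodedata.normalize("NFC", ch)
        nfd = unicodedata.normalize("NFD", ch)
        report[ch] = (code_points(nfc), code_points(nfd))
    return report


if __name__ == "__main__":
    for ch, (nfc, nfd) in decomposition_report(REPORTED).items():
        print(ch, "NFC:", " ".join(nfc), "NFD:", " ".join(nfd))
```

Every entry comes out as one code point in NFC and two in NFD, so all of the reported characters depend on composition and decomposition data.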

rdmark commented 3 weeks ago

@andylemin Were you able to get around to building netatalk4 with v14 of UnicodeData.txt as we discussed, to see if that was the breaking point?

It should just be a matter of downloading https://www.unicode.org/Public/14.0.0/ucd/UnicodeData.txt and then pointing -Dwith-unicode-data-path to the directory where you put it...

andylemin commented 3 weeks ago

That's the next step for this weekend. Wanted to validate scope of issue first

NJRoadfan commented 3 weeks ago

I suspect this is a problem with decomposed Unicode characters somewhere.
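
To illustrate the suspicion above: the same visible name can be stored as different code point sequences, and an exact string lookup only succeeds when both sides use the same form. This is a minimal sketch using Python's unicodedata module, not Netatalk code; the filename and the dictionary standing in for a directory listing are hypothetical.

```python
import unicodedata

# Hypothetical filename in two canonically equivalent spellings.
PRECOMPOSED = "test\u3069.txt"        # single precomposed code point
DECOMPOSED = "test\u3068\u3099.txt"   # base letter + combining dakuten


def same_name(a, b):
    """Compare two names after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def lookup(listing, name):
    """Find name in listing, tolerating differences in normalization form."""
    for stored in listing:
        if same_name(stored, name):
            return listing[stored]
    return None


if __name__ == "__main__":
    listing = {PRECOMPOSED: "file contents"}   # stand-in for a directory
    print(DECOMPOSED in listing)               # exact comparison misses the file
    print(lookup(listing, DECOMPOSED))         # normalized comparison finds it
```

If the conversion between the two forms were missing or incomplete on the server, a request for one spelling would not match a file stored under the other, which would be consistent with the "file doesn't exist" errors reported here.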

I have not been able to replicate the problem with my testing server running 4.0.2. Client is macOS Catalina. Since I don't have any sort of IME setup, I was cutting and pasting the "bad" characters from this page into file names. Tried creating on server first then reading from client. Copied files to/from the server, etc. All seem to work with no issues reading or writing the file.

One side note: Firefox seems to have issues rendering だ゙. A diacritic appears to float to the far right of the character, usually over the "x" in .txt!

rdmark commented 3 weeks ago

@andylemin Can you please confirm the contents of, for example, the generated libatalk/unicode/precompose.c source file when you get this issue? I found one potential fail state where the path to UnicodeData.txt doesn't resolve properly and the Perl script generates empty tables like this:

static const struct {
  unsigned int replacement;
  unsigned int base;
  unsigned int comb;
} precompositions[] = {
};

static const struct {
  unsigned int replacement;
  unsigned int base;
  unsigned int comb;
} decompositions[] = {
};

static const struct {
  unsigned int replacement_sp;
  unsigned int base_sp;
  unsigned int comb_sp;
} precompositions_sp[] = {
};

static const struct {
  unsigned int replacement_sp;
  unsigned int base_sp;
  unsigned int comb_sp;
} decompositions_sp[] = {
};

I doubt this is the exact problem you're having, because without the precompose tables, interaction with any Unicode character causes errors. But maybe we can get a hint to what's going on by looking at the generated sources.
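
A quick way to check for the empty-table fail state shown above is to scan the generated source for array definitions with an empty initializer. The sketch below does this with a regular expression. The function name is made up for illustration, and it takes the file contents as a string so that no particular file name or location is assumed.

```python
import re

# Matches "} name[] = { };" with nothing but whitespace between the braces.
EMPTY_TABLE = re.compile(r"\}\s*(\w+)\s*\[\]\s*=\s*\{\s*\}\s*;")


def find_empty_tables(source_text):
    """Return the names of arrays that are defined with an empty initializer."""
    return EMPTY_TABLE.findall(source_text)


if __name__ == "__main__":
    sample = """
static const struct {
  unsigned int replacement;
  unsigned int base;
  unsigned int comb;
} precompositions[] = {
};
"""
    print(find_empty_tables(sample))
```

An empty result means every table in the scanned text has at least one entry; it says nothing about whether the entries are correct.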

Apart from this scenario, I haven't been able to reproduce exactly the bug you're seeing...

rdmark commented 3 weeks ago

@andylemin I've made some improvements in https://github.com/Netatalk/netatalk/pull/1692 that should at least prevent the issue where the code generation fails silently. Again, it is unlikely that this solves your issue, but please try the latest main code just in case the added error handling catches another corner case.

andylemin commented 3 weeks ago

Hi. Ok so some interesting findings to share;

4.0.2 - Unicode 16 - NO -Dwith-unicode-data-path - Characters Fail 🚫 - NO libatalk/unicode/precompose.c
4.0.2 - Unicode 14 - NO -Dwith-unicode-data-path - Characters Fail 🚫 - NO libatalk/unicode/precompose.c

NB: In both of the above tests, the configure step says "Using Unicode Character Database: UnicodeData.txt", indicating it is finding the downloaded UnicodeData.txt (in the source base path) even though -Dwith-unicode-data-path is not set.

4.0.2 - Unicode 16 - WITH -Dwith-unicode-data-path - Characters Succeed 👍 - But still NO libatalk/unicode/precompose.c
4.0.2 - Unicode 14 - WITH -Dwith-unicode-data-path - Characters Succeed 👍 - But still NO libatalk/unicode/precompose.c

git clone (last commit log: Shore up Unicode char table script error handling and detection):

main - Unicode 16 - NO -Dwith-unicode-data-path - Characters Succeed 👍 - NO libatalk/unicode/precompose.c after meson compile -C build
main - Unicode 16 - WITH -Dwith-unicode-data-path - Characters Succeed 👍 - NO libatalk/unicode/precompose.c after meson compile -C build

Observations: The Unicode version is not related. Setting -Dwith-unicode-data-path seems to fix it in the last release, even though the configure output says it finds UnicodeData.txt. In the current git HEAD, the issue seems fixed with or without -Dwith-unicode-data-path. I have never seen libatalk/unicode/precompose.c generated in any of the tests.

So seems the issue was (4.0.2) related to -Dwith-unicode-data-path being required in spite of positive configure message. You have fixed it in HEAD such that -Dwith-unicode-data-path is no longer required. I wonder why libatalk/unicode/precompose.c has never once been successfully generated.

PS; Just to confirm, when rebuilding for each test I am using meson setup --reconfigure build each time before building. I am not using a clean source tree as --reconfigure seems to be enough

NJRoadfan commented 3 weeks ago

There should only be a file called precompose.h generated by make-precompose.h.pl. Additionally, utf16_case.c and utf16_casetable.h are generated by make-casetable.pl.
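
For context on what such a generator consumes: each line of UnicodeData.txt is a semicolon-separated record whose first field is the code point and whose sixth field is the decomposition mapping. The sketch below extracts canonical two-code-point decompositions from records in that format. It only illustrates the file format and is not the logic of make-precompose.h.pl; the function name is hypothetical.

```python
def parse_canonical_pairs(lines):
    """Yield (composed, base, combining) tuples from UnicodeData.txt lines.

    Only canonical decompositions into exactly two code points are kept.
    Compatibility mappings, which start with a tag such as <compat>, are skipped.
    """
    for line in lines:
        fields = line.strip().split(";")
        if len(fields) < 6 or not fields[5] or fields[5].startswith("<"):
            continue
        parts = fields[5].split()
        if len(parts) != 2:
            continue
        yield int(fields[0], 16), int(parts[0], 16), int(parts[1], 16)


if __name__ == "__main__":
    # Sample records in the UnicodeData.txt format.
    sample = [
        "3060;HIRAGANA LETTER DA;Lo;0;L;305F 3099;;;;N;;;;;",
        "00EF;LATIN SMALL LETTER I WITH DIAERESIS;Ll;0;L;0069 0308;;;;N;LATIN SMALL LETTER I DIAERESIS;;00CF;;00CF",
        "00A0;NO-BREAK SPACE;Zs;0;CS;<noBreak> 0020;;;;N;NON-BREAKING SPACE;;;;",
    ]
    for composed, base, comb in parse_canonical_pairs(sample):
        print(f"0x{composed:04X} = 0x{base:04X} + 0x{comb:04X}")
```

If the input file cannot be found or read, a loop like this yields nothing, which matches the empty tables shown earlier in the thread.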

rdmark commented 3 weeks ago

> So seems the issue was (4.0.2) related to -Dwith-unicode-data-path being required in spite of positive configure message. You have fixed it in HEAD such that -Dwith-unicode-data-path is no longer required. I wonder why libatalk/unicode/precompose.c has never once been successfully generated.

Thanks for the thorough testing. Yes we had a bug in 4.0.2 where Meson itself could find UnicodeData.txt but the Perl script couldn't because of relative path shenanigans... What I do now is to always prepend the absolute path to the source dir whenever you give it a relative dir. I also added the Netatalk source dir to the list of dirs to look for UnicodeData.txt.
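
The idea described above, anchoring a relative directory to the source directory so that every tool resolves it the same way, can be sketched as follows. This is a Python illustration of the idea only, not the project's build code; the function name and the example paths are hypothetical.

```python
import posixpath


def resolve_data_dir(user_dir, source_dir):
    """Return a normalized absolute path for user_dir.

    Absolute paths are kept; relative paths are anchored at source_dir.
    """
    if posixpath.isabs(user_dir):
        return posixpath.normpath(user_dir)
    return posixpath.normpath(posixpath.join(source_dir, user_dir))


if __name__ == "__main__":
    # Hypothetical locations, chosen only for the demonstration.
    print(resolve_data_dir("unicode", "/home/user/netatalk"))
    print(resolve_data_dir("/usr/share/unicode", "/home/user/netatalk"))
```

Resolving the path once and passing the absolute result along avoids the situation where one tool finds the file relative to its own working directory while another does not.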

> PS; Just to confirm, when rebuilding for each test I am using meson setup --reconfigure build each time before building. I am not using a clean source tree as --reconfigure seems to be enough

I think this should be safe. I personally always do git clean -dfx && git reset --hard between tries to have a completely clean slate.

Netatalk / netatalk

Some Unicode Characters not working since Netatalk 4.x #1667