NixOS / nixpkgs

Nix Packages collection & NixOS
MIT License

dyalog: isolates (parallel forEach and co) appear to be broken #316439

Open bolives-hax opened 1 month ago

bolives-hax commented 1 month ago

Describe the bug

Isolates do not work: the parallel each (isolate.llEach) never returns a result.

Steps To Reproduce

Steps to reproduce the behavior: run nix develop with

        devShells.default = with pkgs; mkShell rec {
          buildInputs = [
            (unfreePkgs.dyalog.override {
              acceptLicense = true;
            })
          ];
        };

Then run dyalog and, in the Dyalog REPL, type:

)load isolate
⎕DL isolate.llEach (1 2 3)

Expected behavior

⎕DL isolate.llEach (1 2 3) should return

1.00383 2.003841 3.004522

as is the case when using the .deb on, say, a Debian host.

Additional context

The strace log shows many lines like

clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=10000000}, NULL) = 0
.... (hundreds of times)
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=10000000}, NULL) = 0

which does not happen on a Debian host when I install Dyalog 19 via dpkg -i.

I also tested this on a Debian host that has Nix installed, installing Dyalog via Nix instead of dpkg, and ran into the same issue. This makes me believe something is wrong with the derivation.

Notify maintainers

@TomaSajt @markus1189

Metadata

system: "x86_64-linux", multi-user?: yes, version: nix-env (Nix) 2.18.2, nixpkgs: /etc/channels/nixpkgs

(I ran it against the current nixpkgs master and unstable branches, both of which I pulled about 1 hour ago.)

TomaSajt commented 1 month ago

I'll see what I can do about this. In the meantime, could you change the issue title to start with dyalog:?

bolives-hax commented 1 month ago

sure

TomaSajt commented 1 month ago

Phew, finally found it! The issue was with the APLProcess class, which auto-crashed when created. I redirected its output to a file, and it turns out that because I didn't patch dyalog.rt to require ncurses it wasn't able to load it, thus erroring out.

bolives-hax commented 1 month ago

This is awesome, thanks for the incredibly fast response and fix. I suspected it was something like this. I initially thought that maybe some of the libraries weren't copied over, since the Nix expression is a little pickier than just cp *.so $out/lib, but fixing that didn't really change anything. Usually it's quite clear when to apply patchelf, as software will let you know if a library is missing, so running ldd against it generally gives pretty good insight.

If you don't mind sharing, how did you find out? Was it just intuition, or did you use some sort of tool to debug it and conclude that this might be the issue? I couldn't really get hold of any logs that made sense when I tried. This would be quite interesting to me, as I will most likely be fixing more Dyalog stuff in the future.

TomaSajt commented 1 month ago

First of all, I somehow got my hands on the aplcore file, which lists exactly where it crashed. I looked through it and found the relevant definitions in the namespace explorer. It looked like it was supposed to start some processes and then connect to them, but they exited immediately. I was able to verify this by printing procs.HasExited in isolate.ynys.InitProcesses. procs is an array of new APLProcess objects, so I looked up the docs, and it turns out you can specify an OutFile, which sets where a process should dump its logs (it's the 5th parameter of ⎕NEW APLProcess). I overrode that to point to a known location. That file contained the following:

Unable to load terminfo library:
libncurses.so.5: cannot open shared object file: No such file or directory
Without it, the tty version of Dyalog APL will not function properly.
Please install libtinfo or libncurses.

Dyalog APL could not initialise.

In 18.2 libncurses was required by the binary, so autoPatchelfHook took care of it. However, in 19.0 they probably changed it to load libncurses dynamically using dlopen, which autoPatchelfHook can't auto-detect. I had to use --add-needed on the normal dyalog binary to make it work, but I didn't know what dyalog.rt was, so I didn't do anything with it.
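
For anyone hitting the same class of problem, the dlopen workaround described above can be sketched roughly as follows. This is a hypothetical fragment, not the actual nixpkgs dyalog expression: the package name, the src directory, and the example and example.rt binaries are all placeholders, and it assumes that ncurses5 provides libncurses.so.5. The idea is to add the library as an explicit NEEDED entry before the fixup phase, so that autoPatchelfHook can then resolve it like any other dependency.

```nix
# Hypothetical sketch: a prebuilt program whose binaries dlopen() libncurses.so.5.
# None of these names come from the real dyalog derivation.
{ stdenv, autoPatchelfHook, patchelf, ncurses5 }:

stdenv.mkDerivation {
  pname = "example-prebuilt";   # placeholder name
  version = "0";
  src = ./example-prebuilt;     # placeholder directory containing bin/example and bin/example.rt

  nativeBuildInputs = [ autoPatchelfHook patchelf ];

  # autoPatchelfHook looks in buildInputs for libraries listed as NEEDED.
  buildInputs = [ ncurses5 ];

  installPhase = ''
    runHook preInstall
    mkdir -p $out/bin
    cp bin/example bin/example.rt $out/bin/
    runHook postInstall
  '';

  # A dlopen() call leaves no NEEDED entry in the ELF header, so the hook
  # cannot see the dependency. Adding the entry by hand, for every binary
  # that loads the library (not only the main one), makes it visible.
  preFixup = ''
    for exe in example example.rt; do
      patchelf --add-needed libncurses.so.5 "$out/bin/$exe"
    done
  '';

  # Alternative, also an assumption about what fits here: autoPatchelfHook's
  # runtimeDependencies attribute adds the listed packages to the runpath of
  # every executable, which is intended for dlopen()-style loading.
  # runtimeDependencies = [ ncurses5 ];
}
```

The loop is the part that matters for this issue: a runtime-only binary such as dyalog.rt needs the same treatment as the main executable, otherwise child processes fail while the interactive session works.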


Note: in 19.0 a new binary named dyalogc appeared. I don't know where it's used, so I don't know what I should do with it. I might have to ask around about it.