Initialization in WSL failed.

RoggeOhta commented 1 month ago

OS: Ubuntu 20.04 in WSL. Log:

❯ curl -sSfL 'https://raw.githubusercontent.com/GaiaNet-AI/gaianet-node/main/install.sh' | bash -s -- --unprivileged

 ██████╗  █████╗ ██╗ █████╗ ███╗   ██╗███████╗████████╗
██╔════╝ ██╔══██╗██║██╔══██╗████╗  ██║██╔════╝╚══██╔══╝
██║  ███╗███████║██║███████║██╔██╗ ██║█████╗     ██║
██║   ██║██╔══██║██║██╔══██║██║╚██╗██║██╔══╝     ██║
╚██████╔╝██║  ██║██║██║  ██║██║ ╚████║███████╗   ██║
 ╚═════╝ ╚═╝  ╚═╝╚═╝╚═╝  ╚═╝╚═╝  ╚═══╝╚══════╝   ╚═╝

[+] Installing gaianet CLI tool ...
######################################################################## 100.0%
    * gaianet CLI tool is installed in /home/rogge/gaianet/gaianet

[+] Downloading default config file ...
    * Use the cached config file in /home/rogge/gaianet

[+] Installing WasmEdge with wasi-nn_ggml plugin ...
Info: Detected Linux-x86_64

No root permissions.
Installation path found at /home/rogge/.wasmedge
Removing /home/rogge/.wasmedge//bin/wasmedge
Removing /home/rogge/.wasmedge//bin/wasmedgec
Removing /home/rogge/.wasmedge//include/wasmedge/
Removing /home/rogge/.wasmedge//include/wasmedge/enum_configure.h
Removing /home/rogge/.wasmedge//include/wasmedge/version.h
Removing /home/rogge/.wasmedge//include/wasmedge/enum_errcode.h
Removing /home/rogge/.wasmedge//include/wasmedge/wasmedge.h
Removing /home/rogge/.wasmedge//include/wasmedge/int128.h
Removing /home/rogge/.wasmedge//include/wasmedge/enum_types.h
Removing /home/rogge/.wasmedge//include/wasmedge/enum.inc
Removing /home/rogge/.wasmedge//lib/libwasmedge.so.0.0.3
Removing /home/rogge/.wasmedge//lib/libwasmedge.so
Removing /home/rogge/.wasmedge//lib/libwasmedge.so.0
Removing /home/rogge/.wasmedge/plugin/libwasmedgePluginWasiNN.so
Removing /home/rogge/.wasmedge/env
Removing /home/rogge/.wasmedge/include/wasmedge
Removing /home/rogge/.wasmedge/bin
Removing /home/rogge/.wasmedge/lib
Removing /home/rogge/.wasmedge/plugin
Removing /home/rogge/.wasmedge/include
Removing /home/rogge/.wasmedge
Info: WasmEdge Installation at /home/rogge/.wasmedge

Info: Fetching WasmEdge-0.13.5

/tmp/wasmedge.91981 ~/gaianet
######################################################################## 100.0%
~/gaianet
Info: Fetching WasmEdge-GGML-Plugin

Info: Detected CUDA version: 12

/tmp/wasmedge.91981 ~/gaianet
######################################################################## 100.0%
~/gaianet
Installation of wasmedge-0.13.5 successful
WasmEdge binaries accessible
    * The wasmedge version 0.13.5 is installed in /home/rogge/.wasmedge/bin/wasmedge.

[+] Installing Qdrant binary...
    * Use the cached Qdrant binary in /home/rogge/gaianet/bin

[+] Downloading the rag-api-server.wasm ...
######################################################################## 100.0%
    * The rag-api-server.wasm is downloaded in /home/rogge/gaianet

[+] Downloading dashboard ...
    * Use the cached dashboard in /home/rogge/gaianet

[+] Generating node ID ...
    * Use the cached registry.wasm in /home/rogge/gaianet

    * Generate node ID
You already have a private key.

[+] Installing gaianet-domain...
    * Download gaianet-domain binary
######################################################################## 100.0%
      gaianet-domain is downloaded in /home/rogge/gaianet

    * Install frpc binary
      frpc binary is installed in /home/rogge/gaianet/bin

    * Download frpc.toml
      frpc.toml is downloaded in /home/rogge/gaianet/gaianet-domain

[+] COMPLETED! The gaianet node has been installed successfully.

Your node ID is 0xe934cdc2a0c31f9c410c99326bac37fbb240e580. Please register it in your portal account to receive awards!

>>> Next, you should initialize the GaiaNet node with the LLM and knowledge base. Run the command: gaianet init <<<

~/playgrounds/gaianet took 28s
❯ echo $?
0

~/playgrounds/gaianet
❯ gaianet init
[+] Checking the config.json file ...

[+] Downloading Phi-3-mini-4k-instruct-Q5_K_M.gguf ...
    * Using the cached Phi-3-mini-4k-instruct-Q5_K_M.gguf in /home/rogge/gaianet

[+] Downloading all-MiniLM-L6-v2-ggml-model-f16.gguf ...
    * Using the cached all-MiniLM-L6-v2-ggml-model-f16.gguf in /home/rogge/gaianet

[+] Creating 'default' collection in the Qdrant instance ...
    * Start a Qdrant instance ...

    * Remove the existed 'default' Qdrant collection ...

❯ echo $?
28

It seems like it may be a network issue?

juntao commented 1 month ago

It seems that qdrant did not start properly. Can you paste the contents from the following two files?

gaianet/log/init-qdrant.log

and

gaianet/log/start-qdrant.log
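
For anyone following along, a small helper like the one below can print the tail of both logs, or report that a file is missing. This is only a sketch: it assumes the node was installed in the default $HOME/gaianet directory (as in the install log above), and show_log is a hypothetical helper name, not part of the gaianet CLI.

```shell
# show_log FILE [LINES]: print the last LINES lines (default 50) of FILE,
# or a short notice if the file does not exist.
# show_log is a hypothetical helper, not a gaianet command.
show_log() {
  if [ -f "$1" ]; then
    tail -n "${2:-50}" "$1"
  else
    echo "missing: $1"
  fi
}

# Assumption: default install location, $HOME/gaianet.
log_dir="$HOME/gaianet/log"
show_log "$log_dir/init-qdrant.log"
show_log "$log_dir/start-qdrant.log"
```

A "missing" notice for start-qdrant.log is useful information in itself, since it suggests the start step never ran.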

RoggeOhta commented 1 month ago

I ran it again. This time the error is different.

❯ gaianet init
[+] Checking the config.json file ...

[+] Downloading Phi-3-mini-4k-instruct-Q5_K_M.gguf ...
################################################################################################################# 100.0%################################################################################################################# 100.0%
    * Phi-3-mini-4k-instruct-Q5_K_M.gguf is downloaded in /home/rogge/gaianet

[+] Downloading all-MiniLM-L6-v2-ggml-model-f16.gguf ...
################################################################################################################# 100.0%################################################################################################################# 100.0%
    * all-MiniLM-L6-v2-ggml-model-f16.gguf is downloaded in /home/rogge/gaianet

[+] Creating 'default' collection in the Qdrant instance ...
    * Start a Qdrant instance ...

    * Remove the existed 'default' Qdrant collection ...

    * Download Qdrant collection snapshot ...
################################################################################################################# 100.0%################################################################################################################# 100.0%
      The snapshot is downloaded in /home/rogge/gaianet

    * Import the Qdrant collection snapshot ...
      The process may take a few minutes. Please wait ...
    * [Error] Failed to recover from the collection snapshot. {"status":{"error":"Service internal error: Tokio task join error: task 1242 panicked"},"time":0.697784244}
    * [Error] Failed to recover from the collection snapshot. {"status":{"error":"Service internal error: Tokio task join error: task 1242 panicked"},"time":0.697784244}

Here is init-qdrant.log (attached). There is no start-qdrant.log.

juntao commented 1 month ago

As this line indicates, Qdrant ran out of memory during the import. How much RAM do you have on the WSL system? Thanks.

2024-05-20T07:24:52.900895Z ERROR qdrant::startup: Panic occurred in file /home/runner/.cargo/registry/src/index.crates.io-6f17d22bba15001f/cgroups-rs-0.3.4/src/memory.rs at line 587: called `Result::unwrap()` on an `Err` value: Error { kind: ReadFailed("/sys/fs/cgroup/memory.high"), cause: Some(Os { code: 2, kind: NotFound, message: "No such file or directory" }) }

RoggeOhta commented 1 month ago

I'm running WSL on a machine with 16 GB of physical memory. The memory available to WSL is shown below:

❯ free -mh
              total        used        free      shared  buff/cache   available
Mem:          7.6Gi       679Mi       3.5Gi        73Mi       3.4Gi       6.6Gi
Swap:          10Gi       4.0Mi         9Gi

RoggeOhta commented 1 month ago

I tried adding 10 GB of swap, but I still get the same error.

RoggeOhta commented 1 month ago

I also checked /sys/fs/cgroup, and there is no memory.high file:

/sys/fs/cgroup🔒
❯ ll
total 0
-r--r--r--  1 root root 0 May 20 15:50 cgroup.controllers
-rw-r--r--  1 root root 0 May 20 15:50 cgroup.max.depth
-rw-r--r--  1 root root 0 May 20 15:50 cgroup.max.descendants
-rw-r--r--  1 root root 0 May 20 15:50 cgroup.procs
-r--r--r--  1 root root 0 May 20 15:50 cgroup.stat
-rw-r--r--  1 root root 0 May 20 15:50 cgroup.subtree_control
-rw-r--r--  1 root root 0 May 20 15:50 cgroup.threads
-r--r--r--  1 root root 0 May 20 15:50 cpuset.cpus.effective
-r--r--r--  1 root root 0 May 20 15:50 cpuset.mems.effective
-r--r--r--  1 root root 0 May 20 15:50 cpu.stat
drwxr-xr-x  2 root root 0 May 20 15:50 init.scope
-r--r--r--  1 root root 0 May 20 15:50 io.stat
--w-------  1 root root 0 May 20 15:50 memory.reclaim
-r--r--r--  1 root root 0 May 20 15:50 memory.stat
-r--r--r--  1 root root 0 May 20 15:50 misc.capacity
drwxr-xr-x 43 root root 0 May 20 15:56 system.slice
drwxr-xr-x  3 root root 0 May 20 15:51 user.slice
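
The same check can be scripted. The sketch below reports the filesystem type mounted at /sys/fs/cgroup and whether memory.high exists there. It assumes GNU stat, where `stat -fc %T` prints cgroup2fs for a unified (v2-only) hierarchy and usually tmpfs for a v1 or hybrid layout. The function names are hypothetical.

```shell
# Hypothetical helpers for inspecting the cgroup mount; names are illustrative.

cgroup_fs_type() {
  # Print the filesystem type of the given path.
  # "cgroup2fs" indicates a unified (v2-only) hierarchy;
  # "tmpfs" usually indicates a v1 or hybrid layout.
  stat -fc %T "$1" 2>/dev/null || echo "unknown"
}

has_memory_high() {
  # Succeed if a memory.high file exists directly under the given directory.
  [ -e "$1/memory.high" ]
}

cgroup_report() {
  # Summarize both checks for a directory (default: /sys/fs/cgroup).
  dir="${1:-/sys/fs/cgroup}"
  echo "filesystem type: $(cgroup_fs_type "$dir")"
  if has_memory_high "$dir"; then
    echo "memory.high: present"
  else
    echo "memory.high: missing"
  fi
}

cgroup_report /sys/fs/cgroup
```

On a system that matches the listing above, this would be expected to report memory.high as missing, which is the file named in the Qdrant panic.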

RoggeOhta commented 1 month ago

I believe this issue from cgroups-rs might hint at the reason for the panic: https://github.com/kata-containers/cgroups-rs/issues/115

The author of the library says:

It is because the api only supports cgroup v1, while the systems are in v2.

RoggeOhta commented 1 month ago

I found the full solution and reason for this problem.

The reason: the cgroups-rs get_max_value API only supports cgroup v1, so it panics in a cgroup v2-only environment. By default, WSL2 uses both cgroup v1 and v2, but I had enabled the experimental autoMemoryReclaim feature, which automatically disables cgroup v1 and leaves only v2. That is what causes the problem.

Solution: consider supporting the cgroup v2 API, and remind WSL users to disable the autoMemoryReclaim feature.

juntao commented 1 month ago

Cool! Since Qdrant is upstream from us, I guess we have to do the second option. Can you send a screenshot that shows where this option is turned off? Thanks!

RoggeOhta commented 1 month ago

Of course. This feature is off by default, but here are the steps in case others accidentally turn it on and don't know how to switch it off.

Steps to turn this feature on or off:

1. Edit C:\Users\.wslconfig
2. Remove or comment out autoMemoryReclaim in the [experimental] section.
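
The steps above can also be done from inside WSL. This is a rough sketch, not an official procedure. It assumes .wslconfig sits in the Windows user profile folder (%UserProfile%), which WSL typically exposes under /mnt/c/Users/. The function names and the <WindowsUser> placeholder are hypothetical, and the script only handles lines written exactly as `autoMemoryReclaim=...`.

```shell
# Hypothetical helpers; the names are illustrative, not part of WSL or gaianet.

auto_reclaim_setting() {
  # Print the active (uncommented) autoMemoryReclaim line in the given file, if any.
  grep -E '^[[:space:]]*autoMemoryReclaim[[:space:]]*=' "$1"
}

disable_auto_reclaim() {
  # Comment out the active autoMemoryReclaim line, keeping a .bak backup copy.
  sed -i.bak -E 's/^([[:space:]]*autoMemoryReclaim[[:space:]]*=.*)$/# \1/' "$1"
}

# Example (run inside WSL; replace <WindowsUser> with your Windows account name):
#   cfg="/mnt/c/Users/<WindowsUser>/.wslconfig"
#   auto_reclaim_setting "$cfg" && disable_auto_reclaim "$cfg"
```

After changing the file, run `wsl --shutdown` from Windows and reopen the distribution so the new settings take effect, then rerun `gaianet init`.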

juntao commented 1 month ago

This is great. Thank you! I updated the docs and linked to your profile for acknowledgment.

https://docs.gaianet.ai/node-guide/troubleshooting/#fail-to-recover-from-collection-snapshot-on-windows-wsl

Please keep us updated about your progress!

GaiaNet-AI / gaianet-node

Initialization in WSL failed. #46