Alluxio / alluxio

Alluxio, data orchestration for analytics and machine learning in the cloud
https://www.alluxio.io
Apache License 2.0

aarch64 platform alluxio fuse mount successful but can not be accessed: Input/output error #17135

Closed michael1589 closed 1 year ago

michael1589 commented 1 year ago

Alluxio Version: release-2.9.0 branch; reported version 2.9.1-SNAPSHOT-7ce08597a431686ebc0c2d45c2d152963179624e

Describe the bug I deployed an Alluxio cluster (1 Alluxio master, 3 Alluxio workers, 3 Alluxio FUSE pods) on a 3-node k8s cluster. Two of the nodes are x86-64 machines and the third is aarch64. Alluxio FUSE works fine on x86-64 but fails on aarch64: the mount appears to succeed, but the mount point cannot be accessed.

[root@alluxio-fuse-txz4v alluxio-2.9.1-SNAPSHOT]# /opt/alluxio/integration/fuse/bin/alluxio-fuse stat
pid mount_point alluxio_path
149712  /data/alluxio   /
[root@alluxio-fuse-txz4v alluxio-2.9.1-SNAPSHOT]# ls -lhrt /data/alluxio
ls: cannot access /data/alluxio: Input/output error
[root@alluxio-fuse-txz4v alluxio-2.9.1-SNAPSHOT]# ls -lhrt /data        
ls: cannot access /data/alluxio: Input/output error
total 0
d????????? ? ?    ?     ?            ? alluxio
drwxr-xr-x 2 root root  6 Mar 13 07:58 efk-es3
drwxr-xr-x 2 root root  6 Mar 13 07:58 efk-es2
drwxr-xr-x 2 root root  6 Mar 13 07:58 efk-es1
drwxr-xr-x 2 root root  6 Mar 13 07:58 {alluxio-domain}
drwxr-xr-x 2 root root  6 Mar 13 07:58 minio5
drwxr-xr-x 2 root root  6 Mar 13 07:58 minio4
drwxr-xr-x 2 root root  6 Mar 13 07:58 minio2
drwxr-xr-x 2 root root  6 Mar 13 07:58 kafka2
drwxr-xr-x 2 root root  6 Mar 13 07:58 kafka1
drwxr-xr-x 2 root root  6 Mar 13 07:58 zk2
drwxr-xr-x 2 root root  6 Mar 13 07:58 zk1
drwxr-xr-x 2 root root  6 Mar 13 07:58 clickhouse3
drwxr-xr-x 2 root root  6 Mar 13 07:58 clickhouse2
drwxr-xr-x 2 root root  6 Mar 13 07:58 clickhouse1
drwxr-xr-x 6 root root 66 Mar 13 08:13 minio1
drwxr-xr-x 3 root root 22 Mar 13 08:28 k8s
drwxrwsr-x 4 root root 30 Mar 13 08:39 zk3
drwxrwsr-x 3 root root 24 Mar 13 08:40 kafka3
drwxr-xr-x 6 root root 66 Mar 20 07:32 minio6
drwxr-xr-x 6 root root 66 Mar 20 07:32 minio3
drwxr-xr-x 2 root root 50 Mar 23 03:08 alluxio-domain
[root@alluxio-fuse-txz4v alluxio-2.9.1-SNAPSHOT]# /opt/alluxio/bin/alluxio fs ls /
drwx------  root           root                         0       PERSISTED 03-23-2023 03:19:14:710  DIR /.alluxio_s3_api_metadata
drwxr-xr-x  root           root                         0   NOT_PERSISTED 03-23-2023 03:34:37:536  DIR /people
drwx------  root           root                         0       PERSISTED 03-23-2023 03:19:14:757  DIR /publish-data
[root@alluxio-fuse-txz4v alluxio-2.9.1-SNAPSHOT]# uname -a
Linux alluxio-fuse-txz4v 4.14.0-115.el7a.0.1.aarch64 #1 SMP Sun Nov 25 20:54:21 UTC 2018 aarch64 aarch64 aarch64 GNU/Linux
[root@alluxio-fuse-txz4v logs]# vim fuse.log 
2023-03-23 03:38:37,520 INFO  AlluxioFuse - Alluxio version: 2.9.1-SNAPSHOT-7ce08597a431686ebc0c2d45c2d152963179624e
2023-03-23 03:38:37,827 INFO  AlluxioFuse - Set fuse mount point options as allow_other from command line input
2023-03-23 03:38:38,027 INFO  NettyUtils - EPOLL_MODE is available
2023-03-23 03:38:38,709 INFO  MetricsSystem - Starting sinks with config: {}.
2023-03-23 03:38:38,722 INFO  TieredIdentityFactory - Initialized tiered identity TieredIdentity(node=172.16.0.58, rack=null)
2023-03-23 03:38:38,729 INFO  NativeLibraryLoader - loadLibraryFromJarToTemp params: libjnifuse.so,libjnifusejni-linux-aarch64.so,null
2023-03-23 03:38:38,730 INFO  NativeLibraryLoader - temp file: /tmp/libjnifuse157809989683088845.so
2023-03-23 03:38:38,730 INFO  NativeLibraryLoader - libjnifuse.so resource url: jar:file:/opt/alluxio-2.9.1-SNAPSHOT/integration/fuse/alluxio-fuse-2.9.1-SNAPSHOT.jar!/libjnifuse.so
2023-03-23 03:38:38,733 INFO  NativeLibraryLoader - libPath /tmp/libjnifuse157809989683088845.so
2023-03-23 03:38:38,733 INFO  NativeLibraryLoader - Loaded lib by jar from path /tmp/libjnifuse157809989683088845.so.
2023-03-23 03:38:38,734 INFO  NativeLibraryLoader - Loaded libjnifuse with libfuse version 2.
2023-03-23 03:38:38,734 INFO  AlluxioFuse - Added fuse mount option -obig_writes to enlarge single write request size
2023-03-23 03:38:39,091 INFO  Reflections - Reflections took 217 ms to scan 1 urls, producing 64 keys and 214 values
2023-03-23 03:38:39,132 INFO  LaunchUserGroupAuthPolicy - Initialized Fuse auth policy with launch user (id:0) and group (id:0)
2023-03-23 03:38:39,152 INFO  AlluxioFuse - Mounting AlluxioJniFuseFileSystem: mount point="/data/alluxio", OPTIONS="-oallow_other,-obig_writes"
2023-03-23 03:38:39,152 INFO  AbstractFuseFileSystem - Mounting /data/alluxio: blocking=true, debug=true, fuseOpts="[-oallow_other, -obig_writes]"
2023-03-23 04:30:23,057 DEBUG AlluxioJniFuseFileSystem - Enter: Fuse.Getattr(path=/)
2023-03-23 04:30:23,390 DEBUG AlluxioJniFuseFileSystem - Exit (0): Fuse.Getattr(path=/) in 333 ms

To Reproduce Compile release-2.9.0 separately on amd64 and arm64, then build the Dockerfile under integration/docker (modifying the alluxio-extractor stage for each architecture).

Run this command in an Alluxio FUSE pod:

bash -x /entrypoint.sh fuse --fuse-opts=allow_other /data/alluxio /

The output is:

+ exec integration/fuse/bin/alluxio-fuse mount -n -o allow_other /data/alluxio /
Path /data/alluxio is not mounted
Starting AlluxioFuse process: mounting alluxio path "/" to local mount point "/data/alluxio"
OpenJDK 64-Bit Server VM warning: If the number of processors is expected to increase from one, then you should configure the number of parallel GC threads appropriately using -XX:ParallelGCThreads=N
2023-03-23 11:09:58,043 INFO  AlluxioFuse - Alluxio version: 2.9.1-SNAPSHOT-a2a72ee6fb0f1695d8b3aae30b3037ed2c9c9bbf
2023-03-23 11:09:58,382 INFO  AlluxioFuse - Set fuse mount point options as allow_other from command line input
2023-03-23 11:09:58,617 INFO  NettyUtils - EPOLL_MODE is available
2023-03-23 11:09:59,367 INFO  MetricsSystem - Starting sinks with config: {}.
2023-03-23 11:09:59,381 INFO  TieredIdentityFactory - Initialized tiered identity TieredIdentity(node=172.16.0.58, rack=null)
2023-03-23 11:09:59,389 INFO  NativeLibraryLoader - loadLibraryFromJarToTemp params: libjnifuse.so,libjnifusejni-linux-aarch64.so,null
2023-03-23 11:09:59,389 INFO  NativeLibraryLoader - temp file: /tmp/libjnifuse4217764342344687918.so
2023-03-23 11:09:59,392 INFO  NativeLibraryLoader - libPath /tmp/libjnifuse4217764342344687918.so
INFO ../../src/main/native/libjnifuse/jnifuse_onload.cc:25 Loaded libjnifuse
2023-03-23 11:09:59,393 INFO  NativeLibraryLoader - Loaded lib by jar from path /tmp/libjnifuse4217764342344687918.so.
2023-03-23 11:09:59,393 INFO  NativeLibraryLoader - Loaded libjnifuse with libfuse version 2.
2023-03-23 11:09:59,393 INFO  AlluxioFuse - Added fuse mount option -obig_writes to enlarge single write request size
2023-03-23T11:09:59.430+0000: 1.698: [GC pause (Metadata GC Threshold) (young) (initial-mark), 0.0528268 secs]
   [Parallel Time: 42.4 ms, GC Workers: 1]
      [GC Worker Start (ms):  1698.5]
      [Ext Root Scanning (ms):  9.5]
      [Update RS (ms):  0.0]
         [Processed Buffers:  0]
      [Scan RS (ms):  0.1]
      [Code Root Scanning (ms):  3.7]
      [Object Copy (ms):  28.9]
      [Termination (ms):  0.0]
         [Termination Attempts:  1]
      [GC Worker Other (ms):  0.0]
      [GC Worker Total (ms):  42.2]
      [GC Worker End (ms):  1740.7]
   [Code Root Fixup: 0.1 ms]
   [Code Root Purge: 0.0 ms]
   [Clear CT: 0.1 ms]
   [Other: 10.2 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 9.1 ms]
      [Ref Enq: 0.3 ms]
      [Redirty Cards: 0.0 ms]
      [Humongous Register: 0.0 ms]
      [Humongous Reclaim: 0.0 ms]
      [Free CSet: 0.2 ms]
   [Eden: 54272.0K(102.0M)->0.0B(91136.0K) Survivors: 0.0B->13312.0K Heap: 54272.0K(2048.0M)->13916.5K(2048.0M)]
 [Times: user=0.06 sys=0.00, real=0.05 secs] 
2023-03-23T11:09:59.483+0000: 1.751: [GC concurrent-root-region-scan-start]
2023-03-23T11:09:59.500+0000: 1.768: [GC concurrent-root-region-scan-end, 0.0168271 secs]
2023-03-23T11:09:59.500+0000: 1.768: [GC concurrent-mark-start]
2023-03-23T11:09:59.503+0000: 1.771: [GC concurrent-mark-end, 0.0031790 secs]
2023-03-23T11:09:59.504+0000: 1.772: [GC remark 2023-03-23T11:09:59.504+0000: 1.772: [Finalize Marking, 0.0002895 secs] 2023-03-23T11:09:59.504+0000: 1.773: [GC ref-proc, 0.0003228 secs] 2023-03-23T11:09:59.505+0000: 1.773: [Unloading, 0.0159694 secs], 0.0170953 secs]
 [Times: user=0.02 sys=0.00, real=0.01 secs] 
2023-03-23T11:09:59.521+0000: 1.789: [GC cleanup 15454K->15454K(2048M), 0.0070525 secs]
 [Times: user=0.00 sys=0.00, real=0.01 secs] 
2023-03-23 11:09:59,761 INFO  Reflections - Reflections took 226 ms to scan 1 urls, producing 64 keys and 214 values 
2023-03-23 11:09:59,812 INFO  LaunchUserGroupAuthPolicy - Initialized Fuse auth policy with launch user (id:0) and group (id:0)
2023-03-23 11:09:59,859 INFO  AlluxioFuse - Mounting AlluxioJniFuseFileSystem: mount point="/data/alluxio", OPTIONS="-oallow_other,-obig_writes"
2023-03-23 11:09:59,860 INFO  AbstractFuseFileSystem - Mounting /data/alluxio: blocking=true, debug=true, fuseOpts="[-oallow_other, -obig_writes]"
INFO ../../src/main/native/libjnifuse/jnifuse_helper.cc:33 Start initializing JNIFuse
ERROR ../../src/main/native/libjnifuse/jnifuse_helper.cc:34 Validate standard errors can be logged as expected
FUSE library version: 2.9.5
nullpath_ok: 0
nopath: 0
utime_omit_ok: 0
fuse: max_idle_threads: 64
unique: 1, opcode: INIT (26), nodeid: 0, insize: 56, pid: 0
fuse: max_idle_threads: 64
INIT: 7.26
flags=0x001ffffb
max_readahead=0x00020000
   INIT: 7.19
   flags=0x00000039
   max_readahead=0x00020000
   max_write=0x00020000
   max_background=0
   congestion_threshold=0
   unique: 1, success, outsize: 40

unique: 2, opcode: GETATTR (3), nodeid: 1, insize: 56, pid: 15835
getattr /
2023-03-23 11:12:08,817 DEBUG AlluxioJniFuseFileSystem - Enter: Fuse.Getattr(path=/)
2023-03-23 11:12:09,164 DEBUG AlluxioJniFuseFileSystem - Exit (0): Fuse.Getattr(path=/) in 347 ms
   unique: 2, success, outsize: 120
2023-03-23T11:15:58.487+0000: 360.755: [GC pause (G1 Evacuation Pause) (young), 0.0517934 secs]
   [Parallel Time: 42.7 ms, GC Workers: 1]
      [GC Worker Start (ms):  360755.7]
      [Ext Root Scanning (ms):  7.8]
      [Update RS (ms):  0.2]
         [Processed Buffers:  1]
      [Scan RS (ms):  0.2]
      [Code Root Scanning (ms):  5.7]
      [Object Copy (ms):  28.8]
      [Termination (ms):  0.0]
         [Termination Attempts:  1]
      [GC Worker Other (ms):  0.0]
      [GC Worker Total (ms):  42.6]
      [GC Worker End (ms):  360798.3]
   [Code Root Fixup: 0.1 ms]
   [Code Root Purge: 0.0 ms]
   [Clear CT: 0.1 ms]
   [Other: 8.8 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 8.0 ms]
      [Ref Enq: 0.1 ms]
      [Redirty Cards: 0.0 ms]
      [Humongous Register: 0.1 ms]
      [Humongous Reclaim: 0.0 ms]
      [Free CSet: 0.3 ms]
   [Eden: 91136.0K(91136.0K)->0.0B(92160.0K) Survivors: 13312.0K->12288.0K Heap: 102.6M(2048.0M)->12380.5K(2048.0M)]
 [Times: user=0.05 sys=0.00, real=0.05 secs] 

Expected behavior Alluxio FUSE should work on both x86-64 and aarch64.
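
The symptom reported here (the mount is listed by `alluxio-fuse stat`, yet `ls` fails with Input/output error) can also be checked from a script. The following is a minimal sketch, assuming Python 3 is available in the pod; `probe_mount_point` is a hypothetical helper name, not part of Alluxio. It reports whether the kernel lists the path in `/proc/mounts` and whether a `stat` call on it fails with `EIO`.

```python
import errno
import os


def probe_mount_point(path, mounts_file="/proc/mounts"):
    """Return (listed, status) for a mount point.

    listed: True if `path` appears as a mount point in mounts_file.
    status: 'ok', 'missing', 'io_error', or 'other_error' from os.stat().
    """
    listed = False
    try:
        with open(mounts_file) as f:
            for line in f:
                fields = line.split()
                # The second field of each line is the mount point.
                # (Paths with spaces are escaped there; not handled here.)
                if len(fields) >= 2 and fields[1] == path:
                    listed = True
                    break
    except OSError:
        pass  # mounts file not readable, e.g. not on Linux

    try:
        os.stat(path)
        status = "ok"
    except FileNotFoundError:
        status = "missing"
    except OSError as e:
        # EIO is what `ls` reports as "Input/output error".
        status = "io_error" if e.errno == errno.EIO else "other_error"
    return listed, status


if __name__ == "__main__":
    # /data/alluxio is the mount point used in this report.
    print(probe_mount_point("/data/alluxio"))
```

A result of `(True, 'io_error')` would correspond to the state shown in the logs above: the mount is registered, but requests to it fail.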

LuQQiu commented 1 year ago

@michael1589 Thanks for reporting the issue. Does it happen consistently? Is there any information in fuse.out? It could be an issue with some FUSE dependencies.

michael1589 commented 1 year ago

@LuQQiu


[root@alluxio-fuse-wzvx9 logs]# cat fuse.out 
umount: /data/alluxio: not mounted
[root@alluxio-fuse-wzvx9 alluxio-2.9.1-SNAPSHOT-noUI-noHelm]# ls -lh /data/   
ls: cannot access /data/alluxio: Input/output error
total 0
d????????? ? ?    ?     ?            ? alluxio
drwxr-xr-x 2 root root 50 Mar 23 03:08 alluxio-domain
drwxr-xr-x 2 root root  6 Mar 13 07:58 clickhouse1
drwxr-xr-x 2 root root  6 Mar 13 07:58 clickhouse2
drwxr-xr-x 2 root root  6 Mar 13 07:58 clickhouse3
drwxr-xr-x 2 root root  6 Mar 13 07:58 efk-es1
drwxr-xr-x 2 root root  6 Mar 13 07:58 efk-es2
drwxr-xr-x 2 root root  6 Mar 13 07:58 efk-es3
drwxr-xr-x 3 root root 22 Mar 13 08:28 k8s
drwxr-xr-x 2 root root  6 Mar 13 07:58 kafka1
drwxr-xr-x 2 root root  6 Mar 13 07:58 kafka2
drwxrwsr-x 3 root root 24 Mar 13 08:40 kafka3
drwxr-xr-x 6 root root 66 Mar 13 08:13 minio1
drwxr-xr-x 2 root root  6 Mar 13 07:58 minio2
drwxr-xr-x 6 root root 66 Mar 20 07:32 minio3
drwxr-xr-x 2 root root  6 Mar 13 07:58 minio4
drwxr-xr-x 2 root root  6 Mar 13 07:58 minio5
drwxr-xr-x 6 root root 66 Mar 20 07:32 minio6
drwxr-xr-x 2 root root  6 Mar 13 07:58 zk1
drwxr-xr-x 2 root root  6 Mar 13 07:58 zk2
drwxrwsr-x 4 root root 30 Mar 13 08:39 zk3
drwxr-xr-x 2 root root  6 Mar 13 07:58 {alluxio-domain}
[root@alluxio-fuse-wzvx9 alluxio-2.9.1-SNAPSHOT-noUI-noHelm]# ls
LICENSE  assembly  bin  client  conf  integration  lib  libexec  logs  underFSStorage
[root@alluxio-fuse-wzvx9 alluxio-2.9.1-SNAPSHOT-noUI-noHelm]# ./integration/fuse/bin/alluxio-fuse stat
pid mount_point alluxio_path
39103   /data/alluxio   /
[root@alluxio-fuse-wzvx9 alluxio-2.9.1-SNAPSHOT-noUI-noHelm]# cd /data/alluxio
bash: cd: /data/alluxio: Input/output error

michael1589 commented 1 year ago

@LuQQiu Here's my compile procedure:

  1. I compiled libjnifuse on an aarch64 machine inside a Docker container whose image was built from https://github.com/Alluxio/alluxio/blob/release-2.9.0/dev/jenkins/Dockerfile-jdk8. That image includes fuse3, libfuse3-dev, and libfuse-dev.
  2. I copied the resulting libjnifuse.so and libjnifuse3.so over the ones in https://github.com/Alluxio/alluxio/blob/release-2.9.0/integration/jnifuse/native/src/main/resources/.
  3. I then compiled Alluxio separately on x86-64 and aarch64 machines to get one tarball per architecture. The x86-64 tarball uses the libjnifuse.so/libjnifuse3.so originally in the git repo; the aarch64 tarball uses the libjnifuse.so and libjnifuse3.so built in step 1.
  4. Finally, I used https://github.com/Alluxio/alluxio/blob/release-2.9.0/integration/docker/Dockerfile as my Dockerfile to build the Alluxio image, modifying the alluxio-extractor stage for each architecture.

Is this method OK for building the Alluxio image? Does Alluxio support a hybrid deployment with x86-64 and aarch64 nodes? Have you ever used libfuse.so on aarch64 before? I built the .so from https://github.com/alluxio/libfuse/tree/fuse_2_9_5_customize_multi_threads
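
One sanity check for a procedure like this is to confirm that each rebuilt library really targets the architecture of the machine it will run on. The sketch below is an illustration only and assumes Python 3 on Linux; the helper names are hypothetical. It reads the `e_machine` field of the ELF header (62 for x86-64, 183 for AArch64) and compares it with `platform.machine()`. The thread does not establish that an architecture mismatch caused this failure, so treat it as one diagnostic among several.

```python
import platform
import struct

# e_machine values defined by the ELF specification.
ELF_MACHINES = {62: "x86_64", 183: "aarch64"}


def elf_machine(header):
    """Return the architecture name encoded in the first 20 bytes of an ELF file."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    # EI_DATA (byte 5): 1 = little-endian, 2 = big-endian.
    byte_order = "<" if header[5] == 1 else ">"
    # e_machine is a 16-bit field at offset 18.
    (machine,) = struct.unpack_from(byte_order + "H", header, 18)
    return ELF_MACHINES.get(machine, "unknown(%d)" % machine)


def library_matches_host(path):
    """Return (library_arch, matches_host) for a shared library on disk."""
    with open(path, "rb") as f:
        arch = elf_machine(f.read(20))
    # On Linux, platform.machine() is 'x86_64' or 'aarch64' for these hosts.
    return arch, arch == platform.machine()
```

For example, `library_matches_host("libjnifuse.so")` run on the aarch64 node should return `('aarch64', True)` if the rebuilt library was the one picked up.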

michael1589 commented 1 year ago

I think the key is the dynamic libraries here: https://github.com/Alluxio/alluxio/tree/master/integration/jnifuse/native/src/main/resources. Only x86-64 versions are checked in. Mine were recompiled on an aarch64 machine, and I currently have no idea why they fail.

I also suggest keeping only the libjnifuse source code in the git repo instead of checking in libjnifuse.so/libjnifuse3.so/libjnifuse.dylib. That would make cross-compilation much more convenient. The build would just need to generate those dynamic libraries into the integration/jnifuse/native/src/main/resources directory.

LuQQiu commented 1 year ago

Yeah @michael1589, the current libfuse.so and libfuse3.so are pre-built for x86-64 and need to be recompiled following the process you describe. There must be a much better way of dealing with these dependencies, e.g. building them dynamically when needed.

For the tarball and Docker image, we probably also need platform-specific ones, maybe selected with a flag. @ssz1997 any idea here?

ssz1997 commented 1 year ago

@michael1589 Thanks for doing all the experiments. To answer your question, we don't fully support running Alluxio on ARM. We don't have the full picture, but certain functionality does not work, for example the web UI.

Just to make sure: you ended up with two Alluxio images, one for ARM and one for x86-64, correct? I believe everything, including Alluxio, libjnifuse, and libfuse, must be compiled separately for each architecture. Otherwise I don't see any problem with the way you built the images.

michael1589 commented 1 year ago

@ssz1997 Yes, I built two Alluxio images, one for ARM64 and one for x86-64. I built Alluxio, libjnifuse, and libfuse separately on ARM64 and x86-64, exactly as you said. I skipped the web UI when compiling Alluxio on ARM64, and I changed a header file to fix an error when compiling the fuse_2_9_5_customize_multi_threads branch of Alluxio/libfuse.

ssz1997 commented 1 year ago

@HelloHorizon FYI this issue is about running Alluxio Fuse on ARM

HelloHorizon commented 1 year ago

@ssz1997 @michael1589 we do not officially support ARM.

michael1589 commented 1 year ago

@ssz1997 Thank you very much for your kind reply! I'll close this issue.

hubwork0 commented 4 months ago

@michael1589

Hi, I have a question I'd like to ask.

  1. After libjnifuse.so and libjnifuse3.so are recompiled, where should they be placed: only in this directory, https://github.com/Alluxio/alluxio/blob/release-2.9.0/integration/jnifuse/native/src/main/resources/, or somewhere else (such as /usr/lib64)? I want to recompile them and then build the Alluxio project into a tarball inside the Dockerfile, but the build failed when I placed the recompiled *.so files in the above directory. Thank you very much for your attention; I look forward to your reply.
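
Whichever directory is used, one way to tell whether a rebuilt library actually made it into the build output is to compare it with the copy packed in the FUSE jar. The logs earlier in this thread show `libjnifuse.so` being loaded from the root of `alluxio-fuse-2.9.1-SNAPSHOT.jar`, so the sketch below hashes that jar entry and the freshly built file. It assumes Python 3; the function names are hypothetical, and it only verifies packaging, not that the library works.

```python
import hashlib
import zipfile


def sha256_bytes(data):
    """Hex SHA-256 digest of a bytes object."""
    return hashlib.sha256(data).hexdigest()


def native_libs_in_jar(jar_path):
    """Map each .so/.dylib entry in the jar to the SHA-256 of its contents."""
    digests = {}
    with zipfile.ZipFile(jar_path) as jar:
        for name in jar.namelist():
            if name.endswith((".so", ".dylib")):
                digests[name] = sha256_bytes(jar.read(name))
    return digests


def jar_contains_build(jar_path, built_lib_path, entry_name="libjnifuse.so"):
    """True if the jar entry is byte-identical to the freshly built library."""
    with open(built_lib_path, "rb") as f:
        built = sha256_bytes(f.read())
    return native_libs_in_jar(jar_path).get(entry_name) == built
```

If `jar_contains_build(...)` returns `False` for the aarch64 build, the jar is still carrying a different library (for example the x86-64 one from the repository), which would point at the packaging step rather than at the compilation itself.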