NVIDIA / aistore

AIStore: scalable storage for AI applications
https://aistore.nvidia.com
MIT License

What is the housekeeper ? #82

Closed benjha closed 2 years ago

benjha commented 3 years ago

Hi AIStore team,

I am trying to run AIStore on an HPC system, using Go 1.17.3. After executing make deploy, an output message says that the housekeeper is not running, and then the deployment fails.

What is the housekeeper ?

make deploy
Enter number of storage targets:
5
Enter number of proxies (gateways):
1
Number of local mountpaths (enter 0 for preconfigured filesystems):
2
Select backend providers:
Amazon S3: (y/n) ?
n
Google Cloud Storage: (y/n) ?
n
Azure: (y/n) ?
n
HDFS: (y/n) ?
n
Would you like to create loopback mount points: (y/n) ?
n
Building aisnode: version=1bea20d85 providers= tags= mono
done.
+ /sw/summit/ums/gen119/aistore/src/bin/aisnode -config=/ccs/home/benjha/.ais0/ais.json -local_config=/ccs/home/benjha/.ais0/ais_local.json -role=proxy -ntargets=5
housekeeper not running, cannot reg ".dflt.mm.gc"housekeeper not running, cannot reg ".dflt.mm.small.gc"+ /sw/summit/ums/gen119/aistore/src/bin/aisnode -config=/ccs/home/benjha/.ais1/ais.json -local_config=/ccs/home/benjha/.ais1/ais_local.json -role=target
+ /sw/summit/ums/gen119/aistore/src/bin/aisnode -config=/ccs/home/benjha/.ais2/ais.json -local_config=/ccs/home/benjha/.ais2/ais_local.json -role=target
+ /sw/summit/ums/gen119/aistore/src/bin/aisnode -config=/ccs/home/benjha/.ais3/ais.json -local_config=/ccs/home/benjha/.ais3/ais_local.json -role=target
+ /sw/summit/ums/gen119/aistore/src/bin/aisnode -config=/ccs/home/benjha/.ais4/ais.json -local_config=/ccs/home/benjha/.ais4/ais_local.json -role=target
+ /sw/summit/ums/gen119/aistore/src/bin/aisnode -config=/ccs/home/benjha/.ais5/ais.json -local_config=/ccs/home/benjha/.ais5/ais_local.json -role=target
E 14:55:57.012409 err.go:118 FATAL ERROR: operation not supported
FATAL ERROR: operation not supported
E 14:55:57.012480 err.go:118 FATAL ERROR: operation not supported
FATAL ERROR: operation not supported
E 14:55:57.012924 err.go:118 FATAL ERROR: operation not supported
FATAL ERROR: operation not supported
E 14:55:57.013381 err.go:118 FATAL ERROR: operation not supported
FATAL ERROR: operation not supported
E 14:55:57.013471 err.go:118 FATAL ERROR: operation not supported
FATAL ERROR: operation not supported
Done.

Thanks

VirrageS commented 3 years ago

You can ignore the "housekeeper not running" message. It is not fatal, but it isn't correct behavior either; I'm working on a fix.

The real problem is "FATAL ERROR: operation not supported". It looks like the filesystem does not support some operation that AIStore needs. Right now it is unclear where this error originates and why it happens in the first place. I'm working on a fix so that the error reports the correct file and line. In the meantime, do you know what the underlying filesystem is in your environment? You can run lsblk -f to check.
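If lsblk -f is unavailable or omits the filesystem column, the same information can be read from the mount table. The following is a minimal Python sketch, assuming the Linux /proc/mounts layout (device, mount point, filesystem type, ...). The function name fs_type_for and the SAMPLE text are illustrative only; the sample is not the actual mount table of the system discussed here.

```python
import os

# Illustrative sample in /proc/mounts format; device names are made up.
SAMPLE = """\
/dev/example-root / ext4 rw 0 0
/dev/example-nvme /mnt/bb/benjha xfs rw 0 0
"""


def fs_type_for(path, mounts_text):
    """Return (mount_point, fs_type) for the deepest mount point containing path.

    Returns None if no line matches. Escaped characters in mount points
    (for example \\040 for a space) are not decoded in this sketch.
    """
    path = os.path.abspath(path)
    best = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            # Prefer the longest mount point; on ties, a later line wins
            # because a later mount shadows an earlier one.
            if best is None or len(mount_point) >= len(best[0]):
                best = (mount_point, fs_type)
    return best


print(fs_type_for("/mnt/bb/benjha/ais", SAMPLE))
```

On a real node, pass the contents of /proc/mounts instead of SAMPLE.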

VirrageS commented 3 years ago

Hey @benjha, I've pushed some fixes to the master branch. If you have a chance to build from master and run it, that would be awesome. Let me know what error message you get.

benjha commented 3 years ago

Thanks @VirrageS,

I don't see the housekeeper error anymore. I think I caused the FATAL ERROR: operation not supported myself by commenting out the line in deploy/dev/local/deploy.sh that checks for setfattr, since that command is not installed here.
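For reference, whether a directory accepts extended attributes can be probed even when setfattr is not installed. The following is a minimal Python sketch, assuming Linux (os.setxattr and os.getxattr are Linux-only). The function name and the attribute name user.probe are hypothetical, and the check is not part of AIStore. A failure of this kind may be what surfaces as "operation not supported", but that link is an assumption.

```python
import os
import tempfile


def supports_user_xattrs(directory):
    """Return True if a file created in `directory` accepts a user.* xattr.

    Any OSError while setting or reading the attribute is treated as
    "not supported"; this sketch does not distinguish the causes.
    """
    if not hasattr(os, "setxattr"):
        return False  # platform without os.setxattr (non-Linux)
    # Create a scratch file on the filesystem under test; it is removed
    # automatically when the with-block exits.
    with tempfile.NamedTemporaryFile(dir=directory) as probe:
        try:
            os.setxattr(probe.name, "user.probe", b"1")
            return os.getxattr(probe.name, "user.probe") == b"1"
        except OSError:
            return False


if __name__ == "__main__":
    # Probe the current directory; replace with the intended mount path.
    print(supports_user_xattrs("."))
```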

This is the output of lsblk -f in the compute node.

NAME     MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
bb-cache 253:0    0 104.3G  0 lvm  /var/cache/fscache
bb-bb1   253:1    0   1.4T  0 lvm  /mnt/bb/benjha
nvme0n1  259:1    0   1.5T  0 disk 

Once I figure out the issue with the extended attributes, I'd like to launch AIStore on /mnt/bb/benjha, which is the mount point of nvme0n1 (the NVMe partition uses XFS). I think this should be set in one of the configuration files, right? Where can I find documentation about this?

On the other hand, is aisfs an optional requirement ?

VirrageS commented 3 years ago

Once I figure out the issue with the extended attributes

Yeah, I think this can be connected to the extended attributes, as AIStore uses them broadly and requires them to be enabled. AIStore requires the following packages to be installed: gcc, sysstat and attr (see: https://github.com/NVIDIA/aistore/blob/master/docs/getting_started.md#prerequisites).
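A quick way to check for these packages is to look for one command that each of them typically installs. The following Python sketch assumes that sysstat provides iostat and attr provides setfattr. The mapping and the function name are illustrative and are not taken from AIStore's deploy scripts.

```python
import shutil

# Assumed mapping: package name -> one representative command it installs.
REQUIRED_COMMANDS = {"gcc": "gcc", "sysstat": "iostat", "attr": "setfattr"}


def missing_packages(commands=REQUIRED_COMMANDS, which=shutil.which):
    """Return the sorted package names whose command is not found on PATH."""
    return sorted(pkg for pkg, cmd in commands.items() if which(cmd) is None)


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("missing packages:", ", ".join(missing))
    else:
        print("all prerequisite commands found")
```

The which parameter exists only so the lookup can be replaced in a test; in normal use the default shutil.which is enough.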

I'd like to launch AIStore on /mnt/bb/benjha, which is the mount point of nvme0n1 (the NVMe partition uses XFS). I think this should be set in one of the configuration files, right? Where can I find documentation about this?

Yes, see https://aiatscale.org/docs/configuration. Basically, what you probably want to do is modify deploy/dev/local/aisnode_config.sh (assuming you are deploying with make deploy). The settings that may be of interest to you are:

On the other hand, is aisfs an optional requirement?

Yes, it is entirely optional. aisfs is the tool that lets you mount AIStore as a directory.

benjha commented 2 years ago

Ok, looks like none of the kernel modules needed by AIStore are loaded in the system.

Thanks for your help.

VirrageS commented 2 years ago

Sounds good :) Closing the issue. If you have any further problems, let us know!