Closed tconrado closed 4 years ago
Hi @tconrado, please specify CAS version that you are using.
hey there, it is open-cas-linux-v20.03.2.0295.tar.gz (we updated the question to reflect it)
Hi @tconrado, thanks for your report.
Can you provide some more logs from before the machine entered emergency shell? Maybe there we'll find some clues as to what happened exactly - I suspect that there might be some logs connected to mismatch of core UUID, which would indicate problem with our by-id links resolving.
Hi @tconrado. It seems to be a known issue. The problem is that if a block device has multiple symlinks in /dev/disk/by-id/, Open CAS during the startup procedure sometimes tries to use a symlink different from the one in the config. This results in a situation where CAS recognizes the device as part of the configuration but is not able to attach it correctly. We are constantly trying to extend the list of supported devices, and this particular issue should be resolved in the next release (v20.12), which is expected at the very beginning of 2021.
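To see the ambiguity described above, you can list the aliases a device has under /dev/disk/by-id/. As a self-contained illustration (the device and symlink names below are made up), two different aliases can resolve to the same node:

```shell
# Illustration only: one block device frequently has several
# /dev/disk/by-id symlinks. Simulated here in a temp directory
# with made-up names so it runs anywhere.
tmp=$(mktemp -d)
touch "$tmp/sdd"                              # stand-in for the real block device
ln -s sdd "$tmp/scsi-SASR7240_cas4_AC6CCA35"  # alias referenced in opencas.conf
ln -s sdd "$tmp/wwn-0x5000c500a1b2c3d4"       # extra alias CAS may resolve instead
readlink -f "$tmp"/scsi-* "$tmp"/wwn-*        # both print the same target path
rm -r "$tmp"
```

On a real system, `ls -l /dev/disk/by-id/` shows every alias and the device node each one points to.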
If you want to fix this issue earlier, you can do it pretty easily. At casadm/cas_lib.c:1632 there is an array named prefix_blacklist. You can add to this array the prefixes of all symlinks pointing to your device except the one you use in opencas.conf. That removes the ambiguity during startup, so everything should start working fine.
Indeed, after looking at the logs, it is easy to notice that very often it does use a symlink that is not the one listed in the config file. Usually our servers run more than 180 days without a restart, so it's not a big deal; we will just wait. Thanks for mentioning it.
best regards.
Description
If more than two cache devices are configured, some may not complete initialization on boot. If the opencas.conf file is moved out of the way, the reboot is OK; moving it back and running casctl start then allows everything to start perfectly.
Expected Behavior
We expected all CAS devices to be active after reboot, so that we could mount them via /etc/fstab. Instead, the system is forced into emergency boot mode because it cannot initialize all devices. In emergency mode, it is possible to notice that only some CAS devices are active; the others have their core devices listed as inactive. The very same core devices that are inactive are also listed as detached, while they should be listed as active only.
Steps to Reproduce
Context
This bug prevents the use of Open CAS on systems where server restarts may be unattended, and defeats the purpose of the opencas.conf file. We can confirm that this bug was never observed on machines with a single cache device, or even on machines where multiple cache devices were combined using LVM.
Possible Fix
We tried to filter cache and core devices from LVM, with no result. Once a core device is assigned as inactive, it needs to have its alias removed from the casadm -L list, where it shows as detached; casctl settle may stop and restart the device successfully. Stopping CAS with casctl stop, removing all the detached devices with casadm --remove-detached -d /dev/sdX, and starting again with casctl start allows the boot procedure to continue.
Logs
systemd[1]: open-cas.service: Triggering OnFailure= dependencies.
systemd[1]: local-fs-pre.target: Job local-fs-pre.target/start failed with result 'dependency'.
systemd[1]: Dependency failed for Local File Systems (Pre).
systemd[1]: local-fs.target: Job local-fs.target/start failed with result 'dependency'.
systemd[1]: Dependency failed for Local File Systems.
systemd[1]: Failed to start opencas initialization service.
systemd[1]: open-cas.service: Failed with result 'exit-code'.
systemd[1]: open-cas.service: Main process exited, code=exited, status=1/FAILURE
casctl[1675]: Couldn't add device /dev/disk/by-id/scsi-SASR7240_cas5_BB9CCA35 as core 1 in cache 5
casctl[1675]: Couldn't add device /dev/disk/by-id/scsi-SASR7240_cas4_AC6CCA35 as core 1 in cache 4
casctl[1675]: Couldn't add device /dev/disk/by-id/scsi-1ADAPTEC_ARRAY_7744CA35 as core 1 in cache 3
casctl[1675]: Open CAS initialization failed. Couldn't set up all required devices
kernel: [Open-CAS] Adding device /dev/disk/by-id/scsi-1ADAPTEC_ARRAY_0284AA35 as core core1 to cache cache1
kernel: [Open-CAS] [Classifier] Initialized IO classifier
kernel: [Open-CAS] Adding device /dev/disk/by-id/nvme-INTEL_SSDPEDME016T4S_CVMD4514002U1P6KGN as cache cache
kernel: cache1.core1: Successfully added
kernel: cache1: Promotion policy : nhit
kernel: cache1: Cleaning policy : acp
kernel: cache1: Eviction policy : lru
kernel: cache1: Cache mode : wb
kernel: cache1: Successfully loaded
kernel: cache1: Cache attached
Your Environment
sda        8:0      1   5.5T  0 disk
└─cas1-1   252:1024 0   5.5T  0 disk
sdb        8:16     1   5.5T  0 disk
└─cas2-1   252:1280 0   5.5T  0 disk
sdc        8:32     1   5.5T  0 disk
└─cas3-1   252:1536 0   5.5T  0 disk
sdd        8:48     1   5.5T  0 disk
└─cas4-1   252:1792 0   5.5T  0 disk
sde        8:64     1   5.5T  0 disk
└─cas5-1   252:2048 0   5.5T  0 disk
sdf        8:80     1   5.5T  0 disk
sdg        8:96     1 111.8G  0 disk
├─sdg1     8:97     1   285M  0 part /boot/efi
└─sdg2     8:98     1 109.9G  0 part /
nvme3n1    259:2    0   1.5T  0 disk
nvme2n1    259:3    0   1.5T  0 disk
nvme4n1    259:4    0   1.5T  0 disk
nvme1n1    259:5    0   1.5T  0 disk
nvme0n1    259:6    0   1.5T  0 disk
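For reference, the manual recovery described under Possible Fix can be sketched as a shell sequence. The /dev/sdX device names are placeholders (substitute the cores that casadm -L reports as detached); this requires a machine with Open CAS installed and should not be run blindly:

```shell
# Sketch of the manual recovery from the Possible Fix section above.
casctl stop                            # stop all CAS devices
casadm --remove-detached -d /dev/sdd   # repeat for every detached core device
casadm --remove-detached -d /dev/sde
casctl start                           # re-run startup; boot can then continue
```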