Closed piste-jp closed 1 year ago
It looks Data Path Failover (DPF) mis-detects path changes after path switch.
DPF has 4 steps like below. In this case, LTFS mis-detects the new path is same as the previous path even if it is different in the step3. So LTFS receives RESERVATION_CONFRICT
error, -21719
, continuously on step4 and finally write error is reported. This is the first trigger of error chain.
This is the sample of the log.
2023-06-15T15:55:19.377118+02:00 archive1 ltfs[64465]: 7f0ac LTFS30200I Failed to execute SG_IO ioctl, opcode = 00 (19).
2023-06-15T15:55:19.377333+02:00 archive1 ltfs[64465]: 7f0ac LTFS30246I Connection down is detected, try to reconnect (000007564E6A).
2023-06-15T15:55:22.315761+02:00 archive1 ltfs[64465]: 7f0ac LTFS30206I Cannot open device: inquiry failed (-21720).
...
2023-06-15T15:55:22.402105+02:00 archive1 ltfs[64465]: 7f0ac LTFS30206I Cannot open device: inquiry failed (-21700).
2023-06-15T15:55:22.402148+02:00 archive1 ltfs[64465]: 7f0ac LTFS30249I Opening another path for drive 000007564E6A on (null).
2023-06-15T15:55:22.403060+02:00 archive1 ltfs[64465]: 7f0ac LTFS30250I Opened the SCSI tape device 15.0.9.0 (/dev/sg8).
2023-06-15T15:55:22.403147+02:00 archive1 ltfs[64465]: 7f0ac LTFS30207I Vendor ID is IBM .
2023-06-15T15:55:22.403195+02:00 archive1 ltfs[64465]: 7f0ac LTFS30208I Product ID is 0359260F .
2023-06-15T15:55:22.403245+02:00 archive1 ltfs[64465]: 7f0ac LTFS30214I Firmware revision is 590C.
2023-06-15T15:55:22.403302+02:00 archive1 ltfs[64465]: 7f0ac LTFS30215I Drive serial is 000007564E6A.
2023-06-15T15:55:22.403360+02:00 archive1 ltfs[64465]: 7f0ac LTFS30285I The reserved buffer size of 000007564E6A is 1048576.
2023-06-15T15:55:22.417824+02:00 archive1 ltfs[64465]: 7f0ac LTFS30205I TEST_UNIT_READY (0x00) returns -20603.
2023-06-15T15:55:22.440453+02:00 archive1 ltfs[64465]: 7f0ac LTFS30205I TEST_UNIT_READY (0x00) returns -20612.
2023-06-15T15:55:22.440742+02:00 archive1 ltfs[64465]: 7f0ac LTFS30205I TEST_UNIT_READY (0x00) returns -20601.
2023-06-15T15:55:22.487942+02:00 archive1 ltfs[64465]: 7f0ac LTFS30205I TEST_UNIT_READY (0x00) returns -21719.
2023-06-15T15:55:22.525455+02:00 archive1 ltfs[64465]: 7f0ac LTFS30205I TEST_UNIT_READY (0x00) returns -21719.
2023-06-15T15:55:22.592711+02:00 archive1 ltfs[64465]: 7f0ac LTFS30205I TEST_UNIT_READY (0x00) returns -21719.
2023-06-15T15:55:22.593456+02:00 archive1 ltfs[64465]: 7f0ac LTFS30271I Successfully reopened drive 000007564E6A with the same path.
2023-06-15T15:55:22.634296+02:00 archive1 ltfs[64465]: 7f0ac LTFS30205I TEST_UNIT_READY (0x00) returns -21719.
2023-06-15T15:55:22.650277+02:00 archive1 ltfs[64465]: 7f0ac LTFS30205I TEST_UNIT_READY (0x00) returns -21719.
2023-06-15T15:55:22.685140+02:00 archive1 ltfs[64465]: 7f0ac LTFS30205I TEST_UNIT_READY (0x00) returns -21719.
2023-06-15T15:55:22.685307+02:00 archive1 ltfs[64465]: 7f0ac LTFS12029E Device is not ready (-21722).
2023-06-15T15:55:22.685410+02:00 archive1 ltfs[64465]: 7f0ac LTFS11312I Revalidation process of the medium is starting. (010104)
2023-06-15T15:55:22.721984+02:00 archive1 ltfs[64465]: 7f0ac LTFS30205I TEST_UNIT_READY (0x00) returns -21719.
2023-06-15T15:55:22.722157+02:00 archive1 ltfs[64465]: 7f0ac LTFS11313E Medium revalidation failed (-21719). Unmount the tape before continuing. (010104)
2023-06-15T15:55:22.722262+02:00 archive1 ltfs[64465]: 7f0ac LTFS11750E Failed to get tape alert data from cartridge '010104JE' (-21719).
2023-06-15T15:55:22.722342+02:00 archive1 ltfs[64465]: 7f0ac LTFS14747E Failed to query the capacity data of cartridge '010104JE' (-1068).
The point is
2023-06-15T15:55:22.417824+02:00 archive1 ltfs[64465]: 7f0ac LTFS30205I TEST_UNIT_READY (0x00) returns -20603.
2023-06-15T15:55:22.440453+02:00 archive1 ltfs[64465]: 7f0ac LTFS30205I TEST_UNIT_READY (0x00) returns -20612.
2023-06-15T15:55:22.440742+02:00 archive1 ltfs[64465]: 7f0ac LTFS30205I TEST_UNIT_READY (0x00) returns -20601.
2023-06-15T15:55:22.487942+02:00 archive1 ltfs[64465]: 7f0ac LTFS30205I TEST_UNIT_READY (0x00) returns -21719.
2023-06-15T15:55:22.525455+02:00 archive1 ltfs[64465]: 7f0ac LTFS30205I TEST_UNIT_READY (0x00) returns -21719.
2023-06-15T15:55:22.592711+02:00 archive1 ltfs[64465]: 7f0ac LTFS30205I TEST_UNIT_READY (0x00) returns -21719.
2023-06-15T15:55:22.593456+02:00 archive1 ltfs[64465]: 7f0ac LTFS30271I Successfully reopened drive 000007564E6A with the same path.
And the code generated these lines are https://github.com/LinearTapeFileSystem/ltfs/blob/f95a9a4463a0e3dbb0b69aa57324a78e4eb80322/src/tape_drivers/linux/sg/sg_tape.c#L771-L806.
The first 3 LTFS30205I
s are generated by the _clear_por(priv);
in the line 772, and next 3 LTFS30205I
s shall be generated by the _clear_por(priv);
in the line 787. It means ret = _raw_tur(priv->dev.fd);
in the line 773 receives 0
unexpectedly. In this case, our expectation is ret = _raw_tur(priv->dev.fd);
in the line 773 receives -21719
and execute the if
block, and refresh the persistent reservation.
But actually, LTFS executed the else
block and old reservation is remaining. So LTFS keeps receiving -21719
in the step4 and finally write fail is reported.
Indeed, the root cause is ret = _raw_tur(priv->dev.fd);
returned 0
. And I don't know why but I believe it is not a good strategy to investigate deeper here now...
From the log LTFS receives 0
in one shot. So we might be able to define work around code when LTFS receives -21719
around line 788.
Might be resolved.
Describe the bug Data path failover fails
To Reproduce
Expected behavior LTFS keep writing a file with path failover and no report for write error.
Desktop (please complete the following information):