LinearTapeFileSystem / ltfs

Reference implementation of the LTFS format Spec for stand alone tape drive
BSD 3-Clause "New" or "Revised" License
250 stars 75 forks source link

Failure of data path failover #409

Closed piste-jp closed 1 year ago

piste-jp commented 1 year ago

Describe the bug Data path failover fails

To Reproduce

  1. Keep writing a file (with 2 paths to the drive)
  2. Power down a FC switch in the middle of a path

Expected behavior LTFS keep writing a file with path failover and no report for write error.

Desktop (please complete the following information):

piste-jp commented 1 year ago

It looks Data Path Failover (DPF) mis-detects path changes after path switch.

DPF has 4 steps like below. In this case, LTFS mis-detects the new path is same as the previous path even if it is different in the step3. So LTFS receives RESERVATION_CONFRICT error, -21719, continuously on step4 and finally write error is reported. This is the first trigger of error chain.

  1. Detect path lost or path down
  2. Reopen new path to the same drive
  3. Detect the new path is same as the previous path or different one 3-1. Make a persistent reservation again on new path if it is different from the previous path
  4. Revalidate tape

This is the sample of the log.

2023-06-15T15:55:19.377118+02:00 archive1 ltfs[64465]: 7f0ac LTFS30200I Failed to execute SG_IO ioctl, opcode = 00 (19).
2023-06-15T15:55:19.377333+02:00 archive1 ltfs[64465]: 7f0ac LTFS30246I Connection down is detected, try to reconnect (000007564E6A).
2023-06-15T15:55:22.315761+02:00 archive1 ltfs[64465]: 7f0ac LTFS30206I Cannot open device: inquiry failed (-21720).
...
2023-06-15T15:55:22.402105+02:00 archive1 ltfs[64465]: 7f0ac LTFS30206I Cannot open device: inquiry failed (-21700).
2023-06-15T15:55:22.402148+02:00 archive1 ltfs[64465]: 7f0ac LTFS30249I Opening another path for drive 000007564E6A on (null).
2023-06-15T15:55:22.403060+02:00 archive1 ltfs[64465]: 7f0ac LTFS30250I Opened the SCSI tape device 15.0.9.0 (/dev/sg8).
2023-06-15T15:55:22.403147+02:00 archive1 ltfs[64465]: 7f0ac LTFS30207I Vendor ID is IBM     .
2023-06-15T15:55:22.403195+02:00 archive1 ltfs[64465]: 7f0ac LTFS30208I Product ID is 0359260F        .
2023-06-15T15:55:22.403245+02:00 archive1 ltfs[64465]: 7f0ac LTFS30214I Firmware revision is 590C.
2023-06-15T15:55:22.403302+02:00 archive1 ltfs[64465]: 7f0ac LTFS30215I Drive serial is 000007564E6A.
2023-06-15T15:55:22.403360+02:00 archive1 ltfs[64465]: 7f0ac LTFS30285I The reserved buffer size of 000007564E6A is 1048576.
2023-06-15T15:55:22.417824+02:00 archive1 ltfs[64465]: 7f0ac LTFS30205I TEST_UNIT_READY (0x00) returns -20603.
2023-06-15T15:55:22.440453+02:00 archive1 ltfs[64465]: 7f0ac LTFS30205I TEST_UNIT_READY (0x00) returns -20612.
2023-06-15T15:55:22.440742+02:00 archive1 ltfs[64465]: 7f0ac LTFS30205I TEST_UNIT_READY (0x00) returns -20601.
2023-06-15T15:55:22.487942+02:00 archive1 ltfs[64465]: 7f0ac LTFS30205I TEST_UNIT_READY (0x00) returns -21719.
2023-06-15T15:55:22.525455+02:00 archive1 ltfs[64465]: 7f0ac LTFS30205I TEST_UNIT_READY (0x00) returns -21719.
2023-06-15T15:55:22.592711+02:00 archive1 ltfs[64465]: 7f0ac LTFS30205I TEST_UNIT_READY (0x00) returns -21719.
2023-06-15T15:55:22.593456+02:00 archive1 ltfs[64465]: 7f0ac LTFS30271I Successfully reopened drive 000007564E6A with the same path.
2023-06-15T15:55:22.634296+02:00 archive1 ltfs[64465]: 7f0ac LTFS30205I TEST_UNIT_READY (0x00) returns -21719.
2023-06-15T15:55:22.650277+02:00 archive1 ltfs[64465]: 7f0ac LTFS30205I TEST_UNIT_READY (0x00) returns -21719.
2023-06-15T15:55:22.685140+02:00 archive1 ltfs[64465]: 7f0ac LTFS30205I TEST_UNIT_READY (0x00) returns -21719.
2023-06-15T15:55:22.685307+02:00 archive1 ltfs[64465]: 7f0ac LTFS12029E Device is not ready (-21722).
2023-06-15T15:55:22.685410+02:00 archive1 ltfs[64465]: 7f0ac LTFS11312I Revalidation process of the medium is starting. (010104)
2023-06-15T15:55:22.721984+02:00 archive1 ltfs[64465]: 7f0ac LTFS30205I TEST_UNIT_READY (0x00) returns -21719.
2023-06-15T15:55:22.722157+02:00 archive1 ltfs[64465]: 7f0ac LTFS11313E Medium revalidation failed (-21719). Unmount the tape before continuing. (010104)
2023-06-15T15:55:22.722262+02:00 archive1 ltfs[64465]: 7f0ac LTFS11750E Failed to get tape alert data from cartridge '010104JE' (-21719).
2023-06-15T15:55:22.722342+02:00 archive1 ltfs[64465]: 7f0ac LTFS14747E Failed to query the capacity data of cartridge '010104JE' (-1068).

The point is

2023-06-15T15:55:22.417824+02:00 archive1 ltfs[64465]: 7f0ac LTFS30205I TEST_UNIT_READY (0x00) returns -20603.
2023-06-15T15:55:22.440453+02:00 archive1 ltfs[64465]: 7f0ac LTFS30205I TEST_UNIT_READY (0x00) returns -20612.
2023-06-15T15:55:22.440742+02:00 archive1 ltfs[64465]: 7f0ac LTFS30205I TEST_UNIT_READY (0x00) returns -20601.
2023-06-15T15:55:22.487942+02:00 archive1 ltfs[64465]: 7f0ac LTFS30205I TEST_UNIT_READY (0x00) returns -21719.
2023-06-15T15:55:22.525455+02:00 archive1 ltfs[64465]: 7f0ac LTFS30205I TEST_UNIT_READY (0x00) returns -21719.
2023-06-15T15:55:22.592711+02:00 archive1 ltfs[64465]: 7f0ac LTFS30205I TEST_UNIT_READY (0x00) returns -21719.
2023-06-15T15:55:22.593456+02:00 archive1 ltfs[64465]: 7f0ac LTFS30271I Successfully reopened drive 000007564E6A with the same path.

And the code generated these lines are https://github.com/LinearTapeFileSystem/ltfs/blob/f95a9a4463a0e3dbb0b69aa57324a78e4eb80322/src/tape_drivers/linux/sg/sg_tape.c#L771-L806.

The first 3 LTFS30205Is are generated by the _clear_por(priv); in the line 772, and next 3 LTFS30205Is shall be generated by the _clear_por(priv); in the line 787. It means ret = _raw_tur(priv->dev.fd); in the line 773 receives 0 unexpectedly. In this case, our expectation is ret = _raw_tur(priv->dev.fd); in the line 773 receives -21719 and execute the if block, and refresh the persistent reservation.

But actually, LTFS executed the else block and old reservation is remaining. So LTFS keeps receiving -21719 in the step4 and finally write fail is reported.

Indeed, the root cause is ret = _raw_tur(priv->dev.fd); returned 0. And I don't know why but I believe it is not a good strategy to investigate deeper here now...

From the log LTFS receives 0 in one shot. So we might be able to define work around code when LTFS receives -21719 around line 788.

piste-jp commented 1 year ago

Might be resolved.