LinearTapeFileSystem / ltfs

Reference implementation of the LTFS format Spec for stand alone tape drive
BSD 3-Clause "New" or "Revised" License

Mount fails because "LTFS17285E Failed to search the final index in IP (1)" even when ltfs can try to search on DP. #479

Open amissael95 opened 3 months ago

amissael95 commented 3 months ago

Describe the bug

When mounting a tape cartridge that has a permanent write error, if the MAM (cartridge memory) attribute of the Index Partition (IP) stores a generation number lower than the MAM attribute of the Data Partition (DP), the mount fails with error LTFS17285E even though ltfs could still search for the index in the DP.

The following log shows that scenario.

LTFS11005I Mounting the volume.
LTFS30252I Logical block protection is disabled.
LTFS11333I A cartridge with write-perm error is detected on IP. Seek the newest index (IP: Gen = 26, VCR = 152) (DP: Gen = 27, VCR = 252) (VCR = 180).
LTFS17283I Detected unmatched VCR value between MAM and VCR (152, 180).
LTFS17284I Seaching the final index in IP.
LTFS17285E Failed to search the final index in IP (1).
LTFS14013E Cannot mount the volume.

To Reproduce

  1. Select a tape with a permanent write error to be mounted
  2. Look for message LTFS11333I and confirm that the IP generation is lower than the DP generation:
    LTFS11333I  A cartridge with write-perm error is detected on %s. Seek the newest index (IP: Gen = %llu, VCR = %llu) (DP: Gen = %llu, VCR = %llu) (VCR = %llu).
  3. The mount process fails with LTFS14013E because the search for the final index in the IP failed (LTFS17285E)

Note: This is hard to reproduce because, as mentioned above, the tape cartridge needs to have a permanent write error.
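
Step 2 can also be checked mechanically against a captured log. The following is a minimal sketch, assuming the log line follows the LTFS11333I template quoted above; the struct and the function names (`wp_log_info`, `parse_ltfs11333i`, `ip_index_is_older`) are hypothetical helpers and are not part of ltfs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Values carried by one LTFS11333I message (hypothetical helper type). */
struct wp_log_info {
    char partition[8];             /* partition named in the message, e.g. "IP" */
    unsigned long long ip_gen;     /* generation stored in the MAM for the IP */
    unsigned long long ip_vcr;
    unsigned long long dp_gen;     /* generation stored in the MAM for the DP */
    unsigned long long dp_vcr;
    unsigned long long vcr;        /* VCR value reported at the end of the message */
};

/* Parse one log line; returns true only if all six fields were read. */
static bool parse_ltfs11333i(const char *line, struct wp_log_info *out)
{
    assert(line != NULL && out != NULL);

    const char *p = strstr(line, "LTFS11333I");
    if (p == NULL)
        return false;

    int n = sscanf(p,
                   "LTFS11333I A cartridge with write-perm error is detected on %7[^.]. "
                   "Seek the newest index (IP: Gen = %llu, VCR = %llu) "
                   "(DP: Gen = %llu, VCR = %llu) (VCR = %llu)",
                   out->partition, &out->ip_gen, &out->ip_vcr,
                   &out->dp_gen, &out->dp_vcr, &out->vcr);
    return n == 6;
}

/* True when the tape matches the condition described in step 2. */
static bool ip_index_is_older(const struct wp_log_info *info)
{
    assert(info != NULL);
    return info->ip_gen < info->dp_gen;
}
```

For the log line in the bug description, `parse_ltfs11333i` fills in `ip_gen = 26` and `dp_gen = 27`, so `ip_index_is_older` reports that the tape matches the failing scenario.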

Expected behavior

It seems the issue can be solved by making _ltfs_search_index_wp in ltfs/src/libltfs/ltfs.c continue searching the DP even if the search on the IP fails. (This can be done by setting can_skip_ip = true.)

https://github.com/LinearTapeFileSystem/ltfs/blob/7271446b55eea437795c23576d6ac204ca72e8af/src/libltfs/ltfs.c#L1464-L1507

Additional context

This leads me to ask: was there any reason to avoid searching for the index on the Data Partition?

The "can_skip_ip" flag was explicitly set to false in the following commit https://github.com/LinearTapeFileSystem/ltfs/commit/328785064e1820108278cec66283a03c16fe8908. Was there any special reason to do that?

piste-jp commented 3 months ago

It looks like a bug.

The blocks

https://github.com/LinearTapeFileSystem/ltfs/blob/7271446b55eea437795c23576d6ac204ca72e8af/src/libltfs/ltfs.c#L1661-L1663

and

https://github.com/LinearTapeFileSystem/ltfs/blob/7271446b55eea437795c23576d6ac204ca72e8af/src/libltfs/ltfs.c#L1690-L1693

shall be swapped.

The upper code belongs to the logic that handles a WP that happened on the IP. The index on the IP might be corrupted, so the skip flag shall be true.

The lower code belongs to the logic that handles a WP that happened on the DP. The index shall be searched from the IP, so the skip flag shall be false.
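
The rule described here can be written down in one line. This is a minimal sketch of the intended mapping only; the enum and the function name are hypothetical and do not appear in ltfs.c.

```c
#include <assert.h>
#include <stdbool.h>

/* Partition on which the write perm error was detected (hypothetical type). */
enum wp_partition { WP_ON_IP, WP_ON_DP };

/*
 * Intended value of the skip flag per the explanation above:
 * WP on the IP -> the IP index may be corrupted, so skipping the IP is allowed.
 * WP on the DP -> the index must be found on the IP, so skipping is not allowed.
 */
static bool can_skip_ip_for(enum wp_partition wp)
{
    assert(wp == WP_ON_IP || wp == WP_ON_DP);
    return wp == WP_ON_IP;
}
```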

amissael95 commented 3 months ago

Hello @piste-jp,

Thanks for the quick response. I am curious: could we just remove the "can_skip_ip" flag and let the _ltfs_search_index_wp function try to search for the index on both partitions, the IP and the DP?

In the end, the logic consists of using the latest index on tape, so it does not hurt to simply search for the index on both partitions, treat the generation as 0 for any partition where the search failed, and use the latest index.

Regards

piste-jp commented 3 months ago

Could we just remove the "can_skip_ip" flag and let the _ltfs_search_index_wp function try to search for the index on both partitions, the IP and the DP?

I believe it's a little bit dangerous, because the block starting at L1680 means the tape says the IP has the latest index, so at least one index must exist on the IP. Why would we provide a skip flag there, or retire the skip flag and always allow the skip?

https://github.com/LinearTapeFileSystem/ltfs/blob/7271446b55eea437795c23576d6ac204ca72e8af/src/libltfs/ltfs.c#L1680-L1695

Your proposal might relax the acceptable tape condition a little, but it just ignores unexpected behavior of the tape drive or of LTFS itself. I believe we need to understand why it happens, if it really happens, and fix it correctly. Your proposal just hides that fact without any understanding of it.

I believe it's not the time to do that.

amissael95 commented 2 months ago

@piste-jp,

I have created the following PR https://github.com/LinearTapeFileSystem/ltfs/pull/480 with the modifications that you pointed out.

Do you think we can ensure that the change will not break the tape and that it is safe to implement? I am currently trying to replicate this scenario using itdt... I think the only problem would be if we wrote incorrect index data to the tape.

In addition, it is good to emphasize that this involves a "data lost" scenario, since the index found will not point to all files within the tape.

Regards

piste-jp commented 2 months ago

I have created the following PR #480 with the modifications that you pointed out.

Do you think we can ensure that the change will not break the tape and that it is safe to implement? I am currently trying to replicate this scenario using itdt... I think the only problem would be if we wrote incorrect index data to the tape.

For PR discussion, you need to use the comment thread on the PR. Let's use #480.

In addition, it is good to emphasize that this involves a "data lost" scenario, since the index found will not point to all files within the tape.

I cannot understand this ... Why?

amissael95 commented 2 months ago

I cannot understand this ... Why?

What I meant is that, because of the write perm error, I am not sure we can trust the state of the indexes on the tape. According to the LTFS standard v2.4:

A volume that has been locked because of a permanent write error "shall be mounted as read-only using the highest generation index available on the tape in either partition"

Is it possible that the highest index available on the tape corresponds to a previous generation and therefore does not list the latest files on the tape?

Could you confirm that, after a write perm error, if the latest index is successfully found in either partition, that index will always point to the latest file on the tape?

Really appreciate your support

Regards

piste-jp commented 2 months ago

First of all, "data lost" or "data loss" is a really strong term for storage engineers. It must be used only when data that was once written to the medium disappears unexpectedly for some reason. So we should call this a data loss only when it happens because of a bug in LTFS's logic.

In this case, it is clear that your scenario is not a data loss problem at all, because LTFS never writes (or overwrites) anything during a read-only mount.

Second, it looks like you described the scenario based only on reading through the mount process logic. I believe that is not the correct approach. You also need to understand the implementation of the write side.

Long story short, when LTFS gets a write perm from the drive, LTFS writes an index to the other partition, writes the current index information to the MAM, and marks the tape as a single write perm tape. So the latest index shall be readable by the tape drive at mount time.

Is it possible that the highest index available on the tape corresponds to a previous generation and therefore does not list the latest files on the tape?

The drive returned a GOOD response after writing the latest index to tape, so LTFS marks it as a single write perm tape. From the specification point of view, the drive must either find the latest index correctly or return a read perm error.

Could you confirm that, after a write perm error, if the latest index is successfully found in either partition, that index will always point to the latest file on the tape?

I can review it if you provide such code. But honestly, I am not sure this is needed because:

  1. The required information is already logged
  2. The final result (mounting with the index found by this scan) may be the same

I believe reporting a mount error (and failing) when the generation of the index read matches the one in the MAM is of no benefit to users.

perezle commented 2 months ago

Nice talking to you @piste-jp. Yes we will take care of the pull request. Thanks!