Mellvik / TLVC

Tiny Linux for Vintage Computers

FAT not releasing blocks when files are mv'd out #28

Closed Mellvik closed 6 months ago

Mellvik commented 1 year ago

When using mv to move a file off a FAT file system, the file's space is not released and the file system can no longer be unmounted.

Using cp and rm works fine. Repeatable on floppies and HD file systems.

ghaerr commented 1 year ago

What exactly is the setup to repeat this? I am guessing you are trying to rename a file across mounted filesystems? mv and the kernel sys_rename use the sys_link mechanism internally to create a hard link, before unlinking the original.

If so, it would be interesting to see what happens when renaming/linking a file across two MINIX filesystems, which also can't be supported by the kernel. The mv command relies on certain kernel error return values to decide between copy and link, which may interfere with what I think is happening here: an attempted link followed by an unlink in both cases. I don't see any code checking for a link across mounted filesystems that would prohibit an improper link attempt.

Mellvik commented 1 year ago

Hi @ghaerr, yes - it's the same on minix-to-minix. Quite serious since it screws up the source file system. The destination file (and file system) is OK.

ghaerr commented 1 year ago

> it's the same on minix-to-minix.

Hmmm, strange. I can't get that to fail. Looking at the mv source, it checks for EXDEV and copies the file. I would guess that's the code path that mv is taking for MINIX-to-MINIX moves.

Mellvik commented 1 year ago

Thanks, that's good to know. Must be broken by a recent change then!

ghaerr commented 1 year ago

Here's the code in do_mknod that ends up being called from sys_rename, which calls sys_link:

        else if ((offst == (int)offsetof(struct inode_operations,link))
                    && (dirp->i_dev != ((struct inode *)mode)->i_dev)) error = -EXDEV;

The inode device numbers are compared, and if they are not equal an EXDEV error is returned. Perhaps your problem has something to do with device or buffer device numbers...?

When EXDEV is returned, mv doesn't print an error, and falls into the copy routine:

        ... // errno from rename()
        if (errno == EPERM && access(destname, F_OK) < 0 && !isadir(srcname))
            goto copy;

        if (errno != EXDEV) {
            perror(destname);
            continue;
        }
    copy:
        if (copyfile(srcname, destname, 1))
            continue;

Mellvik commented 1 year ago

I'm sure it's a buffer level problem. I'm running with L1 buffers only (12 of them). (I'm tracing a 'problem' where a block is being written twice - inconsequential, but an itch I cannot let go of.)

ghaerr commented 1 year ago

> I'm tracing a 'problem' where a block is being written twice

I traced down an issue that could be related: with simultaneous activity on an inode block (or even a data block), a race condition can arise where the block is marked dirty and started for I/O while another process is writing or waiting on it. The block is then written and its dirty bit cleared, but sync_buffers does not re-check the dirty bit and schedules the block for I/O again, thus writing it twice.

The fix is simple (and included in the Linux 2.0 buffer code): re-check b_dirty just before the ll_rw_blk call in sync_buffers. I haven't added this fix because I wanted to see when it was happening, so instead I discard the I/O request in make_request and report "block not dirty" there. I see this happen now and again during heavy simultaneous I/O activity.

Mellvik commented 1 year ago

Very interesting indeed, @ghaerr. This may well be part (or all) of the same problem. Here's what I see (the list_buffers output is very helpful) when copying a file to floppy:

This repeats for every sync until the entire file has been written, and it is the same for L1 only and L1+L2 (XMS untested). The last block becomes the first in the next list and is rewritten; nothing bad happens.

While debugging, I've studied the penalty for not making the metadata blocks somehow sticky. It's significant, and experimenting with this is on my list for when I have the time.

ghaerr commented 1 year ago

> the last block (130) from the prev list is first on the new list and being written again.

If for some reason the application(s) is not writing a full (and aligned) block at a time, this would be normal if the application got slept for some reason during I/O.

> Even though the dirty flag was cleared after the first write (then set again).

If the dirty flag is cleared and then set again, that could also be completely normal - for instance, the application could have been waiting for a locked bitmap buffer to become available after I/O complete, which slept the task, only to find that the task continued writing a portion of the same data block again.

> this repeats for every sync until the entire file has been written,

Another thought would be to only start I/O on data (not bitmap) buffers on the first pass of sync_buffers, or something like that, rather than locking all buffers for write. I can imagine that locking all buffers may be a bit much when only a single new buffer is required. More instrumentation/testing may be required to fully understand what might be a better solution. In later versions of Linux, only a portion of the buffers are written, and that's done using a separate kernel task (bdflush), instead of sleeping task(s) that call sync_buffers. In this way, the overall scheduling of syncing buffers is handed off away from individual applications requiring buffers.

On another note, IIRC I thought I saw you using b_prev_lru for something unrelated to the LRU list... this may have been in the raw driver; check that as IMO changing any of the LRU pointers outside of the buffer routines themselves is dangerous!

> I've studied the penalty for not making the metadata blocks somehow sticky. It's significant

Are you saying that the brelseL1_index call(s) in get_free_buffer may be a problem? I thought in my version I only allowed releasing them in a second pass to curtail any penalty. I'm interested in what you find on this. I've been running ELKS on 8 L1 and 10 L2 buffers (in QEMU) and it seems to run quite well. Obviously, not much actual disk buffers. I have not really learned yet how to approach calculating an optimum strategy for buffer counts given the new algorithms...

Mellvik commented 12 months ago

>> the last block (130) from the prev list is first on the new list and being written again.
>
> If for some reason the application(s) is not writing a full (and aligned) block at a time, this would be normal if the application got slept for some reason during I/O.

>> Even though the dirty flag was cleared after the first write (then set again).
>
> If the dirty flag is cleared and then set again, that could also be completely normal - for instance, the application could have been waiting for a locked bitmap buffer to become available after I/O complete, which slept the task, only to find that the task continued writing a portion of the same data block again.

I agree, it could be normal. This is, however, a very controlled environment - one file, one device, and an 'error' that repeats 100% predictably. I'll be back on track shortly.

> Another thought would be to only start I/O on data (not bitmap) buffers on the first pass of sync_buffers, or something like that, rather than locking all buffers for write. I can imagine that locking all buffers may be a bit much when only a single new buffer is required. More instrumentation/testing may be required to fully understand what might be a better solution. In later versions of Linux, only a portion of the buffers are written, and that's done using a separate kernel task (bdflush), instead of sleeping task(s) that call sync_buffers. In this way, the overall scheduling of syncing buffers is handed off away from individual applications requiring buffers.

Yes, this is definitely the way to go and will be a next step in the process. At this point I didn't want to make more changes until the double write has been completely understood and fixed if it turns out to be a bug.

> On another note, IIRC I thought I saw you using b_prev_lru for something unrelated to the LRU list... this may have been in the raw driver; check that as IMO changing any of the LRU pointers outside of the buffer routines themselves is dangerous!

Ugh - thanks for the reminder. Yes, a temporary kludge that erroneously made it into the PR. A b_seg replacement was/is needed when running raw I/O with the L1 cache only. It has since been fixed, but not pushed.

>> I've studied the penalty for not making the metadata blocks somehow sticky. It's significant
>
> Are you saying that the brelseL1_index call(s) in get_free_buffer may be a problem? I thought in my version I only allowed releasing them in a second pass to curtail any penalty. I'm interested in what you find on this. I've been running ELKS on 8 L1 and 10 L2 buffers (in QEMU) and it seems to run quite well. Obviously, not much actual disk buffers. I have not really learned yet how to approach calculating an optimum strategy for buffer counts given the new algorithms...

Remember I'm running mostly L1-only - a very constrained environment, which makes the bugs and bottlenecks all the more visible. And VERY slow on floppies! Anyway, I've seen cases where every other read was of the same metadata block - 4 times in a row.

A separate issue will be created for this challenge.

tyama501 commented 6 months ago

Is this the same issue that @ghaerr fixed on elks recently?

Mellvik commented 6 months ago

Thanks for the heads up, @tyama501 - I'll look that up. Can you point me to the elks issue?

tyama501 commented 6 months ago

Hello @Mellvik

Here it is. https://github.com/ghaerr/elks/pull/1824

Thank you.

Mellvik commented 6 months ago

Fixed in PR #52. Thanks, @tyama501 .