abarisani commented 3 years ago

Hello there.

I am evaluating go-diskfs (nice project!) for use with our tamago framework to port interlock to bare metal. The integration is promising however I am stumbling upon some issues.

First of all while using io.Copy on a FAT32 file opened with OpenFile I notice that performance is pretty slow, it seems that write speed decreases as the file gets larger, I am using something similar to the following:

f, _ := storage.OpenFile(osPath, os.O_RDWR|os.O_CREATE|os.O_EXCL|os.O_TRUNC)
io.Copy(f, buf)

I think this relates to https://github.com/diskfs/go-diskfs/pull/110#issuecomment-818249267

Additionally I notice that writes to the underlying FileSystem can be of sizes which are not a multiple of its block size, I am not sure if this is intended or not.

Despite this being correct or not, for sure it's problematic that WriteAt errors are not all trapped, for example my custom block device (which reads/writes from an SD card) does not allow writes which are not a multiple of block size, the WriteAt error was raised but it's never caught in fat32.allocateSpace (also not in some instances of fat32.Create).

So my issues are the following:

can write performance be improved ?
is it correct for WriteAt to be invoked without honoring the FileSystem block size ?
all WriteAt errors should be trapped.

Thanks!

deitch commented 3 years ago

Hi @abarisani

nice project!

Thanks! It is fully OSS without corporate backing, so push it ahead whenever possible.

I notice that performance is pretty slow, it seems that write speed decreases as the file gets larger, I am using something similar to the following:

That is possible. Do you have any benchmarks? It could be the processing, it could be the read from the file, it could be the underlying filesystem (you said you were opening a FAT32 file, and not a raw block device).

Additionally I notice that writes to the underlying FileSystem can be of sizes which are not a multiple of its block size, I am not sure if this is intended or not.

Do you mean the blocksize of the filesystem on the file image? Or the blocksize of the underlying filesystem on which the image exists? I assume you mean the former, as the latter would be out of scope and handled by the OS on which it is running.

all WriteAt errors should be trapped.

Agreed. Care to submit a PR?

abarisani commented 3 years ago

We are not running any OS and mapping util.File directly on a block device which represents a microSD card. Therefore there is no underlying file system and we are not using partitions for now.

So the speed is not affected by the OS or a filesystem.

The speed decreases incrementally throughout the write, this is unrelated to the SD card.

I performed a similar test also on Linux and accessing the SD card directly through /dev and the performance is also pretty slow.

This only happens when writing and not when reading.

For the second issue the block size I am referring to is the one passed to Create, I am not sure if my expectation for it to be used in every WriteAt is correct or not, I can work this around if I am mistaken. The lack of error trapping remains and happy to submit a PR for it.

abarisani commented 3 years ago

Correction: I am not noting the incremental slowdown in Linux, only within my tamago integration (where I read the body from an HTTP client), so I need to investigate what is going on there.

On Linux the performance is faster (though still pretty slow at < 1 MB/s) but at least it's always linear.

Closing the issue until further investigation on my side confirms something wrong with go-diskfs rather than my integration of it.

Thanks!

abarisani commented 3 years ago

The following test program performs a file write, under Linux, on a file-backed FAT32 filesystem. The test program simply wraps the underlying block device to trace its WriteAt invocations.

A test against a 32GB microSD card (seen as a block device at /dev/mmcblk0), transferring a 15063751 bytes (~15M) PDF, writes it in about 18 seconds.

When performing such test I see a sequence of writes possibly responsible for the poor performance:

460 repeated writes of 3895808 bytes (~3.8M) to offset 16384 (sector 32)
460 repeated writes of 3895808 bytes (~3.8M) to offset 3912192 (sector 7641)
461 repeated writes of 32768 bytes to offset 7808000 (sector 15250)

There are 459 transfers of 32768 and 1 transfer of 23239 bytes which account for the actual 15063751 bytes of the file being copied. Of course more than the file itself is written under any filesystem scheme to account for metadata, however the list of metadata writes feels excessive.

If I repeat the same test on a 50MB FAT32 filesystem, transferring the same 15063751 bytes (~15M) PDF, I see:

462 repeated writes of 409088 bytes to offset 16384 (sector 32)
462 repeated writes of 409088 bytes to offset 425472 (sector 831)

If I repeat the same test on a 100MB FAT32 filesystem, transferring the same 15063751 bytes (~15M) PDF, I see:

462 repeated writes of 818688 bytes to offset 16384 (sector 32)
462 repeated writes of 818688 bytes to offset 835072 (sector 1631)

Again I am not a FAT32 expert so I am not sure what I am seeing here.

package main

import (
        "io"
        "log"
        "os"

        "github.com/diskfs/go-diskfs/filesystem/fat32"
)

const (
        blockSize = 512
        fsSize    = 1048576 * 100
        fsPath    = "/tmp/fat32.bin"
)

type BlockDevice struct {
        file  *os.File
}

func (b *BlockDevice) Init(path string) (err error) {
        b.file, err = os.OpenFile(path, os.O_CREATE|os.O_EXCL|os.O_RDWR, 0600)
        return
}

func (b BlockDevice) ReadAt(p []byte, off int64) (n int, err error) {
        n, err = b.file.ReadAt(p, off)
        log.Printf("ReadAt offset:%d size:%d read:%d", off, len(p), n)
        return
}

func (b BlockDevice) WriteAt(p []byte, off int64) (n int, err error) {
        n, err = b.file.WriteAt(p, off)
        log.Printf("WriteAt offset:%d size:%d written:%d", off, len(p), n)
        return
}

func (b BlockDevice) Seek(off int64, whence int) (int64, error) {
        log.Printf("Seek offset:%d whence:%d", off, whence)
        return b.file.Seek(off, whence)
}

func main() {
        dev := BlockDevice{}

        if err := dev.Init(fsPath); err != nil {
                panic(err)
        }

        fs, err := fat32.Create(dev, fsSize, 0, blockSize, "godiskfs")

        if err != nil {
                panic(err)
        }

        input, err := os.OpenFile("/tmp/example.pdf", os.O_RDONLY, 0600)

        if err != nil {
                panic(err)
        }
        defer input.Close()

        output, err := fs.OpenFile("/example.pdf", os.O_RDWR|os.O_CREATE)

        if err != nil {
                panic(err)
        }
        defer output.Close()

        _, err = io.Copy(output, input)

        if err != nil {
                panic(err)
        }
}

deitch commented 3 years ago

Excellent issue, sample code to recreate it and everything.

I ran that code, sorted the results, here is what I get for repetitions:

 482 offset:16384
 482 offset:3584
 482 offset:512
 482 offset:835072
 484 offset:1653760

Not much of a mystery here, but it does take understanding how the fat32 filesystem works.

As you write your file out, it uses allocated clusters. As soon as the number runs out, it needs to allocate some more, which means:

Read the fat table into memory (actually, it already keeps it in memory for efficiency sake)
Extend the existing cluster chain with new clusters as needed
Write the FAT table to the primary and the backup
Update the Filesystem Information Sector (FSIS) with the last allocated cluster
Update the directory entry that the file size has changed.

The sectors it uses are:

Sector 32 is the primary FAT table (offset 16384).
The backup FAT table is immediately after that, which depends on number of sectors per FAT (which IIRC depends on the size of your filesystem). In the example code you gave, the primary FAT table is at sector 32. The FAT size is 1599, which places the backup at sector 1631 (offset 835072).
The FSIS is at Sector 1 (offset 512).
The Backup Boot Sector is at Sector 7 (offset 3584). This seems strange to me, as it should not be writing to there. Probably something that gets called multiple times by accident.
Offset 1653760 is sector 3230. In this case, it is the directory file for the parent, i.e. where the directory entry is for this file.

So it does all make sense (except for the Backup Boot Sector, which surprises me).

The question is, what can be done to make it more efficient?

The first thing is to use CopyBuffer instead of Copy(), so you can control the buffer size. the exposed fat32.File just has a Write([]byte); it has no way of knowing if this is one write out of 5,000, or just a single write. So it cannot allocated 15MB of space upfront, it has to work with what it has. If it were working with writes of 1MB chunks, it would only need to allocate clusters (and everything else it does) 15 times, instead of nearly 500. Of course, that also means you are using a 1MB buffer in memory, but that may be a fair price to pay.

This is the same thing a kernel (or filesystem driver) has to deal with when you do:

cat some15MBfile > /some/other/path

You are streaming the file, the kernel has no way of knowing how big that will be. I believe there are whole areas of filesystem management that get into that.

The other thing you could do is use an fallocate or equivalent. I don't think we have anything like that in fat32, but I would be open to a PR on it. Then you could pre-allocate a file that is the target size very efficiently, and then afterwards write into it.

abarisani commented 3 years ago

Pre-allocation would be nice.

It is not clear to me why the transfers to the primary FAT table are so large, as well as dependent on the overall fileystem size. Could you shed some light on this?

deitch commented 3 years ago

It is not clear to me why the transfers to the primary FAT table are so large,

Because it has the whole FAT table in memory, and so it writes it to the underlying disk, twice (primary and backup), with each change.

as well as dependent on the overall fileystem size

Calculate it roughly as:

(Total number of clusters) = (disk size) / (cluster size)

If you have a larger disk, you need more clusters to store data, and hence the FAT table has to be larger.

There is an argument to be made for writing out just the changed sectors of the FAT table. E.g. if there are 1599 sectors for the FAT table, and you changed 1, write that one (primary and backup). But you do run a risk of getting them out of sync. I would be open to it, but we need to think it through.

abarisani commented 3 years ago

I confirm that CopyBuffer reduces the number of primary FAT table writes, however performance is still not ideal. It would be ideal for the fat32/util File API to expose a pre-allocation method (like fallocate as you say) so that the space is staged once.

vtolstov commented 3 years ago

hm, does it possible not write immediately fat table on disk, but write memory copy of table in per duration interval? something like CommitInterval/SyncInterval ? Or this does not help at all ?

abarisani commented 3 years ago

I don't know how operating systems generally handle this so I am not sure how I can provide insights. For my specific use case if there would be the option to keep the table copy in memory and only write it once the file hits Close() I guess it would solve the issue.

deitch commented 3 years ago

You are getting into kernel and filesystem design, which is not my expertise. I know enough to cause trouble.

As far as I understand it, there usually is some form of cache of anything to write (not just the system metadata, but any writes to disk), which are flushed out either when the amount of change in cache becomes too big, or after some period of time, or when orderly shutdown happens. Separately, you can also force a sync at any given moment.

This is not a trivial thing to handle here, but, if properly planned out, I am open to it.

abarisani commented 3 years ago

There are several layers of caching on any system, however I think that when evaluating a pure Go filesystem library disk or OS caching are not relevant. I think the performance should be optimized, if possible, ignoring such layers.

It seems to me that the following optimizations are possible and non mutually exclusive:

avoiding an update of the entire FAT table at each write, but rather updating only the relevant sector (with proper locking and assuming that the library has exclusive read/write access, which has to be the case, synchronization should not be an issue here).
allowing to provide a "hint" for the file size with a new CreateFile API, which would create an empty file to be written without constant updates of the table.

An alternative strategy, which you suggested, is to add your own layer of caching and keeping the fat table in memory for updates before committing it to the disk, this feels possible but only with the assumption (or better enforcement via mutex) of locking the filesystem for the duration of the write.

deitch commented 3 years ago

Makes sense. Those certainly are simpler.

I don't know that I would do CreateFile as much as a "write empty" type option. I think we should stick as close to the basic filesystem contract that exists in go, which itself closely reflects the underling one from C. These are the relevant calls in existence in os and os.File:

File.Create, which only takes a filename
File.Truncate, which is the equivalent of fallocate
[os.OpenFile(https://pkg.go.dev/os#OpenFile), which we already support

We can do Create(), but it doesn't buy us much but convenience. I think just adding Truncate() support would do lots.

abarisani commented 3 years ago

I agree, Truncate() feels like a good idea.

Frostman commented 2 months ago

255 gives us ~20 times speed up for 1G+ size

diskfs / go-diskfs

FAT32 io.Copy performance #130

255 gives us ~20 times speed up for 1G+ size