PiSCSI / piscsi

PiSCSI allows a Raspberry Pi to function as emulated SCSI devices (hard disk, CD-ROM, and others) for vintage SCSI-based computers and devices. This is a fork of the RaSCSI project by GIMONS.
https://piscsi.org
BSD 3-Clause "New" or "Revised" License

raSCSI needs to periodically check disk write cache(s) #335

Open · Pacjunk opened 3 years ago

Pacjunk commented 3 years ago


Describe the issue

In my testing I have noticed that raSCSI only flushes the disk cache(s) either when it is full, or at shutdown (thanks to a recent change). This potentially leads to unflushed data sitting in the cache for extended periods when the cache is not full.

The following highlights the issue:

Someone probably needs to check this on a Mac, as I'm surprised this hasn't been noticed before if it occurs there. The unclean shutdown scenario is more common than you would think, especially when the raSCSI is wired to the host's power supply. People just flick the switch without remembering to shut down the Raspberry Pi.

Now, the cache is there for performance reasons, and it should not be flushed at random times either, as that may slow down current I/O.

I'm not familiar with the raSCSI code or Pi programming in general, but here is how I see something being implemented:

- Have a counter or register that increments with every I/O. This should not add too much overhead. Let's call it IOCOUNT.
  It doesn't matter if it wraps around.
- Have another variable that records IOCOUNT when the cache is checked. Let's call it IDLECOUNT, initialized to 0.

- Have some process that wakes regularly (e.g. every 5 seconds) and does the following:

    debug("Checking for idle")
    If IOCOUNT = IDLECOUNT then
        ! There have been no I/Os in the last 5 seconds
        debug("Idle detected")
        If the cache is dirty then
            Flush the cache & debug("Cache flushed")
    IDLECOUNT = IOCOUNT

If the system is not idle, then this should not add too much overhead. I suppose it could break if an OS is constantly pinging a disk (e.g. a quorum disk). The checking interval may need to be fine-tuned in this case.
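
A minimal C++ sketch of this idle check (io_count, cache_dirty, and FlushCache() are illustrative stand-ins, not actual RaSCSI names):

    #include <atomic>
    #include <chrono>
    #include <cstdint>
    #include <thread>

    // Illustrative stand-ins for the real cache hooks.
    std::atomic<uint64_t> io_count{0};     // incremented on every I/O; wrap-around is fine
    std::atomic<bool> cache_dirty{false};  // set by the write path

    void FlushCache() {
        // Stub: the real implementation would write all dirty tracks to the image file.
        cache_dirty = false;
    }

    // Background task: wake every 5 seconds; if no I/O arrived since the last
    // check and the cache is dirty, flush it. IOCOUNT/IDLECOUNT from the
    // pseudocode above map to io_count/idle_count here.
    void IdleFlushTask(const std::atomic<bool>& running) {
        uint64_t idle_count = 0;
        while (running) {
            std::this_thread::sleep_for(std::chrono::seconds(5));
            const uint64_t current = io_count.load();
            if (current == idle_count && cache_dirty) {
                FlushCache();
            }
            idle_count = current;
        }
    }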

Another option might be to allow the use of write-through caching. This would probably have a performance hit on slower SD cards, but at least the data would be consistent.

akuker commented 3 years ago

I really like this idea. I think this is important to implement, since there really isn't a good way to safely shut down the RaSCSI when it's mounted internally. Thank you for the suggestion!

caver01 commented 2 years ago

Thanks for opening this issue. I believe I have seen this same issue on a mac, so adding (anecdotal) confirmation of similar issues.

In my case, I set up RaSCSI on a Pi Zero W as an internal solution for a Mac SE/30 I am restoring. I was not really considering Pi shutdown, even though I have used Raspberry Pis for years and have even built soft shutdown solutions for them. In this case, I was pretty excited to try this.

I set up multiple drive images that I was able to successfully boot. But it was very easy to mess them up... While running successfully, I could simply shut down. Since the SE cannot turn itself off, you get to a RESTART dialog, at which point it should be safe to cut the power. If I did that, I had about a 50/50 chance (or worse) of making my disk image unbootable.

This was a bit of a nightmare while restoring this Mac. I was constantly having to start over! I quickly realized I could avoid disk image corruption by opening the RaSCSI web interface and detaching my disks BEFORE cutting power. In some cases, I could resurrect the disk image by booting from another one, then mounting the corrupt image file and reinstalling the system, or even just updating the HDD driver. One time, however, that did not work and I had to reformat.

In no case did cutting the power actually corrupt the Pi's image. It only clobbered the attached drive image files, which supports the idea that cached writes were getting lost. Come to think of it, I did once have a file change made before shutting down get lost, but I guess I brushed it off; there were so many repeated setups for me because of the images getting corrupted.

I posted some questions about this on the 68kmla forums because I had also decided to re-cap my Mac's analog board and PSU. I was not 100% sure my problem wasn't something on the SCSI bus, like low term power or something. I was so frustrated by the situation that I de-soldered the headers on the RaSCSI so I could reverse them and try a Pi 3B+ in case my Zero was the problem, or to try external options, but my Mac was not cooperating. I could not get it to run externally, so I decided I should re-cap the Mac's analog board and PSU before trying again. I also set up BlueSCSI so I could have a working internal solution, since safe shutdown was so problematic.

Yesterday, I finished the recap work, and today landoGriffin on the 68kmla forums shared this link. So I am adding my story here in case it helps.

As far as fixing this: if I cannot shut off the Mac without pulling out my phone or another computer, that just makes the internal option completely impractical. I would hope the write cache is cleared the moment the drive is otherwise idle. If not, I would want an option to disable write caching completely. I suppose that kills write performance, but at least I would be safe to shut down when my computer says it is safe to cut the power.

caver01 commented 2 years ago

Also, not a criticism of Pacjunk's approach above: the concept of flushing the cache by completing the writes is sound, but 5 seconds is an eternity. I would think we would want to be in the 5 millisecond range! (OK, 5 ms is pretty quick, but it should be faster than I can reach over and hit the power switch.) Actually, this value is extremely important, so it would probably make the most sense not to pick the time arbitrarily, but to compare to write caching in actual drives of the era to see what was typically acceptable.

I do remember back in the day being able to disable write caching on drive controllers to avoid even the remote possibility of leaving writes on the table, so to speak, during a power failure, for example. We did this on certain critical systems to decrease the likelihood of corruption (for embedded systems mounted up in a crane, for example). In any case, I suspect a little research will uncover a good starting value for the idle-to-cache-write threshold timing.

Pacjunk commented 2 years ago

The effect on performance here is important. If you need to check the cache every 5 ms, then you might as well not have it. Shutting down mid-write is always asking for trouble, and you have the issue of OS caching as well as RaSCSI caching. My proposal is for the situation where you shut down the OS cleanly, then hit the power. There are usually a few seconds of idle in this case. It may be better with a 1 or 2 second flush, but not milliseconds. Given the sometimes random nature of I/O, can you really say it is idle after 5 ms?

Glad that someone else can reproduce it, and yes, it is a pain to have to get the phone out to shut down the raSCSI, especially when it is host-powered.

rdmark commented 2 years ago

I quickly realized I could avoid disk image corruption by opening the RaSCSI web interface and detaching my disks BEFORE cutting power

A small note that as of the October release, RaSCSI will automatically detach all devices before shutting itself down, i.e. if you do something like "rasctl -X" or by other means send the SHUT_DOWN command to the server. If you're running RaSCSI as a systemd service, it is configured to do the same when it is ordered to stop, e.g. when the system shuts down. This was a partial fix for this particular issue. Of course, this doesn't help in the scenario where the Pi suddenly loses power...

caver01 commented 2 years ago

Thanks for your rational perspective, @Pacjunk. You are right: milliseconds is an order of magnitude too fast, as this is the realm of drive head seek times! One or two seconds is reasonable. I suppose just knowing there is a safe time in the first place is a start.

@rdmark I appreciate the use of the command to trigger the detach to reach a safe state, especially incorporating that step into the service shutdown. That at least means it would be safe to skip the detach steps and issue a shutdown command, but it still requires a second device. Where it could be useful, though, is if we could set up a soft shutdown trigger on an unused GPIO pin. For an external device, you could plug it in for power, but have a power-off button that would run a safe shutdown script similar to examples we have probably all set up on other Pi-powered projects.

Thanks for the dialog. RaSCSI is a cool solution with a lot of promise, especially the DaynaPort Ethernet. I am looking forward to diving into that next, but for a hands-free internal drive solution, this write cache issue really needs to be resolved. I am happy to test more as needed.

akuker commented 2 years ago

Has anyone observed whether the hosts properly send the SYNCHRONIZE CACHE command when they shut down? In the short term, it should be fairly easy to implement that command so that we at least flush the cache when the host shuts down.

Pacjunk commented 2 years ago

I don't see it in the trace logs. I assume the trace will log unhandled commands?

uweseimet commented 2 years ago

@Pacjunk Yes, it would. I have never stumbled upon a platform that uses SYNCHRONIZE CACHE, by the way.

All in all, periodically writing the cached data would not resolve the general issue: the Pi can crash or be powered off at any time, so periodic flushing cannot prevent data loss. In addition, just the fact that RaSCSI flushes its caches does not mean that Linux immediately writes this data to the disk. From that perspective, IMHO working on this ticket does not provide a reliable benefit. It might make data loss a bit less likely, but that's not worth the effort. What you can do instead is mount the filesystem for synchronous writes. This should at least ensure that Linux immediately writes any data.

I suggest closing this ticket, because the idea sounds fine, but there are no benefits in practice.

caver01 commented 2 years ago

Hmmm. I think saying there are no benefits in practice goes a bit too far, as I can reliably ruin disk images on an internal RaSCSI. Making data loss less likely is the benefit. Not being able to make data loss impossible is not a reason to close this issue. I suspect that not addressing this at all will lead to widespread rejection of the solution for internal use, and some pretty stark warnings about data loss even for external use.

I know this—without any adjustment here, I won’t ever use RaSCSI as an internal drive and I’d be obligated to share that perspective whenever I read about someone going after that option.

I appreciate the suggestion about synchronous writes, though. I mentioned that earlier. There was talk about that ruining performance, but having the option gives us a chance to test this in practice. It also reflects the same option present on physical drives.

uweseimet commented 2 years ago

@caver01 There is always a tradeoff between performance and reliability. The more often you write, the slower the system gets. Without synchronous writes at the Linux level, even a perfect solution on the RaSCSI side might not be worth a lot, maybe even nothing, considering that Linux caches a lot of filesystem data in memory as long as there is still free memory. Having synchronous writes in Linux and at the same time no caching at all in RaSCSI would probably be the only reliable solution. From that perspective, maybe this ticket should deal with an option to disable caching in RaSCSI. I guess this is easier to implement.

I'm wondering: Is there any other (i.e. non-RaSCSI, but similar to it) solution that does periodic writes?

uweseimet commented 2 years ago

I just checked the RaSCSI code. SYNCHRONIZE CACHE currently does nothing. It would not be a big deal to flush the cache on a SYNCHRONIZE CACHE command, just like it is already done on a STOP UNIT (eject) command. But that would not resolve the Linux caching issue and the fact that the usual drivers do not use SYNCHRONIZE CACHE.
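
For illustration, the handler could be as simple as this sketch (the Disk type and Flush() here are simplified stand-ins, not the actual RaSCSI classes):

    // Sketch of a SYNCHRONIZE CACHE (10) handler; the opcode is 0x35 in SCSI-2.
    struct Disk {
        bool Flush() {
            // The real code would write all dirty cached tracks to the image
            // file, just as the STOP UNIT (eject) path already does.
            return true;
        }
    };

    // A complete implementation would honor the LBA and block-count fields
    // of the CDB; flushing the whole cache is a safe superset of that.
    bool SynchronizeCache(Disk& disk) {
        return disk.Flush();
    }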

Which option offered by physical drives are you referring to when you say "It also reflects the same option present on physical drives"? As far as I know, physical drives are designed to use the remaining power from their capacitors to finish pending writes (data potentially cached by the drives themselves) when they are powered down.

https://github.com/akuker/RASCSI/issues/497 would eliminate any caching issues, by the way, because all SCSI commands would be passed directly to the Linux kernel, and the SG driver passes the commands straight to the drive. No host filesystem is involved, thus no software caching. But image files would no longer work when using this feature, because the commands are executed at the raw device level. Memory cards would work, for instance.

caver01 commented 2 years ago

In my experience setting up storage solutions in servers (we are talking early 1990s here, so it's somewhat aligned with some of the retro systems where I am using RaSCSI), we often set jumpers or configured drive controllers in software to disable write caching, to reduce the likelihood of corruption where customers were hyper-concerned about data integrity. Obviously, there was never a situation where possible corruption was acceptable, but this was definitely an option offered by the devices and we took advantage of it. At the time, we knew we were taking a performance hit doing that, but it was walking that balance. RAID solutions also helped, and a UPS made everyone feel better.

uweseimet commented 2 years ago

Yes, I see. Some drives (e.g. QUANTUM SCSI drives) offer switching off caching by software. There are mode pages for that, which you can manipulate with MODE SELECT.

uweseimet commented 2 years ago

The scenario of having the Pi powered by the host and switching off the host can be addressed to a certain degree by implementing a custom SCSI command (or a vendor-specific mode page for MODE SELECT) that shuts down RaSCSI or the whole Pi. An OS that can run scripts (or binaries) during its shutdown phase could launch a script that sends this custom command. Linux or any Unix can do that, as can MiNT or MagiC on the Atari. I guess a Mac can also execute code during its shutdown. The actual shutdown process already exists as part of the RaSCSI remote protobuf interface (the SHUT_DOWN operation), but currently it can only be triggered by rasctl, the web interface or the RaSCSI Control app. What's missing is a means to do the same on the SCSI level. I am going to think about adding such a custom command to RaSCSI.
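
As a rough sketch of the idea (the opcode and every name below are hypothetical, chosen only to illustrate the dispatch; RaSCSI does not implement this):

    #include <cstdint>

    // Hypothetical vendor-specific opcode; 0xC0-0xFF is the vendor-specific
    // range in SCSI, but the actual value would be a design choice.
    constexpr uint8_t kVendorShutdown = 0xC0;

    void TriggerShutDown() {
        // Stub: in RaSCSI this would invoke the same logic as the protobuf
        // SHUT_DOWN operation (flush caches, detach devices, halt the Pi).
    }

    // Called from the command dispatcher with the first CDB byte.
    bool DispatchVendorCommand(uint8_t opcode) {
        if (opcode == kVendorShutdown) {
            TriggerShutDown();
            return true;
        }
        return false;  // not a recognized vendor-specific command
    }

A shutdown script on the host would then only have to send that single command over the SCSI bus.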

Pacjunk commented 2 years ago

You can never totally eliminate the chance of corruption, but what we should be doing is reducing that possibility. The most likely scenario is where you shut down the OS, then flick the power switch (without manually shutting down the Pi). Flushing the cache regularly would fix this.

As far as I am aware (I come from the server world), controllers and disks will always flush data on idle. RaSCSI does not do this, and unflushed data can sit in the cache for hours (or permanently) until the Pi is shut down or the devices are detached (which has been coded to flush the cache).

Personally I think it is very important to flush the cache, but if others disagree, then I humbly request that an option be added to disable the write cache completely. I would rather have the performance hit. I have been playing a bit with BlueSCSI (which does not do write caching) and I find the performance acceptable. I have never had a corruption issue either, and due to its lack of any management features it is never cleanly shut down.

Thanks for looking into it.

caver01 commented 2 years ago

Triggering shutdown is an interesting idea, but I am wondering how practical it is to rely on the host OS for that. One common use case, for example, is classic Macs. Who writes new software for old operating systems? I am certainly not equipped to do that. There is a "Shutdown Items" folder to house a script/app, but that folder did not exist until System 7.5. What about folks with prior versions? Not to mention, is the timing of executing "Shutdown Items" even appropriate for pulling a drive out from under the OS? Surely there are other tasks the OS is executing during shutdown after it launches an app or script.

This does make it less of a drop-in replacement for an actual SCSI HDD. And to that point, why don't real SCSI drives require this? I suppose I am pointing out that the minimum should at least be like-for-like functionality. Can I corrupt a real HDD by killing the power? Probably, but you NEVER see this happen when looking at the dialog that says "It is now safe to turn off the computer". RaSCSI should behave reliably in this situation, but it doesn't.

uweseimet commented 2 years ago

@caver01 I think we already answered why real drives do not require this: either you switch the cache off (by jumper or MODE SELECT), or they use the remaining power from their capacitors to write the pending data. Any replacement solution running on an OS with a filesystem cache is most likely not able to do exactly what a real drive does. You might require a solution that only a real drive can provide, not any software-based approach. Except https://github.com/akuker/RASCSI/issues/497, which would not involve any caching, because SCSI commands are passed through to the device. As mentioned, image files would not work with this solution, but that's just the point: either you use a file-based solution, which has an implicit caching issue, or you directly access the hardware without any intermediate filesystem layer involved.

caver01 commented 2 years ago

You did mention the capacitor hardware bit. Maybe a better comparison, then, is the fact that I can run a BlueSCSI device, which also uses image files on the SD card, and these are not getting corrupted under the same shutdown circumstances. Perhaps they implemented the same cache write triggers on power off? I dunno. It is open source, so perhaps we can find their solution.

Pacjunk commented 2 years ago

BlueSCSI does not have its own cache like RaSCSI does, so there is nothing to flush there. There is no Linux OS in there either. I believe the SD card library does do some caching, but the BlueSCSI code flushes after each group of blocks is written to the card.

akuker commented 2 years ago

BlueSCSI doesn't use a RAM cache, so it's not going to have this problem. I'm assuming SCSI2SD is the same. They're not running a full OS stack.

My two cents on this issue... I think it is definitely an issue with RaSCSI that it doesn't flush the data when it's idle. IMHO, letting the cached data just hang out in RAM in perpetuity is a bug and needs to be fixed.

As a proposed experiment, I think we should try updating the disk_track_cache.cpp file to completely disable data caching. There are some online threads that suggest mmap()'ing the file, but I'm not sure that's necessary; I'm not sure it's really going to matter. This will allow the operating system's file caching to take over and manage the cache. The operating system should be much better at managing cached data than this custom RaSCSI code. It appears Linux can be tuned so that data will be written to disk within a timeout period. We'll still have the issue that data written in the last few moments before shutdown could be lost.

From my limited research tonight, there are many opinions on the web that you shouldn't manually cache files in RAM anyway. RaSCSI doesn't do anything elaborate like read-ahead, so there really is no reason for it to have its own caching scheme. (Running on bare metal might be a different story... but that support was removed from our code fork.)

Let's keep the discussion going on this issue if anyone has a chance to do that experiment. I'll dig into it when I have a chance, but I'm not going to have a ton of time to investigate in the near term.

uweseimet commented 2 years ago

There is a solution for Atari users who would like to flush the RaSCSI cache when a drive is idle: https://github.com/akuker/RASCSI/pull/644 flushes the cache on STOP UNIT. There is software for the Atari (AUTOPARK from the HDDRIVER distribution) that sends STOP UNIT to drives that have not been accessed for a configurable time in seconds. (With the next access this tool sends START UNIT for the respective drives.) Provided that STOP UNIT flushes the cache, the use of this software resolves the problem with the RaSCSI cache this ticket tries to address, at least for the Atari platform. In case image files are located on a filesystem configured for synchronous writes, or in case raw device files are used as image files (e.g. /dev/hda), Linux-related cache issues can also be eliminated, as already discussed.

In addition, https://github.com/akuker/RASCSI/pull/645 flushes the cache on SYNCHRONIZE CACHE. Currently SYNCHRONIZE CACHE is doing nothing.

akuker commented 2 years ago

I ran an experiment tonight and completely removed the RAM cache. Instead, my updates use mmap to virtually map the file into memory. This should allow the Linux disk cache management to work its magic without RaSCSI trying to outsmart it. It appears from the preliminary benchmark results that the performance isn't impacted significantly.

(Benchmark screenshots: cache_benchmark_mmap, cache_benchmark_baseline.)

I welcome anyone to look at the new branch I created https://github.com/akuker/RASCSI/tree/bug_335_cache_fix

There are probably more filesystem tweaks that should be made, but this is a starting point.

@uweseimet - I'd welcome your opinion on this change. (Well, everyone's opinions ;) )

caver01 commented 2 years ago

This is really good news! Thanks for doing those tests and for taking a hard look at this issue!

Pacjunk commented 2 years ago

Thanks for this. I'll do a build and check things out. Can't really do performance tests on my system, but I can certainly check for corruption.

uweseimet commented 2 years ago

@akuker I tested read performance with an Atari TT, using a Pi 4B:

With caching: 1130 KB/s
With mmap: 1000 KB/s

(With real hardware the throughput is about 1720 KB/s.)

About 10% loss of performance is quite a lot, but not unexpected. There will always be a tradeoff between performance and data safety. (My Linux systems never crash, so personally I would opt for the performance scenario.) One reason for the drop in throughput may be that Linux uses 4 KB memory pages: my guess is that multiples of 4 KB of data are always moved, which is inefficient for the typical 512-byte sector size. In addition, mmap consumes the limited space in the page directory, which may degrade performance in general. If you implement this, I suggest making this feature configurable, because RaSCSI is not that fast anyway. (I recently read a review on the internet pointing this out.)

From a technical perspective there should be a C++ CachingPolicy interface with initially two implementations, e.g. FilesystemCache : CachingPolicy (the current implementation) and NoCache : CachingPolicy (the mmap-based implementation). This approach is clean, easy to maintain, and makes it possible to add even better caches later without side effects. A probably even faster cache might be InMemoryCache : CachingPolicy: for a Pi with 4 or 8 GB of RAM (and future models will likely offer even more) such a cache should provide better throughput than the current filesystem-based cache. A command line option for rascsi should select the caching algorithm to be used; in the ideal case the algorithm can be selected per disk. Future policies might support separately configurable read and write caching. The default should be performance (i.e. backwards compatible), or the community should be asked whether they want speed or safety as the default.
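
A minimal sketch of that interface shape (class names follow the comment above; the method set and signatures are assumptions):

    #include <cstdint>

    // Sketch of the proposed policy interface.
    class CachingPolicy {
    public:
        virtual ~CachingPolicy() = default;
        virtual bool ReadSector(uint64_t lba, uint8_t *buf) = 0;
        virtual bool WriteSector(uint64_t lba, const uint8_t *buf) = 0;
        virtual bool Flush() = 0;  // persist any dirty data to the image file
    };

    // The current track cache would become one implementation...
    class FilesystemCache : public CachingPolicy { /* ... */ };

    // ...and the mmap()-based approach another, selected per disk via a
    // command line option.
    class NoCache : public CachingPolicy { /* ... */ };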

caver01 commented 2 years ago

Out of curiosity, one of the early suggestions was to clear the cache after some idle threshold time. Would that not allow the advantage of cache for higher throughput while satisfying the idle risk and shutdown scenario? We may never eliminate the risk due to power outage, but the main use case is the long idle delays when no activity is present, or right before a shutdown. Is that no longer a viable option?

uweseimet commented 2 years ago

@caver01 This would most likely eliminate the performance issue, but requires a completely different implementation approach.

Pacjunk commented 2 years ago

OK, I've done some testing, and although it is better, it is still not great! I still get corruption if I just flick the power off (after shutting down the OS, of course). If I wait about 20 seconds after the OS shutdown, then the file contents seem to be OK, and the volume is marked clean about 70% of the time. Waiting 10 seconds, the volume seems to always be dirty. I didn't test how long I would have to wait for the disk to be marked clean 100% of the time, but it is obviously more than 25 seconds. Maybe the Pi's Linux is flushing at a certain interval and I am just hitting different parts of the cycle.

If I detach the images before powering off the Pi, then it works 100% (but it did this before).

So it appears this "mmap" thing is still caching, and there still needs to be considerable time before powering off. At least the cache flush time is no longer infinite, as it was previously.

uweseimet commented 2 years ago

I am just wondering: Is there any documentation that says using mmap() means that there is no filesystem caching anymore? mmap() maps a file into memory, but when writing to the mapped memory region, are these changes persisted to the filesystem immediately? Isn't the purpose of mmap() just to make file contents accessible in memory, instead of having to use filesystem read/write operations?

uweseimet commented 2 years ago

@Pacjunk Can you run another test, provided that you can create the setup it would need? Please launch rascsi with a raw device file (e.g. an external USB drive or memory card) as the image file, e.g.

>rascsi -ID 0 -t schd /dev/sda

This reads/writes directly from/to the device, without any filesystem involvement. It would be interesting to know if this avoids any corruption.

Please use the develop branch for this.

uweseimet commented 2 years ago

@Pacjunk Another helpful test would be to create a filesystem on /dev/sda, mount it in synchronous mode and then use an image file on the mounted device. This should also avoid corruption.

Pacjunk commented 2 years ago

Done some reading. Found this...

One possible solution is, of course, to just run ext3. Another is to shorten the system's writeback time, which is stored in a couple of sysctl variables:

/proc/sys/vm/dirty_expire_centisecs
/proc/sys/vm/dirty_writeback_centisecs

The first of these variables (dirty_expire_centiseconds) controls how long written data can sit in the page cache before it's considered "expired" and queued to be written to disk; it defaults to 30 seconds. The value of dirty_writeback_centiseconds (5 seconds, default) controls how often the pdflush process wakes up to actually flush expired data to disk. Lowering these values will cause the system to flush data to disk more aggressively, with a cost in the form of reduced performance.

My testing was getting better as I approached the 30 second mark, so this makes sense. I will do some tests tomorrow night waiting over 30 seconds. If the above is a factor, I should have no issues. I can then play with those settings to see if things improve!

uweseimet commented 2 years ago

I ran two tests with the develop branch and an external SD card connected to the Pi's USB 3.0 port.

  1. With the raw device for this card: rascsi -id 3 -t schd /dev/sda1
  2. With a file on an ext3 filesystem on this card mounted with the sync option: rascsi -id 3 /mnt/test.hds

In both cases the read throughput was about 1150 KB/s. This was slightly faster than with an asynchronously mounted filesystem on the internal SD card, but more or less the same. This means that not having Linux cache anything (at least, based on the setup, in both cases there should not have been any Linux caching) did not make a relevant difference compared to the regular case with the Linux filesystem cache. I already tested some time ago that whether your image file is on an internal or external SD card does not make a difference, and from that I concluded that RaSCSI itself is the bottleneck.

@akuker Looks as if any other setup is faster than the mmap() approach. Let's see whether @Pacjunk can confirm that there is no corruption in the two setups above. The raw device setup is the only one without any filesystem overhead.

Pacjunk commented 2 years ago

OK, I set vm.dirty_expire_centisecs = 200 and vm.dirty_writeback_centisecs = 100.

I then wait about 3-4 seconds (300 centiseconds max from above, plus time to flush) after dismounting the disk from the OS before cutting the power. It came up clean every time, and the small file changes that I made were all present. I did some file copying, and while I didn't do timing before and after, it didn't "feel" any worse. Not sure whether reducing these values further will help; I need more time to test that!

So it looks like setting these parameters, along with disabling the RaSCSI cache is workable. As to whether mmap makes things better or worse - I'll leave that to the experts to decide!

uweseimet commented 2 years ago

@Pacjunk Which branch were you testing with?

Pacjunk commented 2 years ago

bug_335_cache_fix

uweseimet commented 2 years ago

Since according to my research mmap() does not disable any caching, can you please run the same tests with the develop branch? Please also remember to test on the raw device and with a synchronous mount, likewise with the develop branch.

Pacjunk commented 2 years ago

In develop, the RaSCSI cache is enabled. I know this has issues and does not get flushed on an interval, and I can't see how playing with raw devices is going to change this. Yes, mmap still does caching, but this can be controlled with the above parameters.

I may not be able to do any extensive testing until the weekend.

uweseimet commented 2 years ago

@Pacjunk Correct me if I am wrong: These parameters configure mmap, and not the filesystem?

Pacjunk commented 2 years ago

As far as I can see, it's for the filesystem as a whole. I don't see any mention of mmap.

uweseimet commented 2 years ago

@Pacjunk If it is filesystem-related, the regular develop branch should IMHO also work fine.

@akuker Am I missing something? Why would a filesystem setting only affect the mmap() branch, but not any other branch?

Pacjunk commented 2 years ago

Have a look at https://docs.microsoft.com/en-us/azure/azure-netapp-files/performance-linux-filesystem-cache

In the develop branch, RaSCSI has its own cache (on top of the OS one). This is not flushed on a time interval, which is what causes the original issue. Adjusting the filesystem parameters won't affect the behaviour of the internal RaSCSI cache.

akuker commented 2 years ago

Sorry for the delay in getting caught up on this thread. The "mmap" method should functionally work the same as the open/seek/read/write calls. The OS is just mapping the file into the virtual memory space.

When you're using the mmap implementation, RaSCSI will immediately write the information to the "file system". It's really writing to the filesystem cache, and the OS will eventually push it out to the physical device. This aligns with what @Pacjunk was seeing while tuning the /proc/sys/vm/dirty_* parameters. Even if we used the open/seek/read/write functions, I believe the behavior would be about the same; the OS is still going to cache the file reads and writes.
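
A minimal standalone sketch of that write path (the file name and sector contents are made up; this is not the RaSCSI code itself):

    #include <cstring>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main() {
        // "disk.hds" is an illustrative image file, assumed to exist and to
        // be at least one sector long.
        const int fd = open("disk.hds", O_RDWR);
        if (fd < 0) return 1;

        const size_t len = 512;  // one sector
        void *map = mmap(nullptr, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (map == MAP_FAILED) return 1;

        // This memcpy only dirties a page in the kernel's page cache; the
        // kernel writes it to the card later, on its own writeback schedule
        // (governed by the vm.dirty_* sysctls discussed above).
        unsigned char sector[512] = {};
        std::memcpy(map, sector, len);

        // Only an explicit msync() (or a sync/unmount) forces it out now.
        msync(map, len, MS_SYNC);

        munmap(map, len);
        close(fd);
        return 0;
    }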

Am I missing something? Why would a filesystem setting only affect the mmap() branch, but not any other branch?

This is because the default RaSCSI implementation will just leave the data in RAM until the disk is disconnected (the DiskCache::Save() function). So, in @Pacjunk's use case, at the moment the Pi loses power there are still up to 16 tracks in RAM. Since the drive isn't being disconnected, RaSCSI doesn't know to Save() the data.

Ultimately, the goal is to have a background task that writes the data to the disk during idle time. But... the corner conditions of getting this perfect scare me a little bit. It's going to take some thinking to make sure it works in all cases.

Regarding caching the entire file in RAM, an easier way to get a similar effect would be to make the size of the cache configurable.

An experiment I tried once upon a time was to save the disk image to /tmp, which (as I understand it) should be in RAM. There wasn't a noticeable performance increase. However, I'd welcome someone else to retry the experiment!

I'm going to run an experiment to see how bad the performance is when we run sync() after every write. I'd imagine the performance hit is SUBSTANTIAL, but we'll see. This will force the OS to write the data to the drive immediately.
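
The per-sector version of that experiment amounts to something like this (a sketch; fsync() on the image file descriptor is one way to force the write out, and the names here are placeholders):

    #include <sys/types.h>
    #include <unistd.h>

    // Placeholder write path: after every sector written to the image file,
    // block until the kernel has pushed it to the device.
    bool WriteSectorSync(int fd, off_t offset, const void *buf, size_t len) {
        if (pwrite(fd, buf, len, offset) != static_cast<ssize_t>(len)) {
            return false;
        }
        // This is the expensive part: every 512-byte write now waits for
        // the SD card.
        return fsync(fd) == 0;
    }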

akuker commented 2 years ago

Hey! I predicted something correctly! (Benchmark screenshot attached.)

This proves how important cache is!!

Pacjunk commented 2 years ago

Ultimately, the goal is to have a background task that writes the data to the disk during idle time. But... the corner conditions of getting this perfect scare me a little bit. It's going to take some thinking to make sure it works in all cases.

If doing it in RaSCSI is too hard (it sounds like it is), then I am quite happy with your bug_335_cache_fix combined with tuning the dirty_* parameters. The tricky bit would be making it so that this behaviour can be switched on and off easily.

Pacjunk commented 2 years ago

I'm going to run an experiment to see how bad the performance is when we run sync() after every write. I'd imagine the performance hit is SUBSTANTIAL, but we'll see. This will force the OS to write the data to the drive immediately.

Looks very low! Is the sync (I assume this is the filesystem sync, not the RaSCSI one) after a byte, a block, or a group of blocks? BlueSCSI doesn't perform that badly, and it syncs after each group of blocks (e.g. if the command is to write 20 blocks, then it syncs after the 20 blocks). I assume the SdFat library does very little caching, though.

akuker commented 2 years ago

This experiment syncs after every sector. I'm not sure why the performance is so much worse than BlueSCSI's. Something to think about using background brain clock cycles.

I think I'll spend some time merging bug_335_cache_fix into develop, but in a way that can be enabled/disabled. I'm not sure there is a lot of value in being able to control it per device; I think it could be a global flag.

As time goes on, someone could play with adding a background task to work with the RaSCSI RAM caching to make sure it gets flushed during idle cycles. But the current mmap approach is a huge improvement (if you can stand the minor performance hit).

Longer term, I want to do a hardware mod to add super-caps to the RaSCSI board. But, that's going to take some time, and doesn't help the hundreds of existing RaSCSI boards that are out there.

akuker commented 2 years ago

So, I think once this gets merged into develop, we can close this issue and open a new one for adding background flushing with the RAM cache.

Pacjunk commented 2 years ago

OK, I've done some tests by copying multiple 1 MB files from HDD to RaSCSI. (I don't have a fancy benchmark program!) All done on a Zero W (so slow to start with).

Release version: 323 KB/sec
bug_335_cache_fix: 181 KB/sec

I changed dirty_expire_centisecs from the default of 3000 to 200, and it made no difference. I still got 181 KB/sec.

So it looks like things go a lot faster with the built-in RaSCSI cache. I don't know why the mmap variation is so slow! Maybe the lack of memory on the Zero means it can't map the image very well. I wonder what performance I would get with just normal file I/O?

BTW, I ran the same test on BlueSCSI and got 338 KB/sec, so it is possible to get reasonable performance with little/no caching.