PiSCSI / piscsi

PiSCSI allows a Raspberry Pi to function as emulated SCSI devices (hard disk, CD-ROM, and others) for vintage SCSI-based computers and devices. This is a fork of the RaSCSI project by GIMONS.
https://piscsi.org
BSD 3-Clause "New" or "Revised" License
545 stars 82 forks source link

Verify (and remove) delays that are not accounted for by the SCSI specification #1287

Closed uweseimet closed 1 year ago

uweseimet commented 1 year ago

The existing code contains delays that are suspicious. It shall be verified which of these are needed and why. These and similar timing issues might be the reason for piscsi failing with some setups (see #1098) and the tickets where bus phases are aborted and/or not enough bytes transferred. In case a delay cannot be avoided it shall be documented in the sources why this delay is needed and how this has been tested. The branch used for this ticket is "testing_issue_1287". When testing ensure that you have a backup of your disk image file.

Delays to be addressed:

Details on 2.:

The issues_1263_1278_1283 and issue_1295 branches currently do not contain delays 2. and 3. and have successfully been tested against a second PiSCSI board running the same branch, a hard drive, a card reader, a streamer (each time in initiator mode using scsidump) and an Atari TT (in target mode). The speed gain in target mode is a bit less than 1% with a Pi 4.

uweseimet commented 1 year ago

@benjamink This is the ticket for testing delays, e.g. for the daynaport emulation. If you would like to help, please check out the testing_issue_1287 branch. In the current code of this branch there is no delay. You already tested this with your SE and it failed. I would be interested in the log with trace:6. You already know the procedure ;-). @akuker Thank you for pointing out the potential issue with slow Macs. Upcoming PRs will add a more detailed comment on this.

uweseimet commented 1 year ago

@benjamink As soon as you have confirmed that the code in this ticket, branch testing_issue_1287, does not work (because it contains the offending delay change) I will revert this change (change 1. in the list above) and will apply change 2 instead. Testing should be done with your SE for now. No need to add a logfile or screenshot unless I explicitly ask for it. Note that these changes do not target the upcoming release but the one after. So no need to spend the whole night with testing ;-).

benjamink commented 1 year ago

First test fails (as expected). piscsi crashes & the Mac shows the following on the screen:

image

I will be labeling any log files with the commit hash being tested from now on:

piscsi-issue1287-branch-c61b5c003c84f0778a8d6fb4f8d9f5d8e5e67321.log

uweseimet commented 1 year ago

@benjamink No screenshots, please. They make this ticket harder to read and usually do not provide any important information for this ticket. From the log you added I do not see that piscsi was crashing. Did the console report something like "Segmentation fault"?

uweseimet commented 1 year ago

@benjamink Regardless of the answer to my previous question (we definitely need to know whether there was an actual crash and piscsi terminated with an error message) I have already committed the next change set: I have reverted the delay change. We need to make this test in order to ensure that we have a clean starting point for any further change.

benjamink commented 1 year ago

@uweseimet piscsi didn't error out, it just froze. Want me to run with tracing or debug on for the HDD device as well & see what happens? I only had tracing on for the Daynaport device.

benjamink commented 1 year ago

@uweseimet I might have some contention going on here with another device. Let me remove everything else & try again. I'll get back to you.

uweseimet commented 1 year ago

@benjamink OK. Please note that the current branch for this ticket is identical with what you successfully tested on your SE yesterday. So this should definitely be working.

benjamink commented 1 year ago

Confirmed a4b523b9163689ed3f018579b04c78143b1653d6 works

uweseimet commented 1 year ago

@benjamink OK, so we have a clean starting point. I will need some time to prepare the next change set, because I will first test it with my Atari. It would not make sense if you were testing something that does not even work with my hardware ;-).

uweseimet commented 1 year ago

@benjamink The next change set is available, tested with my Atari. Before you start testing ensure that you have a backup of the disk image file(s) you attach. The changes affect everything, not just the daynaport. If possibe, do not use the daynaport emulation at all but just the hard drive emulation. In case of an issue please attach a log (no screenshot, but just a note what happens, e.g. Mac freezes or Mac reports an error). Limit logging to your hard drive, e.g. trace:0, depending on the SCSI ID you use for the drive. And test with your SE only. My Atari is in a different room, by the way, but I don't need to climb any stairs to get there ;-).

benjamink commented 1 year ago

Commit 0db39ea19a0a96b60f3d1669946e63288a7da9f2 worked fine. I ran with this command so it was only the disk ($HASH is just the current git hash - this command is in a small script I wrote):

sudo "bin/fullspec-1287-debug/piscsi-${HASH}" -L trace:1 -id1 /home/pi/images/test-HD0-BK_SYS_7.1.hda
uweseimet commented 1 year ago

@benjamink Sounds good! What you have just tested is item 2. in my list. I just updated the change set to also include item 3. Please re-test. Running "make" without "make clean" after updating is fine for this ticket.

benjamink commented 1 year ago

Commit 70b6e69e52379a8d69e169f42cdf3937864d5c54 worked fine (again with just disk attached)

uweseimet commented 1 year ago

@benjamink Thank you. Can you please test the same binary also with the daynaport? Please confirm that you are using an optimized build, with "-O3" being displayed when building.

benjamink commented 1 year ago

Looks like they're optimized. Example:

g++ -O3 -Wall -Werror -Wextra -DNDEBUG -Wno-psabi -std=c++20 -iquote . -D_FILE_OFFSET_BITS=64 -DFMT_HEADER_ONLY -DSPDLOG_FMT_EXTERNAL -MD -MP  -DCONNECT_TYPE_FULLSPEC -c generated/piscsi_interface.pb.cpp -o obj/fullspec/piscsi_interface.pb.o
uweseimet commented 1 year ago

Yes, they are optimized. I have already pushed a new change set for testing with both disk and daynaport, but before you update please test your current binary also with the daynapor.t

benjamink commented 1 year ago

Everything including networking worked fine with commit 70b6e69e52379a8d69e169f42cdf3937864d5c54

uweseimet commented 1 year ago

@benjamink OK, please proceed with the latest change set. This one I expect not to work with the daynaport, but we'll see.

benjamink commented 1 year ago

Commit 50278522c1288431d6863384bfff974d0e512b23

uweseimet commented 1 year ago

@benjamink Thanks, just as expected, but this needed verification. The next change set will take some time to prepare from a coding perspective, and I will also need to test it with my hardware.

benjamink commented 1 year ago

Ok, no problem. I can test anything you need today (I'm in US Eastern Time) but I'll be on holiday from tomorrow until Tuesday.

uweseimet commented 1 year ago

@benjamink The next change set is available and I have tested it with my hardware. Please also test, disk and daynaport. Please also write data to the disk and verify that the operation was successful.

benjamink commented 1 year ago

Commit fc35587f0b4033d849cf25af80fece205b2ff6d9 worked both with & without Daynaport attached. I was able to read/write a text file in both scenarios & browse the Internet when Daynaport was attached.

uweseimet commented 1 year ago

@benjamink Very good! I just committed yet another change set, related to item 5. on the list. This change only affects the daynaport.

benjamink commented 1 year ago

Commit e829b2cd30a4e7916c5ca8c1b46f71f231b1c894 froze everything again including the Raspberry Pi (or at least its networking).

piscsi-issue1287-branch-e829b2cd30a4e7916c5ca8c1b46f71f231b1c894.log

benjamink commented 1 year ago

These are the last logs I have then me rebooting the Raspberry Pi. Looks like it took networking offline:

Nov  9 04:36:04 piscsi-perf475 kernel: [20810.284850] piscsi_bridge: port 2(piscsi0) entered disabled state
Nov  9 04:36:55 piscsi-perf475 kernel: [20861.356532] device piscsi0 left promiscuous mode
Nov  9 04:36:55 piscsi-perf475 kernel: [20861.356613] piscsi_bridge: port 2(piscsi0) entered disabled state
Nov  9 04:37:33 piscsi-perf475 kernel: [    0.000000] Booting Linux on physical CPU 0x0000000000 [0x410fd083]
Nov  9 04:37:33 piscsi-perf475 kernel: [    0.000000] Linux version 6.1.21-v8+ (dom@buildbot) (aarch64-linux-gnu-gcc-8 (Ubuntu/Linaro 8.4.0-3ubu
ntu1) 8.4.0, GNU ld (GNU Binutils for Ubuntu) 2.34) #1642 SMP PREEMPT Mon Apr  3 17:24:16 BST 2023
uweseimet commented 1 year ago

@benjamink Can you please try again with the update I just pushed? No need for a log if the system freezes again.

benjamink commented 1 year ago

Froze again - 07f9383bffd9206dd8748e007cbc7b90a1ddb465

uweseimet commented 1 year ago

@benjamink OK, this means that items 1. and 5. (both are related) need further investigation. But not now, and not before your vacation ;-). Thank you once again for spending your time on this analysis. Enjoy your vacation, see you next week!

benjamink commented 1 year ago

@uweseimet Thank you! Looking forward to you having everything 100% fixed on Tuesday 😄

uweseimet commented 1 year ago

@benjamink I hope you enjoyed your vacation. I'm afraid not everything is fixed ;-). But there is an updated testing_issue_1287 branch at least, potentially affecting both the DaynaPort and hard drives. After testing, please just provide your verdict. No screenshots or logs, in order to keep this ticket clear. Thank you!

benjamink commented 1 year ago

@uweseimet thanks! I gave 2f4354095df3577618257dae0e00e48ef6575bed a test. It works fine with HDD only but it freezes on boot when I include Daynaport. Same thing as before, just freezes, no crash or error.

uweseimet commented 1 year ago

@benjamink Thanks. Any change with my latest commit?

benjamink commented 1 year ago

Yes, looks like 785b0b30e6cd93ac946392f1a50201cff90d7c60 fixes it. Networking is working again.

uweseimet commented 1 year ago

@benjamink This is excellent news! It means that we do not need the proprietary Pi hardware timer to get the work-around needed for the DaynaPort right.

uweseimet commented 1 year ago

@benjamink Can you please test the new "improve_daynaport_delay" branch with the DaynaPort? This branch is identical with the current develop branch, except for exactly the change you just tested. We should double-check.

benjamink commented 1 year ago

Confirmed commit 7b8ae25005ce0f5ead725010c25163ad761d1916 in the improve_daynaport_delay branch works fine with Daynaport.

uweseimet commented 1 year ago

@benjamink I'm afraid I have to ask you for yet another test with the improve_daynaport_delay branch, after updating. Our code analysis tool complained.

benjamink commented 1 year ago

@benjamink I'm afraid I have to ask you for yet another test with the improve_daynaport_delay branch, after updating. Our code analysis tool complained.

No problem at all. I have the machine setup next to my desk so I can run tests easily whenver you need them. I'll update & try again now.

benjamink commented 1 year ago

Commit 3bf0a06ead38bae9516a50cf4ba783bb1db48762 still works with Daynaport

uweseimet commented 1 year ago

@benjamink Please note that further testing should happen again on the testing_issue_1287 branch. There is a new changeset, with the latest commits to the develop branch included. Note that the binaries are no longer in cpp/fullspec/bin but in cpp/bin. Nothing should have changed regarding the daynaport.

benjamink commented 1 year ago

Commit 22dfa0c1958734ed65516bb9418d7b319ebd7bc3 of the testing_issue_1287 works fine with networking enabled.

uweseimet commented 1 year ago

@benjamink There is a new branch scsi_compliant_handshake. This branch is identical with the current develop branch except for changes covering item 2. of my list, related to the SCSI handshake. You have already tested these changes, but before creating a PR, please test scsi_compliant_handshake (both hard drive and daynaport), to ensure that everything works before a potential merge. Thank you!

benjamink commented 1 year ago

Worked fine with networking enabled.

Branch: scsi_compliant_handshake - Hash: 08e0f447b4da516342f0cb1a9ba9fa7069d94688

uweseimet commented 1 year ago

@benjamink There is a new branch testing_issue_1287_2, which is identical with the current develop branch except for one kind of change. When you have time, I would appreciate a test with hard drive and daynaport.

benjamink commented 1 year ago

@uweseimet worked fine w/ Daynaport enabled

Branch: testing_issue_1287_2 - Hash: f02d0d95b52b255596a71ee27539458e4460f7ad

rdmark commented 1 year ago

I recommend asking for testing on more platforms in Discord. We have users on f.e. DEC workstations, RISCOS, Apple II etc. Sampler devices may be impacted as well as they are notoriously quirky.

If the suspicious delays originate in the RaSCSI codebase, they may have been workarounds for NEC PC-98, X68k or MSX2 quirks.

Is there anything concrete that we gain in compatibility or performance by making these low-level changes? I'm thinking of the tradeoff between standards compliance and the risk of breaking something for the broader user base.

uweseimet commented 1 year ago

@rdmark Please see my notes in the related ticket: Why are the data signal lines of the PI considered unstable, but all the other lines are considered stable? Are the 8 lines used for the data byte special (less reliable)? This IMO does not make any sense from a technical perspective, does it? This is why I claim that just the fact that the RaSCSI/PiSCSI hardware/software works proves that this change is justified. The original comment implies that this delay is needed because of the Pi, not because of the connected computers/samplers. The handshake with ACK/REQ is what ensures the data are correctly transferred, not timings. Where you do need a precise timing is synchronous SCSI, because you do not handshake there. But piscsi is asynchronous.

Regarding the original RaSCSI codebase: Some drivers/computers/workstations that work today did not work at all with it ;-).

In general, I am increasingly interested in feedback from users with exotic platforms. I now know much more about the low level bus code than before and this might help me with addressing these issues, as long as the users reporting them can reproduce them and are willing to compile test branches.

This change just gives a1% performance improvement, but a very important benefit is that one can completely get rid of using the system timer. This makes porting PiSCSI to other hardware without this time (maybe even the Pi 5) much easier. And as far as I can tell this was also one of the obstacles when attempting to port to the Banana Pi. And it reduces compile times ;-).

What I suggest is to introduce a switch (macro) to enable/disable this change. For the develop branch the change should be enabled (so that anybody can easily test it and is testing it by default), but by enabling the macro in config.h you can disable it. One might just switch back with the next release, or maybe not.

By the way, remember that there is at least one platform which does not work because of delays not covered by the SCSI specs, see https://github.com/PiSCSI/piscsi/issues/1098.

Important: Even though my words above may not reflect this I fully understand your concerns. But the argument with only the data pins being considered instable and the others not is IMO a very strong one. I claim that this particular delay does not improve compatibility, but it does just the opposite. Currently there is nothing that proves this claim wrong.

uweseimet commented 1 year ago

@rdmark I added the switch mentioned above to the scsi_compliant_handshake branch. This switch also controls another related timing change, which has also already been tested by @benjamink and me. With this switch enabled the proprietary Pi system timer is not used anymore in the SCSI controller. The only remaining use (waiting for a 3 s timeout in scsidump) is going to be replaced in one of the scsidump tickets in progress by a standard C++ timer.