DirtyHairy / r77-firmware-ng

An updated firmware for the Retron 77

ROMs always end up freezing with the latest builds? #10

Closed divergentdeveloper closed 2 years ago

divergentdeveloper commented 2 years ago

Hi!

I've seen this reported on the AtariAge forum, and I can reproduce it 100% of the time on my Retron with the latest builds. The original Retron FW does not have this issue.

I've tried version 6.6, 6.52 and 6.51 and they all seem to have this problem. I've reflashed and tried all the tips I saw in the forum like disabling the time machine, etc, and I always get the same result.

It looks like it's working fine, and sometimes you can play for minutes without any issue, but every game I've tried ends up freezing.

The ROM I use for these tests is Ms. Pac-Man. I just let the attract mode play, and it always freezes up. I've seen it take as long as 13 minutes to freeze; sometimes it's only one minute. I've run the same ROM on the original FW for an hour without any freeze.

Could it be that this problem has been present for a while and no one noticed it? I've done all the checks on my side and I can reproduce this 100% of the time on the brand new Retron I opened yesterday.

I bought the R77 for this project so I don't mind doing tests or providing more info to solve this issue... and maybe I missed something in my tests and it's fine :)

akator70 commented 2 years ago

I bought a Retron in December and have played it for at least 20 hours, including Ms. Pac-Man for 10-15 minutes. I haven't had a single freeze or other issue in any game.

That you have tried several different versions makes me wonder if your Retron isn't defective. The first one I received wouldn't boot at all and had to be exchanged. The replacement is the one that has worked perfectly.

Have you tried different micro SD cards? I've found lots of issues with other systems that rely on micro SD, often changing to a different card fixes whatever problem I'm having.

EDIT: I've been using Stella 6.6. I haven't had any problems with paddles, either.

divergentdeveloper commented 2 years ago

I've been wondering as well... it could be that it's defective, but since I see many others reporting the same issues, I wasn't sure. And I've played Ms. Pac-Man for 10-15 minutes with no issues either... I can play it for a while without noticing, since games usually don't last longer than a few minutes. But over 20 hours, I would expect you'd have seen at least a few freezes.

And the other thing that put some doubt in my mind, is that the original FW runs without any issues, except that it's a.... very limiting UI :)

I tried with overclocking turned off via the "DONT_OVERCLOCK" option and it still crashes, and what I saw didn't look like any frame drops I'd ever seen... it was a mess on screen. The option was described as something to try for "buggy" hardware, which is possible in my case. It's just that the original FW runs so well with no issues.

I just bought a Harmony cart because it would circumvent the bad UI of the original FW, and it will still work if I end up getting a real 2600 should the R77 not work out.

I got this Retron from eBay, so I can't return it the way I could with Amazon if I think it's defective... especially since it's not really defective: only the custom FW has issues with it, and it's hard to prove that the HW is the problem.

Oh and I am swapping between 3 SD cards of different brands. Same issues.

akator70 commented 2 years ago

Judging from the AtariAge threads on the subject and all of the issues people have had, I don't think your experience is unusual. My experience of it working perfectly may be the outlier.

My R77 works better than my MiSTer running the 7800/2600 core and had become my default choice for 2600. The MiSTer has more compatibility issues than my R77, the frequent video timing issues trip up the MiSTer's HDMI output, and last I checked the MiSTer 7800/2600 core isn't really friendly with original paddles, even with USB adapters that work on everything else perfectly.

divergentdeveloper commented 2 years ago

Yeah, that's the experience I heard about that got me to get it. :) It's only now that I have it that I see all of these folks having issues... The reports are very consistent though, so that's why I was thinking it might be an issue with the latest versions, or that there are different versions of the HW...

I'll try games with the stock FW and see if everything is ok. Stella 6.6 is much better, but that would be an OK compromise for me and kind of salvage my investment. :)

thrust26 commented 2 years ago

We think it might be a problem with the CPU. We overclocked it a bit to have no lags with modern ARM based games. The overclocking is still within the specs, but maybe there are CPUs with worse quality.

Therefore the next release (6.7) will have an option to disable the overclocking. Maybe that solves your problem.

DirtyHairy commented 2 years ago

The option is already there 😏 --- that's DONT_OVERCLOCK. I have never been able to reproduce those crashes, but I'll give it a try. However, as it is a spurious issue that affects only some people, I suspect that it is hardware related. The new firmware uses considerably more resources and more memory.

One thing worth trying might be a different power supply.

thrust26 commented 2 years ago

@divergentdeveloper Did you already try DONT_OVERCLOCK?

divergentdeveloper commented 2 years ago

Yes, I tried DONT_OVERCLOCK. I think I saw dropped frames (Ms. Pac-Man sprites didn't get cleared and ended up drawing on top of each other) and it still ended up crashing.

It's a good idea to include it as a setting in the UI :) It will be way easier/faster for me when testing...

I suspect that even 1 GHz "running hot" is too much for the flimsy R77s. I just saw that there's only one vent on the bottom of the thing, so that probably doesn't help either.

I've seen the stock FW run nicely without any issues for a long time, so what I want to try next is to disable all the extra filters, time machine, etc that the new Stella offers and try to match the config of their original FW, and see if it doesn't crash.

thrust26 commented 2 years ago

Did the DONT_OVERCLOCK extend the time until a crash? Regarding the ventilation, have you tried to run it upside down?

I wonder if the chip shortage caused Hyperkin to use 2nd grade chips. IIRC we had no reports earlier on. When is your console from? Is it an orange one maybe?

divergentdeveloper commented 2 years ago

It didn't seem to extend the time before the crash, but that was already varying between 1 minute and 14 minutes so it's hard to tell.. It crashed after 2 and a half minutes, so it could have been a 1 minute crash that took 2 minutes with DONT_OVERCLOCK. :)

I haven't tried to run it upside down but that's definitely on my list since I saw that the vent was on the bottom :) Maybe I'll try with a fan pushing air down the vent too. I'll do some tests and report back here with the results.

I have a black one, and I don't know how long it was in stock... I think you might be right about the 2nd grade chips, and it could be that they thought this was fine to put into production, because you don't see the crashes with the original FW. And I think they stopped producing them when they realized this :)

divergentdeveloper commented 2 years ago

Oh I just saw @DirtyHairy's note: Yes, I did try to switch power supplies.

I'm pretty sure too that it's hardware related.

divergentdeveloper commented 2 years ago

[Attached image: DONTOVERCLOCK]

I did some testing with the unit upside down, with a fan, without all the tv effects, phosphor, etc. and it still crashes.

I tried with DONT_OVERCLOCK (see attached pic, I hope I did this right) and I still get the same result.

Sadly, it really looks like a HW issue. These batches of R77s don't seem to be able to run this FW. It's really puzzling to me since the stock FW with Stella 3.5.2 runs without any issues, but there is probably more going on with v6.6 than I can see on screen...

Unless anyone has any other ideas to try, I think I'll be stuck with the stock FW and wait for my Harmony cart, unless I can find another R77 that doesn't have this issue... but that seems unlikely/difficult, as they seem to have stopped producing them.

thrust26 commented 2 years ago

First you might want to try to switch the renderer to "Software" (you have to switch to advanced settings to do so). That disables using the GPU's hardware acceleration.

If that still doesn't help, you could try one of the first community editions (somewhere on AtariAge), which was based on an older Stella version (3.51 IIRC). This should emulate identically to the stock version.

sa666666 commented 2 years ago

The very last community edition based on the old Stella was 3.9.3. So that's the one to look for.

divergentdeveloper commented 2 years ago

Good idea! Stock probably doesn't use OpenGLES, I will try that...

Oh I was looking for that! I thought I had seen it but couldn't find it on Github so I thought I had imagined it... I will definitely look for it as it would give me the nice UI of the CFW

divergentdeveloper commented 2 years ago

@sa666666 thanks so much! I will hunt this down

divergentdeveloper commented 2 years ago

Seems that the OP removed the links when the new version was released, I don't see any links available to get it now... :(

thrust26 commented 2 years ago

Wayback Machine to the rescue: https://www.dropbox.com/s/q2965rrzpo0jq2e/sdcard.remo.20181120-1353CB.zip?dl=0

I think that is the latest version. You have to search for the old AtariAge links, before the latest migration. http://atariage.com/forums/topic/281462-retron-77-community-build-image/

divergentdeveloper commented 2 years ago

@thrust26 niiice! many thanks I had not found that one! It's probably v.3.9.3 that I was looking for.

I also found "sdcard.remo.20190119-1727.test" in the attachments of the thread, which was v3.9.4 and I got to test it during lunch and.... no crash! It ran for an hour without any issues. I will test this more extensively but so far this is great news for me :)

I will also try 6.6 with the software rendering like you suggested.

divergentdeveloper commented 2 years ago

Crashes on 6.6 with software rendering and DONT_OVERCLOCK

DirtyHairy commented 2 years ago

How adventurous are you feeling? In order to drill down to the source of those crashes we'd need to run Stella (possibly built with debug symbols) from the command line and capture the output. This requires either a supported ethernet dongle and an SSH connection, or a serial connection. The serial connection is arguably easier to set up, but it requires soldering a few wires to unused pads on the R77 board and a UART-to-USB dongle (a few bucks on Amazon). If you want to go either way I'll be happy to assist you.

divergentdeveloper commented 2 years ago

If time is not an issue, I might be adventurous enough for either option :)

I might actually have one of those ethernet dongles, just saw the info on which ones would work and how to try it. I'll give it a go this weekend...

Also, reading the documentation, it seems that DONT_OVERCLOCK is for developer mode only. I don't think I had the console in that mode so I'll also try that again.

divergentdeveloper commented 2 years ago

It took 17 minutes for Ms. Pac-Man to crash, but this time, with developer mode enabled, it went back to the launcher. The launcher is operational afterwards... Very interesting! I've seen hundreds of hangs so far, but it never made it back to the launcher before.

thrust26 commented 2 years ago

That was the latest version with "Software" renderer, right?

DirtyHairy commented 2 years ago

Interesting indeed, but as I said, guesswork is not gonna take us anywhere.

@divergentdeveloper If you would be able to get a shell on the device this would be great. Ethernet and SSH are more hassle to set up, but you can use scp to transfer files once you've got it working (i.e. to copy and run a debug build of Stella). Serial is easier to set up if you are comfortable with modifying your device, but you can't transfer files over the serial connection (easily). You can find instructions for accessing the UART here: https://github.com/stella-emu/stella/wiki/Retron-77

divergentdeveloper commented 2 years ago

@thrust26 Exactly, with time machine off, all TV effects off... I could see a lot of frames dropped, like the machine was struggling to keep up.

@DirtyHairy Yes, that's my next step :) I just wanted to make sure I tried DONT_OVERCLOCK correctly and still got a crash.

My USB to UART dongle is arriving tomorrow, I'll let you know when I have a shell going. It seems the easiest option and not too far above my skill level... I'm guessing I can just swap the SD card back and forth to transfer files when needed? That's not too much trouble.

thrust26 commented 2 years ago

Still speculating, but that doesn't sound good. Ms. Pac Man doesn't require that much CPU performance, it should run well at 1 GHz. To me it seems like the CPU is already overheating and maybe throttling (but then it should not crash).

@DirtyHairy Do you know if the CPU is permanently running at the given frequency or if it uses DVFS?

divergentdeveloper commented 2 years ago

@thrust26 It doesn't look good either, if you want to see. I made a little 3.5.4 vs 6.6 comparison video: https://www.youtube.com/watch?v=5ABeCHBX6OM

It seems to run slower as well.

DirtyHairy commented 2 years ago

> @DirtyHairy Do you know if the CPU is permanently running at the given frequency or if it uses DVFS?

No, there is no governor, the CPU runs permanently at the configured speed. If it throttles then this must be the chip itself.
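
A minimal sketch of how this could be checked from a shell on the console, assuming the kernel exposes the standard Linux cpufreq sysfs files. That is an assumption: with no governor configured, the directory may simply not exist on the R77. `show_cpufreq` is a hypothetical helper name, and `scaling_cur_freq` reports kHz.

```shell
#!/bin/sh
# show_cpufreq [SYSFS_CPU_DIR]
# Prints the current governor and frequency (in kHz) if the kernel
# exposes cpufreq for cpu0. Hypothetical helper, not part of the firmware.
show_cpufreq() {
    dir=${1:-/sys/devices/system/cpu/cpu0/cpufreq}
    if [ ! -d "$dir" ]; then
        echo "cpufreq not exposed at $dir"
        return 1
    fi
    for f in scaling_governor scaling_cur_freq; do
        if [ -r "$dir/$f" ]; then
            echo "$f: $(cat "$dir/$f")"
        fi
    done
}
```

Calling `show_cpufreq` with no argument reads the default sysfs path; a non-zero return only means the interface is absent, which would be consistent with a fixed clock but does not prove it.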

DirtyHairy commented 2 years ago

> My USB to UART dongle is arriving tomorrow, I'll let you know when I have a shell going. It seems the easiest option and not too above my skill level.. I'm guessing I can just swap the SD card back and forth to transfer files when needed? That's not too much trouble.

Yep, that will work. Thanks a lot!

thrust26 commented 2 years ago

> > @DirtyHairy Do you know if the CPU is permanently running at the given frequency or if it uses DVFS?
>
> No, there is no governor, the CPU runs permanently at the configured speed. If it throttles then this must be the chip itself.

That doesn't leave many options, does it? With constant frequency, the CPU is not stressed more in Stella 6.x than in 3.x, and hardware acceleration is not used in either, so it's not the GPU either. What's left? RAM?

divergentdeveloper commented 2 years ago

Got a USB2UART device that had an issue, so I spent most of my time today debugging that! Now that it works I still don't have a shell connection, but it's probably my bad soldering using the included wires.

I'll clean up and try again tomorrow, I've got some better wires coming in as well.

DirtyHairy commented 2 years ago

Keep in mind that you need to cross RX and TX, i.e. RX goes to TX and vice versa.

divergentdeveloper commented 2 years ago

[Attached image: shell]

Yes! That was it! Seems my soldering was fine :) I had completely forgotten about this...

Many many thanks! I'm ready for the next steps then :)

DirtyHairy commented 2 years ago

Nice 😏

Now, how to proceed. When the R77 starts up it launches a dumper process, and this process in turn spawns stella. So, what we need to do is kill the dumper and stella. After this is done we are free to start stella ourselves on the terminal and observe its stdout and stderr while it runs and crashes.

First do a

# ps aux

The process list should include stella and the dumper. After that, do

# killall -9 dumper
# killall -9 stella

It is fine if the second command fails, the child process should die with dumper anyway, I just added the second command to be 100% sure. Check the process list again; the two processes should be gone now. At this point you can launch stella manually by doing

# stella /mnt/path/to/rom

Note that the SD card is mounted on /mnt, so /path/to/rom refers to the path to the (Ms. Pac-Man) ROM on your SD card. This will start stella and launch the ROM. After stella has crashed, the first thing is to check whether Linux is still running, i.e. whether you can still type commands. If it is, please paste the output of Stella here and also paste the output of doing

# dmesg

If Linux itself has crashed, well, that's information, too 😛 Thank you again for your help.
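
The manual steps above could also be wrapped in a small helper so that Stella's output and exit status end up in a file on the SD card rather than only on the terminal. This is a sketch under assumptions: `run_logged` and the log file names are made up for illustration, and it assumes the shell on the console is a POSIX `sh`.

```shell
#!/bin/sh
# run_logged LOGFILE COMMAND [ARGS...]
# Runs COMMAND, writes its stdout and stderr to LOGFILE, and appends the
# exit status so a crash can be told apart from a clean exit afterwards.
# Hypothetical helper, not part of the firmware.
run_logged() {
    log=$1
    shift
    status=0
    "$@" >"$log" 2>&1 || status=$?
    echo "exit status: $status" >>"$log"
    return "$status"
}

# Assumed usage on the console, after killing dumper and stella as above:
#   run_logged /mnt/stella-run.log stella /mnt/path/to/rom
#   dmesg >/mnt/dmesg-after-crash.txt
```

Messages such as "Aborted" or "Segmentation fault" are typically printed by the shell itself when the child is killed by a signal, so they will not appear in the log; the appended exit status is what preserves that information.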

divergentdeveloper commented 2 years ago

Success!

When it crashed the first time, with software rendering, no overclock:

malloc_consolidate(): unaligned fastbin chunk detected
Aborted

I also tried having it crash with OpenGLES right after, no overclock, and got: Segmentation fault

I've got the output of dmesg in a text file here: dmesg.txt
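
For reference, the two messages above correspond to two different signals: "Aborted" is SIGABRT (here raised after the C library's heap check failed) and "Segmentation fault" is SIGSEGV. When only the exit status of the process is available, a status above 128 conventionally means "killed by signal (status - 128)". A minimal sketch of that mapping, with `describe_status` as a hypothetical helper name:

```shell
#!/bin/sh
# describe_status STATUS
# Maps a shell exit status to a short description. Statuses above 128
# conventionally mean "killed by signal (STATUS - 128)".
describe_status() {
    case $1 in
        0)   echo "clean exit" ;;
        134) echo "SIGABRT (signal 6)" ;;   # shell prints "Aborted"
        139) echo "SIGSEGV (signal 11)" ;;  # shell prints "Segmentation fault"
        *)
            if [ "$1" -gt 128 ]; then
                echo "killed by signal $(( $1 - 128 ))"
            else
                echo "exited with status $1"
            fi
            ;;
    esac
}
```

Running `stella /mnt/path/to/rom; describe_status $?` would then label each crash without relying on the terminal message.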

DirtyHairy commented 2 years ago

Thanks a lot. Nothing interesting in that dmesg. The two crashes hint at memory corruption. This may be caused by either bad hardware or a bug somewhere in the stack. Let me prepare a debug build that will give a readable backtrace. In the meantime, could you retry a few more times and check how the error message fluctuates? I honestly don't think this has anything to do with software vs. hardware rendering.

divergentdeveloper commented 2 years ago

Just got a new one with a lot more meat, here it is attached. I'll try to capture a few here today while working, and post them here if they are new and interesting.

I agree with the software vs hardware, it's just my habit of mentioning what config I changed in the tickets. :) I put back the original config with OpenGLES, TV effects and everything since it crashes more frequently this way. I'll probably put overclock back since I've had half-hour runs without crashing and that's not what we're looking for ;)

crash.txt

divergentdeveloper commented 2 years ago

Another interesting one: it didn't crash and it's still running, but I got warnings that end with "[ 1184.929432] Fixing recursive fault but reboot is needed!"

warning.txt

divergentdeveloper commented 2 years ago

crash2.txt

and it crashed shortly after :)

divergentdeveloper commented 2 years ago

This third crash is almost identical to the first one (crash.txt).

crash3.txt
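
Kernel log lines start with a timestamp such as "[ 1184.929432]", which differs on every run, so two otherwise identical crash logs never compare equal as-is. A minimal sketch for comparing saved logs on a PC with the timestamps stripped; `strip_ts` and `compare_crashes` are hypothetical helper names, and addresses or register values in the logs will still show up as differences.

```shell
#!/bin/sh
# strip_ts: remove the leading "[ seconds.micros] " timestamp from each
# line read on stdin.
strip_ts() {
    sed 's/^\[ *[0-9]*\.[0-9]*] //'
}

# compare_crashes FILE_A FILE_B
# Prints the differences between two logs, ignoring timestamps.
# Returns 0 if they are identical apart from the timestamps.
compare_crashes() {
    a=$(mktemp)
    b=$(mktemp)
    strip_ts <"$1" >"$a"
    strip_ts <"$2" >"$b"
    rc=0
    diff "$a" "$b" || rc=$?
    rm -f "$a" "$b"
    return "$rc"
}

# Example: compare_crashes crash.txt crash3.txt
```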

thrust26 commented 2 years ago

crash2 lists a "hard LOCKUP on cpu 0". How can that happen?

divergentdeveloper commented 2 years ago

A few more:

crash4.txt crash5.txt crash6.txt crash7.txt

thrust26 commented 2 years ago

Another very vague idea: Can you try a different, stronger power supply? What is the one you are using rated at?

Or maybe @DirtyHairy already knows what is going on.

@DirtyHairy Do we know for sure that the original firmware is running at 1GHz?

divergentdeveloper commented 2 years ago

@thrust26 Sure. I've tried 3 so far but I don't have logs of those crashes. I can do a test run with my best power supply and see the result :) The current one is a 2.1A generic adapter.

thrust26 commented 2 years ago

> @thrust26 Sure. I've tried 3 so far but I don't have logs of those crashes. I can do a test run my best power supply and see the result :) Current one is a 2.1A generic adapter.

Thanks, but if you already have tested multiple adapters, I am pretty sure my idea is wrong.

DirtyHairy commented 2 years ago

Thanks a lot! I am afraid this is pretty conclusive, no need for running a debug build: this is either a kernel bug or faulty hardware. None of these errors in dmesg can be caused by userspace alone, and this rules out memory corruption in Stella. As only some consoles are affected I am 99% positive that hardware is the issue, probably SDRAM.

I'll try to build a version that clocks RAM at 480 MHz (instead of 624 MHz) to see whether this works any better.

> @DirtyHairy Do we know for sure that the original firmware is running at 1GHz?

Yes 😏 Besides, that option does not set the clock to 1.2 GHz explicitly, but just keeps it the way it was at boot.
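
The point made earlier in this comment, that these errors come from the kernel rather than from Stella, can be checked against any saved dmesg dump by searching for kernel-level fault markers. A minimal sketch: the first two patterns are the ones quoted in this thread, while "Oops" and "BUG:" are common Linux kernel markers added here as an assumption. `kernel_fault_lines` is a hypothetical helper name.

```shell
#!/bin/sh
# kernel_fault_lines FILE
# Prints the lines of a saved dmesg dump that contain kernel-level fault
# markers. Returns non-zero if none are found (grep's usual behaviour).
kernel_fault_lines() {
    grep -E 'hard LOCKUP|Fixing recursive fault|Oops|BUG:' "$1"
}

# Example: kernel_fault_lines crash2.txt
```

A userspace-only crash would normally leave no such lines in the dump, which is why their presence points below Stella in the stack.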

thrust26 commented 2 years ago

> I'll try to build a version that clocks RAM at 480 MHz (instead of 624 MHz) to see whether this works any better.

Did you increase the RAM speed too? Else the original firmware should have similar problems, no?

DirtyHairy commented 2 years ago

> Did you increase the RAM speed too? Else the original firmware should have similar problems, no?

No, I think this is bad hardware. Maybe they changed the RAM chips. It is very possible that the new firmware uses more RAM bandwidth, and maybe this exposes the issue.

@divergentdeveloper I have a version of the bootloader that reduces the DRAM clock to 480 MHz. Do you have access to a linux machine and feel confident enough to write it to the SD card with dd (I'll give you the specifics), or should I prepare a full SD card image?

divergentdeveloper commented 2 years ago

@DirtyHairy I don't have a Linux machine handy, but it wouldn't be a problem if I did :)

If you've got the setup to prep the SD image and it's not too much trouble, I think that'd be the easiest/fastest... If not, I can get a VM up and running later this week and do the copy.

Unless you know of a Windows app that lets you browse and write to a Linux FS? I used one today for something, but it's read-only, and the other one I saw was commercial.