MiSTer-devel / Main_MiSTer

Main MiSTer binary and Wiki
GNU General Public License v3.0

Display interference that occurs on menu/terminal during network transfer (when using video_mode=8 - 1920x1080@60Hz) #482

Closed: movisman closed this issue 7 months ago

movisman commented 2 years ago

Hi there,

Hope you are doing well.

I'd like to report an issue with display interference that seems to occur on the static background when viewing the main OSD, when you are at the terminal, or when running a script like update_all. There are no issues when running an actual computer or console core. In my observations, the issue only occurs when there is heavy network traffic - a good example is sending a large file over FTP to the MiSTer.

I have done some relatively extensive testing to eliminate my setup.

The issue happens when using either ethernet from the DE10-Nano, or a USB WiFi adapter. It makes no difference.

When you transfer a large file over FTP, the background will occasionally flicker / jump around / display artifacts at random. A thin, single-pixel line also appears on the right-hand side. Sometimes the flickering is worse than at other times. See this example:

https://photos.app.goo.gl/rXderQRAGhgWAwo89

It doesn't happen if you start a network transfer when running a computer or console core, but it does occur on the menu core when a flat image is used as a background. Only the background is affected with interference/artifacts - the OSD itself remains artifact free. It doesn't make any difference if you use any of the built in backgrounds or any custom background, the issue will still occur. It doesn't appear to happen when the static fuzz is used. The issue will also occur when you are at the terminal, so long as some output is on the screen. As you can see on the video, it is much worse on the terminal rendering it unusable.

If you stop the transfer, the display instantly returns to normal.

The issue only occurs when video mode 8 is configured in MiSTer.ini (1920x1080@60hz). It doesn't happen at all in 1920x1080@50hz (mode 9). It also doesn't happen @ 1280x720 or 1366x768 at either 50 or 60hz.

To eliminate my setup and during initial investigation, I did try the following when trying to work out what was going on:

- 2.4ghz and 5ghz networks
- Two different WiFi adapters
- Tested all different USB ports on the hub
- Tested two different 4A Meanwell PSUs
- Removed all adapters and tried ethernet only, same result
- Three different HDMI cables (all good quality)
- Two different displays (1080p native DELL monitor and 4k native Samsung TV)
- Uploading from two different laptops with different WiFi cards

No change to behaviour.

For reference, I am running a USB hub 2.1 and Digital IO Board v1.2 along with the DE10-Nano, plus a 4A Meanwell PSU. Latest MiSTer build across the board.

My workaround for now is to use video mode 9 - 1080p @ 50hz for the main/menu core. This works with no issue.

However, as 1920x1080@60Hz would be a very common default choice for HDMI users, it might be good to fix it as people might think something else is wrong.

I have also tested an older SD card I had which was a MiSTer build from the end of August. This only shows extremely minor artifacting when transferring over the network - it's barely noticeable and terminal is also usable. I then updated this build to the latest, and all of a sudden the artifacting is much worse again (same as the video above). So I believe something has changed to cause this to happen.

Let me know if you need any more information.

Thanks!

sorgelig commented 2 years ago

Probably DDR3 memory hits the bandwidth limit when the ARM is busy moving large memory chunks... I'm afraid nothing can be done here.

movisman commented 2 years ago

Hi,

Thanks for the reply. Understood. It is a minor issue but thought I should flag it.

I can always change the video mode to 9 (1920x1080 @ 50hz), which doesn't suffer at all, both my monitor and TV sync at 50hz just fine.

One query, why is the interference so much worse now than it was a few builds ago? Is it simply because there is more DDR3 memory in use than before, thus hitting the bandwidth sooner? Or is there more to it?

Thanks a lot!

sorgelig commented 2 years ago

Well, it may be not a bandwidth issue but metastability. It may happen on one DE10 board but not on another. Maybe network access makes the FPGA chip hotter due to intensive ARM/cache/memory usage. It also may not be exactly due to heat. This is a complicated issue. If you can compile the Menu core yourself, then you can try compiling it with a different SEED value, which may shift the metastability in a better direction for your DE10 board.

movisman commented 2 years ago

Hi,

Thanks again for the info. Unfortunately, I'm not familiar with metastability, nor have I compiled my own core before. However, I'm interested in a basic understanding of what metastability is in FPGAs, so I will have a read on the topic.

As for the FPGA getting hot, the issue occurs as soon as the DE10-Nano is powered on after being off for many hours, so unless it gets hot very quickly, I'm not convinced it's overheating and causing artifacts. It also has a heatsink and a Noctua fan pointing directly at it. But if I swap the display mode to 50hz, or to any other resolution at 50 or 60hz, the issue instantly goes away, even when things get busy/intensive with the network and the DE10-Nano has been on for hours.

I did a couple of quick tests, I took release_20210711.rar and release_20211112.7z and flashed them to my spare SD card using the utility. Both run menu from 20210315.

Both exhibit the issue but 20210711 to a lesser extent. The latest 20211112 release there was a lot more noise.

Not sure why this is - however it's probably quite random. If I go back to my 'production' SD card which is on latest, there is just as much noise as on the fresh SD install of 20211112 which I tested.

It doesn't affect the usage of the device - it would be annoying if there was network activity and it produced noise when running a computer/console core, but it only happens on the menu and terminal.

It doesn't necessarily indicate a 'fault' on my DE10-Nano though, would that be correct? Is it something I should be concerned about? Or is it just different characteristics between boards? It is very stable and never hangs up or crashes.

Does yours not show any artifacts on video_mode=8 btw during network activity? Maybe my board is a bit unique! :)

Thanks!

sorgelig commented 2 years ago

It doesn't indicate a faulty board. An FPGA is a plain array of logic cells. Unlike a CPU system, where each state is known and one step comes strictly after another, in an FPGA everything happens in parallel, as in a real circuit. In some cases the delay of a specific signal may break normal operation. Delays depend on several factors, such as voltage fluctuation or heat. Unlike a traditional PCB, where the circuit is routed deliberately with some manual work (to achieve enough tolerance or synchronization), FPGA routing is done entirely by machine (because no human would be able to route thousands of basic cells) and includes a large amount of randomization, as the machine may not understand the schematic as a whole. So even a small change in the core code causes completely different routing (hence my suggestion to try another SEED value). In some cases, when routing is very tight, timings may come close to the edge, and then a small fluctuation of voltage or heat on a specific FPGA chip may shift the timing further and provoke the glitch. This is a brief explanation. You can read more about how FPGAs work on the internet if you are curious.

movisman commented 2 years ago

Hi,

Apologies for the late reply - I've been away this weekend. Thanks very much for the information, I appreciate you taking the time. When I get time I will read up on the basics of how FPGAs work, so I at least have some understanding of the concept. Unfortunately it is not an area I have any familiarity with - however, because I now use MiSTer for my computer/console needs exclusively over any Raspberry Pi or x86 system, it would be good to have a very basic understanding at least.

Good to know that the side effect I am seeing does not indicate a faulty board, or signs of a fault that may develop in the future.

For now, as it doesn't affect stability, I can just live with the artifacts during network activity. Either that or I will simply change the video mode in the .ini to 1920x1080@50hz, which doesn't show any artifacts whatsoever.

I may look into how to compile my own menu core - will do some research on that. If you have any pointers for this or a starting point on SEED values that might be useful. But don't worry about it if not. The above info has already been very useful, if nothing else it has put my mind at ease about the issue. And from what you say it seems not to be a problem with the core itself. It sounds like if the core code was changed again in the future, the issue might even go away by itself depending on how things are routed.

Thanks a lot.

sorgelig commented 2 years ago

The SEED value is used for randomization during routing/fitting (deciding whether to go left or right at each step). There are no "known good" values, obviously, as it's random :) You can only change it (the default is 1) and see the result. With each code change the routing may change, and thus the "golden" seed will be different too.

From my experience, value 1 gives the best routing most of the time. Other values look more like "let's do something unusual this time". But it's just my thought anyway. Quartus developers could tell more :)
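For anyone who wants to try this, here is a minimal sketch of changing the seed before recompiling. It relies on the fact that Quartus keeps the fitter seed in the project's .qsf file as a `set_global_assignment -name SEED <n>` line. The helper name `with_seed` and the sample file contents are only illustrative, and which .qsf file the Menu core actually uses is not stated in this thread.

```python
import re

# Quartus stores the fitter seed in the project's .qsf file as a line like:
#   set_global_assignment -name SEED 1
SEED_LINE = re.compile(
    r"^[ \t]*set_global_assignment[ \t]+-name[ \t]+SEED[ \t]+\d+[ \t]*$",
    re.MULTILINE,
)


def with_seed(qsf_text: str, seed: int) -> str:
    """Return qsf_text with the fitter SEED assignment set to `seed`.

    Hypothetical helper: replaces an existing SEED line, or appends one
    if the file has no explicit SEED assignment.
    """
    new_line = f"set_global_assignment -name SEED {seed}"
    if SEED_LINE.search(qsf_text):
        return SEED_LINE.sub(new_line, qsf_text)
    # No explicit SEED line yet: append one on its own line.
    sep = "" if qsf_text.endswith("\n") or not qsf_text else "\n"
    return qsf_text + sep + new_line + "\n"


# Illustrative .qsf contents, not copied from the real Menu core project.
example = (
    'set_global_assignment -name FAMILY "Cyclone V"\n'
    "set_global_assignment -name SEED 1\n"
)
print(with_seed(example, 3))
```

After changing the seed the core has to be recompiled in Quartus. As noted above, there is no way to predict which value, if any, will behave better on a given board.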

movisman commented 2 years ago

Hi again,

So, I did some more extensive testing, and it doesn't seem to be the menu core causing the issue.

I have tested a lot of combinations and using a process of elimination, the issue is with MiSTer main (not OS or menu). I have worked out the exact MiSTer main release where the issue started, studied the commits and I believe I have found something that might be worth a look, if you don't mind.

I have a lot more data if you want it, but in a brief form:

Please would you mind spending a moment to have a look and see if I might be correct?

I really think I might have found where the issue stems from.

Thank you very much!

sorgelig commented 2 years ago

Hard to say why it has this effect. Probably DDR3 bandwidth usage hits the limit. With a more compressed buffer, probably more accesses happen at the beginning of each line... Hard to say. Btw, you can try fb_size=2 in the ini, which will scale the buffer down to half size while keeping the FHD output resolution. This will reduce bandwidth usage.
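As a rough illustration of why a smaller framebuffer or a lower refresh rate reduces memory traffic, the sketch below estimates the average read rate needed to scan out one framebuffer per displayed frame. Two things here are assumptions rather than facts from this thread: 4 bytes per pixel, and fb_size=2 halving both the width and the height of the buffer. The function name is hypothetical.

```python
BYTES_PER_PIXEL = 4  # assumption: 32-bit pixels; the real framebuffer format may differ


def scanout_bytes_per_second(width, height, refresh_hz, fb_divisor=1):
    """Average bytes/s read from memory to display one framebuffer per frame.

    fb_divisor models fb_size under the assumption that the buffer is
    (width // fb_divisor) x (height // fb_divisor) and is scaled up on output.
    """
    fb_width = width // fb_divisor
    fb_height = height // fb_divisor
    return fb_width * fb_height * BYTES_PER_PIXEL * refresh_hz


cases = {
    "1920x1080@60, full buffer": (1920, 1080, 60, 1),
    "1920x1080@50, full buffer": (1920, 1080, 50, 1),
    "1920x1080@60, half buffer": (1920, 1080, 60, 2),
    "1280x720@60, full buffer": (1280, 720, 60, 1),
}

for label, args in cases.items():
    print(f"{label}: {scanout_bytes_per_second(*args) / 1e6:.1f} MB/s")
```

Under these assumptions, dropping from 60 Hz to 50 Hz cuts the read rate by one sixth, and halving the buffer in both dimensions cuts it to a quarter. The sketch says nothing about where the actual DDR3 limit lies or how much of it the ARM side uses during a transfer.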

movisman commented 2 years ago

Hi,

Thanks for the reply - yes this is what I thought as well, possibly having the framebuffer set at 1080p/60hz combined with busy network traffic is hitting some sort of memory bandwidth limit?

Looking at that commit I linked, it looks like prior to this change the framebuffer was operating at a lower resolution (according to the changes on line 27?). Which makes sense as the artefacts do not appear before this commit was made. So I guess possible bandwidth constraints were not being hit back then, which is why I never saw the problem in the past.

I tried setting fb_size=2 in the .ini and this also removes the artefacts when heavy network traffic is occurring. The trade off is that the terminal and script output displays at 50% of the resolution, as well as any background you might have set. However, I don't use wallpapers (I prefer plain or fuzz) so it isn't an issue for me. Larger terminal output isn't a problem either. I might use this option.

The other option which seems viable is to set video_mode to 1080p/50 instead of 1080p/60, as this doesn't show any artefacts and you can still run the framebuffer at 'full'. Perhaps 60hz is just pushing things slightly too far.

However, the comment for vsync_adjust in the .ini states: "For proper adjusting and to reduce possible out of range pixel clock, use 60Hz HDMI video modes as a base even for 50Hz systems."

However, it doesn't seem to be a problem for me so far using vsync_adjust 1 or 2 - my monitor is able to adjust without any 'out of range' issues.

What I am curious about, is if it is indeed the DDR3 hitting some kind of bandwidth limit, wouldn't other users likely see this as well if they were to test at 1080p or higher? To me it appears to be an issue that would affect most if not all devices. I would imagine (although I cannot test) that resolutions higher than 1080p such as 1920x1440@60 or 2048x1536@60 would likely encounter similar problems.

Thanks a lot as always.

sorgelig commented 2 years ago

> ; For proper adjusting and to reduce possible out of range pixel clock, use 60Hz HDMI video
> ; modes as a base even for 50Hz systems.

This is for the HDMI pixel clock, to make sure it won't go to extremes. It is not directly related to memory bandwidth.

> What I am curious about, is if it is indeed the DDR3 hitting some kind of bandwidth limit, wouldn't other users likely see this as well if they were to test at 1080p or higher? To me it appears to be an issue that would affect most if not all devices. I would imagine (although I cannot test) that resolutions higher than 1080p such as 1920x1440@60 or 2048x1536@60 would likely encounter similar problems.

This is one of the reasons why I don't like to add some features. The HPS part should be as free as possible.

movisman commented 2 years ago

> This is for the HDMI pixel clock, to make sure it won't go to extremes. It is not directly related to memory bandwidth.

Ah yes, I did understand that already, I was just wondering why using 60hz reduces possible out of range pixel clock, vs using 50hz. I wondered why when using 50hz there is a higher risk of things being out of range. I didn't explain that too well.

> This is one of the reasons why I don't like to add some features. The HPS part should be as free as possible.

Agreed. I think people who are asking for too many features/bloat sometimes aren't aware this can be at the cost of performance/stability, which is much more important.

Do you think there is anything that can be done to free up enough resource so that no artefacts appear at 1080p/60 during network activity? I guess not as you would have probably already done it by now :) Do you have a 1080p display to see if you have the same issue at 1080p/60hz? Just wondering really. Like I mentioned before, it is a cosmetic issue, it doesn't cause any crashes. Some might consider a non-issue but a few might be alarmed by it if they don't realise why it occurs.

Unless it can be fixed, my likely workaround would be to stick using 1080p/60hz video_mode=8, but reduce framebuffer to 50% as I don't really care for backgrounds, and the terminal is only used occasionally or to monitor update_all.

Thanks!

sorgelig commented 2 years ago

Usually 50 and 60hz have the same pixel clock with larger blanking fields on 50Hz mode. If you use 50hz mode as a base and core needs 60Hz video then it will boost pixel clock by 20% to get 60hz.

I'm using 2048x1536@60 mode on my development setup. I'm not using the network extensively. Usually I just upload some files from time to time. I haven't noticed such an issue yet.
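The pixel clock relationship described above can be checked with a little arithmetic. The sketch below uses the standard CEA-861 totals (active plus blanking) for 1920x1080 at 60 Hz and 50 Hz. Whether MiSTer's video modes 8 and 9 use exactly these totals is an assumption, and the function names are only illustrative.

```python
def pixel_clock_hz(h_total, v_total, refresh_hz):
    """Pixel clock = total pixels per frame (including blanking) x frames per second."""
    return h_total * v_total * refresh_hz


def clock_scale(base_hz, target_hz):
    """Factor by which the pixel clock must change to go from base to target refresh."""
    return target_hz / base_hz


# Standard CEA-861 totals for 1920x1080 (assumed to match modes 8 and 9).
P60 = pixel_clock_hz(2200, 1125, 60)  # 60 Hz mode
P50 = pixel_clock_hz(2640, 1125, 50)  # 50 Hz mode: same clock, wider horizontal blanking

print(f"1080p60 pixel clock: {P60 / 1e6:.1f} MHz")
print(f"1080p50 pixel clock: {P50 / 1e6:.1f} MHz")
print(f"50 Hz base, core wants 60 Hz: x{clock_scale(50, 60):.3f}")
print(f"60 Hz base, core wants 50 Hz: x{clock_scale(60, 50):.3f}")
```

With a 50 Hz base the clock has to be raised by 20% to reach 60 Hz, while with a 60 Hz base it only has to be lowered by about 17% to reach 50 Hz, which is consistent with the advice to use a 60 Hz mode as the base.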

movisman commented 2 years ago

> Usually 50 and 60hz have the same pixel clock with larger blanking fields on 50Hz mode. If you use 50hz mode as a base and core needs 50Hz video then it will boost pixel clock by 20% to get 60hz.

So, in this case it is more optimal to use 60hz as a base, as you've already outlined in the .ini description? If the core needs 50hz video, and you've got 60hz as a base, it would reduce pixel clock to get to 50hz, is that correct? I assume this is preferable to boosting.

> I'm using 2048x1536@60 mode on my development setup. I'm not using the network extensively. Usually I just upload some files from time to time. I haven't noticed such an issue yet.

Perhaps you can keep one eye out if you ever happen to be sat on the menu, and send a large file (100mb+) over FTP :) I wonder if it exhibits similar behaviour or not.

But you are correct, now the MiSTer is fully set up and up to date, I don't use network that extensively. It is only when sending large .zip files or .vhd images I first noticed it.

sorgelig commented 2 years ago

> If the core needs 50hz video, and you've got 60hz as a base, it would reduce pixel clock to get to 50hz, is that correct? I assume this is preferable to boosting.

Right. There was a typo in my post: "if core needs 50hz" should be "if core needs 60hz" (I've corrected it already). But you've got it right anyway.

movisman commented 2 years ago

Got it. Ah yeah I had already assumed that it was a typo, so all good.

With that in mind, it seems better to stick with 60hz as a base - as I can use vsync_adjust 1 or 2 without issue. Pixel clock can then reduce to 50hz or whatever when the core requests it. But I'll reduce the framebuffer to 50% to stop the terminal/background interference during network activity. Let me know if you ever notice the issue, I'd have thought yours would do it too especially as you run a higher resolution than me. Slightly curious if it is related to my device or not (it really doesn't seem like it - probably just hitting the limits).

Anyway, framebuffer at 50% is a great solution, so thanks for pointing me at that option.

bootsector commented 2 years ago

I’ve seen this happening a few times when running the updater script from SSH. I’m also using video mode 8.

movisman commented 2 years ago

Just wanted to post a couple of 'workarounds' further to the above, which people can use if they come across this issue and they are wondering what is occurring.

For clarity, I did try to compile the menu core with a different SEED value but it behaves the same no matter what. As per the posts above, the suggestion from Sorgelig is it is probably a bandwidth related issue.

@bootsector thanks for confirming that you see this also. It isn't just me then ;)

Still not sure if this just affects some boards and not others. To me if it's a bandwidth issue, you'd think it would affect all.

Here is a better example of what happens (especially from 20 seconds in): https://photos.app.goo.gl/594HsvnBs9fU9yU46

The issue occurs for me when the resolution is set to 1920x1080@60 (it happens at 1920x1080@50 too but is almost unnoticeable). Lower resolutions are totally fine. I cannot test the two higher resolutions (1920x1440@60 and 2048x1536@60) as I have no supported devices. I also tried it over HDMI>VGA using a converter on a different display. Same thing. To see the issue, framebuffer should also be set to auto or full. I am using an official spec and latest digital I/O, USB Hub, plus 128mb SDRAM.

The issue will happen when there is active network transfer on either WiFi or Ethernet (eg. a large file uploading via FTP to MiSTer, for example .vhd - or running the update_all script). When network traffic is done, the interference on screen stops. It will happen during the following scenarios:

It does not happen (most importantly) when using a core. Either way you shouldn't be uploading files to MiSTer when a core is running anyway! :)

To work around the issue, you can either:

a) Stay at 1080p60 and reduce the framebuffer in MiSTer.ini to 2 (fb_size=2). This increases the size of the text in the terminal, as it's now half resolution, but it also has the unwanted effect of making any 1080p custom background half resolution too, so these will look blocky. Any PDF file viewed in the help viewer will also be almost unreadable, as it will be half resolution and doesn't get scaled nicely.

However, it stops the screen jumping about. You can apply this either globally or to the menu core only, by adding an entry for "menu" in MiSTer.ini so that the menu core has its own settings. This way the framebuffer change only affects the menu core.

b) Or, you can add an entry to MiSTer.ini for the menu core only to run it in a less demanding video mode, e.g. 1920x1080@50 (the issue is practically unnoticeable at 50hz). Custom backgrounds and PDFs in the help viewer will also display nicely. This is probably the most effective way to hide the issue, so long as your display supports 50hz. Framebuffer can remain at auto or full.

You can either apply this globally or just to the menu core only, if you add the appropriate entry in MiSTer.ini. Same as above. Display would switch back to 60hz when a core is launched that wants to run in 60hz.
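To make the two workarounds concrete, here is a small sketch that renders the kind of menu-only entry described above. The option names fb_size and video_mode and their values come from this thread. The `[Menu]` section name, and the idea that options placed there apply only to the menu core, are assumptions that should be checked against the comments in your own MiSTer.ini. The helper name is hypothetical.

```python
import configparser
import io


def menu_override(**settings):
    """Render a menu-only section as MiSTer.ini-style text.

    Hypothetical helper; the "[Menu]" section name is an assumption.
    """
    cfg = configparser.ConfigParser()
    cfg["Menu"] = {key: str(value) for key, value in settings.items()}
    out = io.StringIO()
    # Write "key=value" without spaces around "=", matching the style
    # used for settings elsewhere in this thread.
    cfg.write(out, space_around_delimiters=False)
    return out.getvalue()


# Workaround a): keep 1080p60 but halve the framebuffer for the menu core.
print(menu_override(fb_size=2))

# Workaround b): run the menu core at 1920x1080@50 with a full framebuffer.
print(menu_override(video_mode=9))
```

Only one of the two is needed. The printed text would be pasted into MiSTer.ini by hand, since this sketch deliberately does not touch the real file.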

--

I am not sure if there are any other ways to reduce bandwidth usage without display compromises. @sorgelig do you know of anything that can be done on MiSTer to stop those limits being hit during transfer?

This is such a minor issue though. But anyway, perhaps the above is useful to someone. I only just saw the comment by @bootsector and thought I'd follow up.

Thanks!

birdybro commented 2 years ago

menu.zip

Try this out for me and see if there is any difference. :)

movisman commented 2 years ago

> menu.zip
>
> Try this out for me and see if there is any difference. :)

Thanks for coming back to me, I appreciate it. Unfortunately the behaviour is the same though with the menu.rbf you posted above. What did you change?

Thanks a lot!

birdybro commented 2 years ago

I disabled restructuring of the multiplexers in the synthesis settings, which marginally improves timing on the core (which is already meeting timing anyway). I was just hoping it would make some noticeable difference, out of curiosity. It uses a lot more logic space when this is done, so it's probably not something that would be an easy fix anyway. Was mostly curious. :)

movisman commented 2 years ago

> I disabled restructuring of the multiplexers in the synthesis settings, which marginally improves timing on the core (which is already meeting timing anyway). I was just hoping it would make some noticeable difference, out of curiosity. It uses a lot more logic space when this is done, so it's probably not something that would be an easy fix anyway. Was mostly curious. :)

Ah, well thanks very much indeed for trying, I appreciate it. One thing that I find strange is that people have not noticed this phenomenon before, especially as 1080p60 is probably one of the most popular resolutions (although user bootsector mentions above that they have seen it too). I'd have thought it was quite common for people to upload files to their MiSTer while sat on the menu core. Is your configuration such that you can replicate it?

It only started happening once the framebuffer behaviour was changed to match the output resolution by default (auto), as per https://github.com/MiSTer-devel/Main_MiSTer/commit/928b9d3222261fc4070254488c5dfbfa568e8f97 a while back. From the info above, a framebuffer running at full size consumes more bandwidth, and when you add in network activity, maybe things get a little close to the edge. Halving the framebuffer in the .ini does fix the screen jumping, but in turn makes PDF help files unreadable and any custom background rather blocky.

Of course, this is a minor niggle, and workarounds are available to mitigate the issue. I still thought it was worth raising though, as users might be slightly alarmed if their screen started jumping around / flashing a bit during network activity. For now I have personally set my menu core to 1080p50, as bringing down the refresh rate but keeping the same resolution seems to be a good compromise. I believe the issue still occurs in this mode, but you have to look really, really hard for it. At lower resolutions the issue is not seen at all, from what I understand.

Still curious to know if this happens to more people though, or if others don't see it at all.

birdybro commented 2 years ago

I've noticed it, but it hasn't seemed noteworthy to me because I vaguely understood the relationship between HPS and FPGA and the framebuffer. :P

I can understand why most people wouldn't notice it, wouldn't think to mention it, or if they did notice it that they might just assume it's normal. It happens to everyone pretty much, I'm sure.

I've seen and experienced what you are witnessing, it happens on 1080p for me too.

movisman commented 2 years ago

> I've noticed it, but it hasn't seemed noteworthy to me because I vaguely understood the relationship between HPS and FPGA and the framebuffer. :P

No, that's fair enough. Even less noteworthy if you understand HPS/FPGA/Framebuffer topics more than the average user! :)

> I can understand why most people wouldn't notice it, wouldn't think to mention it, or if they did notice it that they might just assume it's normal. It happens to everyone pretty much, I'm sure.
>
> I've seen and experienced what you are witnessing, it happens on 1080p for me too.

Well, this makes me feel better to be honest. I was mainly curious as to why the issue exists, if it can be fixed, or if it is normal behaviour that happens to everyone. Or perhaps it might be something related to particular boards, etc. Of course, it really is a non-issue. It doesn't cause any instability and doesn't happen when you are running a computer/console/other core.

Thanks for the input though - the fact you see it as well does make me feel a lot better that it isn't related to my board or other component causing this effect. That was one of the things I was most curious about, especially as no-one has really mentioned it before (also understandable).

Thanks!

fjsj commented 7 months ago

@movisman do you still have the videos of this issue? The URLs are not working. From the discussion here, I think I'm facing the same issue:

Those issues happen with me even w/o network traffic. But when I use fb_size=2, the first issue is greatly reduced to a couple of pixels and gets less frequent, while the second one seems gone.

So two questions, I would gladly appreciate some directions here:

Sorry if this is a separate issue, but from the description it seems quite similar, and the solution is the same.

birdybro commented 7 months ago

This can sometimes depend on your display's compatibility. I've seen some Sony TVs do this before. I don't see the same thing on my LG G1, for instance, but I did see it on a Sony Bravia at a relative's house. It's not really indicative of a silicon binning inconsistency.

fjsj commented 7 months ago

Thanks for your input @birdybro. I did test different displays: Samsung TV QN85B and Gigabyte Monitor M28U. Both exhibit the same problems. I tried disabling CEC on the TV, tried different PSUs, different HDMI cables (2.0, 2.1), I tested different vsync_adjust configs. All combinations exhibit one of the 2 problems I recorded and shared above. Only solution for me was fb_size=2. So not sure if we have the same issue or some displays are more resilient to such interference/noise (?).

movisman commented 7 months ago

Hi,

Unfortunately I don't have the videos anymore - I did a big clean up of my Google Photos a while back and they probably got discarded.

For me this issue does still occur, but it does look different to yours. For me (and others it seems), the whole screen will judder/shake or flicker, or show artifacts. This goes back to 2021. Release of MiSTer main MiSTer_20210207 worked fine, but since MiSTer_20210222 the issue occurs.

One of the commits between these versions was here: https://github.com/MiSTer-devel/Main_MiSTer/commit/928b9d3222261fc4070254488c5dfbfa568e8f97

I thought this might be related, but I don't know for sure. Something certainly changed between these two versions as the issue simply doesn't exist in release 20210207.

For me it only happens during network transfer, and I've only noticed it when browsing the menu, viewing the terminal (eg. monitoring update_all), or looking at PDF help files when in a core. So it's not a common occurrence as you have to be doing one of these things AND performing a network transfer.

Indeed, it can be solved by changing framebuffer size or reducing resolution and/or refresh rate for the menu core.

Workarounds above are still valid now it seems: https://github.com/MiSTer-devel/Main_MiSTer/issues/482#issuecomment-1139827377

I would find it more irritating if it was there all the time, but for me, in the end I just ignore it as I only get a bit of screen flicker/judder when there is network activity under specific scenarios.

But have always remained somewhat curious about it, and wondered if it would happen on another board or not ;)

birdybro commented 7 months ago

You can probably close this issue @movisman as it's expected behavior given the limitations of HPS and how the data is drawn to the framebuffer in the menu core, as explained by sorgelig.

movisman commented 7 months ago

> You can probably close this issue @movisman as it's expected behavior given the limitations of HPS and how the data is drawn to the framebuffer in the menu core, as explained by sorgelig.

No worries - to be honest I thought this issue was closed long ago until @fjsj replied to it!

I will mark this as closed.

Cheers