denysvitali / linux-on-pixel-c

Documentation on how to run a Linux Distro on the Google Pixel C (2015)
MIT License
61 stars 5 forks source link

Green Bars Issue #1

Closed denysvitali closed 6 years ago

denysvitali commented 7 years ago

Documented here

vartom commented 6 years ago

Bars are obtained due to disruption to the DSI mode even-odd split. a little about the differences you can see here

But this mode should not work if you believe this message https://github.com/denysvitali/linux-smaug/blob/4.14-rc6/drivers/gpu/drm/panel/panel-sharp-lq101r1sx01.c#L227

Edit by @denysvitali: Fixed links

denysvitali commented 6 years ago

Yeah, you probably are right. This may explain the odd behaviour that can be sometimes happen when using our kernel: the left half of the screen goes black, whilst the right side stays "on" with the content displayed correctly.
screenshot from 2017-12-11 08-34-46

I still don't understand what are the circumstances that cause the issue though

Edit: Btw, we're using the JDI LPM102A188A panel

Samt43 commented 6 years ago

@vartom I have exactly the same problem on my device, These green bars are always there and i couldn't find any way to get rid of them until now unfortunatly. I think that yes, the problem come from DSI port configuration but i still can't understand why it works correctly with google stock kernel. "To support a different mode a mechanism needs to be put in place to communicate the configuration back to the DSI host controller." No idea how google did it, but they should have done it somehow

denysvitali commented 6 years ago

Apparently w/ linux-4.8.y (refer to this comment for more information on how to run it) the bars don't appear. There are still some (minor) glitches that disappear after the screen is clicked.

marcan commented 6 years ago

I don't think this has anything to do with DSI. I think this is an issue with the display engine fetching pixels from the framebuffer. Maybe @Gnurou could shed some light on what might cause this?

vartom commented 6 years ago

I analyzed the code in kernel 3.18 and the one that was applied in 4.8 and came to the conclusion that the problem can be hidden only in power management. https://github.com/denysvitali/linux-smaug/commit/87904c3e82319cf2bad8d656d79c5030dab9490e I also found that https://github.com/denysvitali/linux-smaug/commit/f08c32f1123570e03521c9bec2e8628294b66fbd this is a message that in the case of our tablet is simply necessary.

marcan commented 6 years ago

I'm seeing the green bars issue on another device and I'm 99% certain it has nothing whatsoever to do with DSI. I know this because the green bars do not affect the hardware cursor, and because they scale together with the framebuffer. If it were a DSI issue it would affect the entire screen regardless of where the pixels come from. It's possible we're looking at different problems, though, but the end result certainly looks similar...

denysvitali commented 6 years ago

@marcan Is the device you're talking about Tegra (or just NVIDIA) based?
May I know what device is it, and which kernel version works / displays the bars?

marcan commented 6 years ago

It's Tegra210, same as the Pixel C. I'm not doing much development, just testing, but I'll try to drag someone else who is working on it here. The kernel is mainline git (somewhere circa 4.15-rc9).

dhewg commented 6 years ago

I'm seeing similar effects on this cutie: https://imgur.com/a/ZZFDi This is on mainline 5132ede0, somewhere between v4.15-rc9 and v4.15

While a reboot may work around that, this script does too (on this device there's a ~50% chance it'll work afterwards):

cat panel-reinit.sh

!/bin/sh

echo 1 >> /sys/devices/platform/50000000.host1x/drm/graphics/fb0/blank echo 0 >> /sys/devices/platform/50000000.host1x/drm/graphics/fb0/blank

denysvitali commented 6 years ago

@dhewg Yeah, I tried that too (I didn't documented it though), and I can confirm ot works 25% of the time for me. Restarting lightdm (not sure why just lightdm) seems to fix the problem sometimes, and sometimes it triggers it

denysvitali commented 6 years ago

@marcan Have you tried with an older kernel? 4.9.y seems to work fine on the Pixel C. I'm not sure at the moment if it's caused by the Nouveau driver. What panel are you using? The Pixel C uses a JDI LPM102A188A

dhewg commented 6 years ago

We didn't yet try an older kernel. And this is a JDI LPM062M326A panel.

vartom commented 6 years ago

@dhewg The JDI LPM062M326A panel you have connected to two DSI channels? those running in ganged mode? Can you show the driver source for this panel?

Gnurou commented 6 years ago

I have seen that issue, but don't know what the cause is. @thierryreding may have some insight.

dhewg commented 6 years ago

It's not running in ganged mode, but I don't think it matters. Try this:

cat itsnotdsi.sh

!/bin/sh

busybox devmem 0x54200100 32 5 busybox devmem 0x542013b0 32 0 busybox devmem 0x542010f8 32 0x3000000 busybox devmem 0x54201100 32 0x1000100 busybox devmem 0x54201008 32 0x20010000 busybox devmem 0x542013c4 32 0x1000000

This will set up a 256x256 black cursor at 256,256 on the screen. You'll get a black rectangle that punches right through all the green bars. There'd be no punching through if this was a DSI issue.

vartom commented 6 years ago

Why are you so sure that not DSI issue? How can you explain that after removing the PM, issue disappeared? P.s. In the kernel 4.14-rc6, the described rollback of the PM did not lead to a positive result.

marcan commented 6 years ago

A DSI issue would affect the entire screen. The panel doesn't know anything about the cursor. If the cursor is not affected by the green bars, it's guaranteed not to be a panel/DSI issue. If you try that shell script and you get a black square with no green bars, your issue isn't a DSI issue either. If you do get green bars through the cursor, then the problem you're seeing and the one we're seeing are different problems (that just look similar).

If the problem has to do with timing, power, clocking, bad internal state, or really any other "messy" nondeterministic origin (which it probably does, since the bars sometimes appear and sometimes don't), then a change in unrelated code could cause timing differences that change how often it happens, without having directly changed relevant code.

vartom commented 6 years ago

In my case, if i use 4.14-rc6 kernel, I always see only the bars as on the first photo in the first post. They are green or later become blue. Only at 4.8/4.9 with my changes I saw a normal picture on the screen. The script will try to apply later.

Samt43 commented 6 years ago

I have always greens bars on my pixel c with recent kernels. I could do some test some time ago on my pixel C and it seems that display is not setup in even-odd split. I can affirm this by this : if you modify mipi_dsi_dcs_set_page_address by changing the pixel column address for one of the DSI Lane, then only the right of the left part of the screen is modified.

https://github.com/denysvitali/linux-smaug/blob/4.15-rc8/drivers/gpu/drm/panel/panel-jdi-lpm102a188a.c

ret = mipi_dsi_dcs_set_page_address(jdi->dsi, 0,
            jdi->mode->vdisplay - 1);
if (ret < 0)
    DRM_ERROR("failed to set page address: %d\n", ret);

https://imgur.com/a/SGDyX

ret = mipi_dsi_dcs_set_page_address(jdi->slave, 0,
            jdi->mode->vdisplay - 1);
if (ret < 0)
    DRM_ERROR("failed to set page address: %d\n", ret);

https://imgur.com/a/xNTAz

So maybe the issue is not related to even odd mode, or maybe DSI is not in this mode while display is in this mode ?

Samt43 commented 6 years ago

The noise observed on the screen when I did the column modification didn't contains any vertical line... I'm wondering if these lines are not really written on the framebuffer. So strange.

marcan commented 6 years ago

The lines are not in the host framebuffer memory (you can confirm that by dumping /dev/fb0) and they are not introduced by the DSI block or the panel (you can confirm that with the cursor trick mentioned earlier). If you modify the DC window A to not cover the full display the non-covered portion is also free of lines. That means the problem is introduced somewhere in the pixel fetch pipeline. As far as I know that's roughly EMC→MC→SMMU→DC(window A).

vartom commented 6 years ago

del.

marcan commented 6 years ago

I don't think the problem is a simple performance issue (memory subsystem not meeting streaming requirements of the display). The X1 TRM says DC handles underflow by repeating the last valid pixel. That's not what we're seeing here. Enabling underflow debug mode (should show a solid pixel) doesn't change the output. The underflow counters (DC_A_WINBUF_AD_UFLOW_STATUS_0) are zero.

Samt43 commented 6 years ago

I could try the smaug kernel forked by denysvitali, and no green bars anymore ! lightdm and weston are now working without bars, only graphical acceleration seems to be missing due to : MESA-LOADER: failed to retrieve device information gbm: failed to open any driver (search paths /usr/lib/dri) gbm: Last dlopen error: /usr/lib/dri/tegra_dri.so: cannot open shared object file: No such file or directory https://ded1.denv.it/boot-linux-4.8.y-d4ed7e5e8657_20180207_122924.img.unsigned Thanks !

vartom commented 6 years ago

@Samt43 4.8 this is the kernel with the fix I found.

Samt43 commented 6 years ago

@vartom This kernel works like a charm for me :) So the fix doesn't work with more recent versions if i understand ? How did you came to the conclusion 4.8 could work with this ? Thanks again, i'm now investigating why gbm doesn't load the tegra driver correctly and load swrast instead.

vartom commented 6 years ago

I tried similarly to roll back the kernel 4.14. It did not help. the kernel did not start. Maybe I'm just somewhere wrong and not all rolled. A similar patch helped run 4.9, I did not try other versions to run. I analyzed the native 3.18 kernel and compared it to the mainline kernel. I checked two very early versions of 3.18 to test how the screen works with the minimum number of patches made by Google (chrome OS team). After that I found out that only one patch №1( https://github.com/denysvitali/linux-smaug/commit/f08c32f1123570e03521c9bec2e8628294b66fbd ) is needed for the correct operation of the screen. After that, I first tried to apply patch №1 in the maneline kernel. As it turned out, new versions of the kernel contain changes in this portion of the code. Attempts to adapt the patch №1 were unsuccessful. It was after this that I decided to roll back some of the kernel for the ability to apply that patch №1. As it turned out, it immediately gave normal operation of the screen. While I was analyzing the cores, I managed to rewrite the panel driver, took as a basis panel-sharp-lq101r1sx01 from the mainline.

thierryreding commented 6 years ago

@vartom I suppose this could've been caused by the power domain support patches. Patch Nr. 1 that you've linked to seems to intend to reset the device prior to using it. Unfortunately that no longer works since commit 64230aa075864f7c8e270b583bca685304246b57 in mainline. That change is supposedly the right thing to do, but I could imagine that it might break things.

One thing you could try is take a recent mainline (or linux-next) and edit arch/arm64/boot/dts/nvidia/tegra210.dtsi by looking for the two dsi@... nodes and commenting out or removing the power-domains properties and at the same time removing the DSI resets from the SOR power domain (look for the label "pd_sor:" and remove the TEGRA210_CLK_DSI{A,B} entries from the resets property).

The effect of that should be that the code conditionalized on !pdev->dev.pm_domain in the DSI host driver will be enabled, meaning that the driver will take manual control of the reset (like it used to before the generic PM domain support patch).

You should be able to reproduce the effect of the ChromeOS patch by adding a reset_control_assert(dsi->rst) after the devm_reset_control_get() in tegra_dsi_probe().

This is all somewhat complicated, so if you're not in a hurry to test this out, I'll try to whip up a proper patch for you to test when I'm at my workstation tomorrow.

marcan commented 6 years ago

@vartom Did you ever get a chance to run that cursor test? It would be good to know if your problem has the same behavior as the one we're seeing.

@thierryreding do you think issues with the DSI block could affect upstream logic in DC, or perhaps we're seeing two different problems with a similar manifestation here?

vartom commented 6 years ago

@marcan I can not do this cursor test. Since when I see the strip on the screen the system does not boot. Only the kernel is started. Accordingly, the script is not executed.

marcan commented 6 years ago

This fixed it for me. It seems the powergate logic does not reset DC properly, so it comes up in a broken state.

--- a/drivers/gpu/drm/tegra/dc.c
+++ b/drivers/gpu/drm/tegra/dc.c
@@ -2011,7 +2011,7 @@ static const struct tegra_dc_soc_info tegra210_dc_soc_info = {
        .supports_block_linear = true,
        .supports_blending = true,
        .pitch_align = 64,
-       .has_powergate = true,
+       .has_powergate = false,

I wonder if the issue has something to do with the fact that the DC clock is dynamic and derived from the mode. Maybe it's not set up when it is powered up, and it needs a clock to reset properly? Or maybe it's a timing issue and the powergate code does not give DC enough time to reset, or something similar.

delroth commented 6 years ago

Confirmed that disabling powergating for DC seems to fix the issue. I could observe some pseudo-randomness in the DC reg state across reinits before, that has now disappeared.

Increasing the delays in the powergate sequence code to the values dc.c uses doesn't seem to help.

vartom commented 6 years ago

@delroth Did you check this on a pixel С? on which version of the kernel?

cyndis commented 6 years ago

I have had similar coloured stripes on a Jetson TX1 on HDMI output - disabling powergating seems to fix this for me as well.

cyndis commented 6 years ago

This could be related to missing DMA queue flushes on powergating, looking into that

vartom commented 6 years ago

@thierryreding Works great. Thanks for describing the solution. The variant you proposed worked together with the addition of has_powergate = false. Smaug. Kernel 4.15.2 mainline.

thierryreding commented 6 years ago

@vartom @marcan Thanks for that hint. Disabling power gating unconditionally is likely going to break for devices where the bootloader did not power up the display partitions. However, the fact that it works with has_powergate = false makes me think, as @cyndis suggested, that we're running into one of those strange issues where the memory access queueing breaks. There's a set of patches that attempt to fix that here: https://github.com/thierryreding/linux/commits/for-4.17/work

But initial tests (by @cyndis) indicate that that doesn't fix the problem, at least for HDMI. Unfortunately I can't reproduce these bars on any of the devices that I have, so I can't test that these actually fix anything. At least they don't break on any of my setups so far.

Only the top 6 patches are relevant, but the tree is otherwise heavily modified, so they may not apply cleanly on top of v4.15.2. @cyndis was able to cherry-pick them to a recent linux-next, though. It's unlikely that this will fix DSI if it doesn't fix HDMI, but in case you want to give this a try anyway, that's where you'll find the code.

In the meantime, I'll keep investigating what else this could be.

thierryreding commented 6 years ago

Oh, perhaps one interesting test would be to force a modeset and see if things are fixed after that. Maybe repeat a couple of times to see if things break again after subsequent modesets. That would give us a clue about whether the issue is limited to driver probe time or more generally to suspend/resume.

You should be able to force a modeset using the instructions from @dhewg's comment earlier:

echo 1 >> /sys/devices/platform/50000000.host1x/drm/graphics/fb0/blank
echo 0 >> /sys/devices/platform/50000000.host1x/drm/graphics/fb0/blank

which should be equivalent to the slightly shorter:

echo 4 > /sys/class/graphics/fb0/blank
echo 0 > /sys/class/graphics/fb0/blank
marcan commented 6 years ago

@thierryreding your patches don't help... but I found something else that does. This patchset.

https://patchwork.ozlabs.org/project/linux-tegra/list/?series=25347

Unfortunately, it's broken on the legacy API path. tegra_powergate_power_up segfaults because pg->pmc is uninitialized in the path coming from tegra_powergate_sequence_power_up which uses a fake struct tegra_powergate. And of course, checking for NULL elides the workaround and doesn't fix the problem. However, hacking it to always do it does:

--- a/drivers/soc/tegra/pmc.c
+++ b/drivers/soc/tegra/pmc.c
@@ -432,7 +432,7 @@ static int tegra_powergate_power_up(struct tegra_powergate *pg,

        usleep_range(10, 20);

-       if (pg->pmc->soc->needs_mbist_war)
+//     if (pg->pmc->soc->needs_mbist_war)
                err = tegra210_clk_handle_mbist_war(pg->id);
        if (err)
                goto disable_clks;
thierryreding commented 6 years ago

Excellent! I was just looking at that as well and have a proper fix for the segfault that your seeing. I'll upload a new tree in a second.

thierryreding commented 6 years ago

Alright, here's a tree which should include everything that @marcan used to fix this:

https://github.com/thierryreding/linux/commits/for-4.17/work

Nevermind the top patch in that tree, it's not related to Tegra210.

@vartom, @delroth: can you confirm that the MBIST workaround + the PMC segfault make the green bars go away for you? Also, is this reliable in that the green bars don't come back on modesets?

marcan commented 6 years ago

Works for me with thierryreding/linux@13eea63a70 on top of my existing tree which already had the MBIST workaround series merged in. (I hadn't realized pmc was a global available in that file, that's obviously the right solution, heh). Still works after repeated modesets.

thierryreding commented 6 years ago

Yeah, it's not terribly pretty, but we need the global for the early code anyway, so might as well reuse it here. The tegra_powergate_sequence_power_up() is really a legacy API that we should remove eventually, but we haven't fixed up all the drivers yet. I'm sure we'll get there some day, but in the meantime I think this is a good workaround.

Now it's up to Peter for finalizing the MBIST series, then we can merge it all into linux-next and get broader testing. Hopefully that will help you guys make progress on this interesting project.

vartom commented 6 years ago

@thierryreding Yes, and now it works for me (MBIST).

thierryreding commented 6 years ago

@marcan, @vartom: would you mind providing an official Tested-by: on the mailing list for Peter's MBIST patch series?

vartom commented 6 years ago

I do not mind. I'll try what I can.

thierryreding commented 6 years ago

Just in case you don't know how, you'd just reply to either all of Peter's patches with a line saying:

Tested-by: Your Name <your@email>

Or reply to a single patch (usually patch 0/X if there is one, or patch 1/X otherwise) with something like:

For the series:
Tested-by: Your Name <your@email>

That way I can easily pick that up when I apply to the Tegra tree and you get the credits that you deserve for testing this.

marcan commented 6 years ago

@thierryreding Done (hope I got the In-Reply-To right, I wasn't previously subscribed so I had to snoop around archives).

thierryreding commented 6 years ago

@marcan: Looks good. Thanks a lot!