bdbcat / oesenc_pi

GNU General Public License v2.0

oesenc_pi slowdown and display issue on Raspberry Pi 3B #74

Open jongough opened 4 years ago

jongough commented 4 years ago

I have been using the oeSENC charts for Australia on my Linux machine since they became available, and they have worked well. I have just upgraded to the new oeSENC to get the latest charts. However, there appears to be a problem on my Raspberry Pi 3B.

On the Raspberry Pi 3B I get grey areas, and lines showing 'kernel: [drm] Resetting GPU' appear in the system journal. At this point the Pi is essentially locked up and takes 5 or more seconds to respond to any input. OCPN may crash, or lock up the Pi so that it is unusable unless I can zoom in more. CPU consumption drops from ~10-20% to 1% at this time.

I cannot zoom out very far before the problem appears. Due to the slowdown on the Pi and the inability to interact with it, I cannot get a screenshot of it happening.

This is from the Pi. The location is Manly marina, Brisbane, on the SE corner of Qld. The first shot is the starting point, the next shot is after 4 presses of the "-" button, and the last shot is after 1 press of the "+" button. The "Resetting GPU" message shows up in the journal once the "grey areas" start to appear. The system also becomes really unresponsive. The charts are in their own group and currently only the oeSENC charts are being displayed. If I show the CM93 charts instead it all works OK. I have tried both 64M and 128M for the graphics split but it makes no difference. The system is using OpenGL.

[Screenshots: 2020-03-02-121219_1280x1024_scrot, 2020-03-02-121320_1280x1024_scrot, 2020-03-02-121335_1280x1024_scrot]

Further investigation shows that reducing the detail on the charts alleviates the issue of the GPU resetting. If I set the vector chart detail level to 0 on the display/advanced/weight control the system seems to work, whereas a setting of 1 or above causes the problem.

When navigating around Moreton Bay we need as much detail as possible, as it is all shallow, hence I normally use a setting of 2 on the Pi. I would like to use a setting of 5 (which I do on Linux), as it helps to have as much detail as possible.

bdbcat commented 4 years ago

Jon... Is this a new issue, with a recent release of oesenc_pi for rPI? Did it work better with a previous release?

jongough commented 4 years ago

Yes, this only occurred after I updated the plugin for the latest set of charts about 3 weeks ago. I mentioned in the forum that I had the issue on Linux Mint as well although it was crashing rather than giving a GPU error. At the time I could not reproduce it. I am home now and have been looking around at the 'big' picture of the north of Australia and had the Linux system crash, here is the only crash info that shows up:

Mar 02 08:07:52 mishka kernel: opencpn[3492]: segfault at 4 ip 00007f27ee7cdc45 sp 00007ffdcf57bf00 error 4 in liboesenc_pi.so[7f27ee705000+19d000]
Mar 02 08:09:17 mishka kernel: opencpn[3850]: segfault at 4 ip 00007f7b5d982c45 sp 00007fff29462bb0 error 4 in liboesenc_pi.so[7f7b5d8ba000+19d000]
Mar 02 08:20:00 mishka kernel: opencpn[5191]: segfault at 4 ip 00007fbe2e670c45 sp 00007ffe9cc65c20 error 4 in liboesenc_pi.so[7fbe2e5a8000+19d000]
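
One thing these three lines appear to have in common: subtracting the library's load address (the first number in the square brackets) from the instruction pointer gives the same offset inside liboesenc_pi.so each time, which suggests the same instruction is faulting in all three crashes. The fault address of 4 with error code 4 (a user-mode read of an unmapped page) would be consistent with reading a field through a null pointer, although that is only an inference from the log. A minimal Python sketch of the arithmetic follows; the helper name `module_offset` and the regular expression are illustrative assumptions, not part of OpenCPN or the plugin.

```python
import re

# Matches the kernel segfault line format shown in the journal excerpt above.
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) "
    r"sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<module>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def module_offset(line):
    """Return (module name, offset of the faulting instruction in the module)."""
    match = SEGFAULT_RE.search(line)
    if match is None:
        raise ValueError("not a kernel segfault line")
    ip = int(match.group("ip"), 16)
    base = int(match.group("base"), 16)
    return match.group("module"), ip - base


# The three journal lines quoted above, copied verbatim.
LOG_LINES = [
    "Mar 02 08:07:52 mishka kernel: opencpn[3492]: segfault at 4 ip 00007f27ee7cdc45 sp 00007ffdcf57bf00 error 4 in liboesenc_pi.so[7f27ee705000+19d000]",
    "Mar 02 08:09:17 mishka kernel: opencpn[3850]: segfault at 4 ip 00007f7b5d982c45 sp 00007fff29462bb0 error 4 in liboesenc_pi.so[7f7b5d8ba000+19d000]",
    "Mar 02 08:20:00 mishka kernel: opencpn[5191]: segfault at 4 ip 00007fbe2e670c45 sp 00007ffe9cc65c20 error 4 in liboesenc_pi.so[7fbe2e5a8000+19d000]",
]

if __name__ == "__main__":
    for line in LOG_LINES:
        module, offset = module_offset(line)
        print(module, hex(offset))  # prints "liboesenc_pi.so 0xc8c45" for each line
```

Given a copy of the library with debug symbols that matches the installed binary, an offset like this could be passed to a tool such as `addr2line -f -C -e liboesenc_pi.so 0xc8c45` to look up the source location.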

bdbcat commented 4 years ago

Do you mean that a standard desktop linux system also crashes? What distro?

bdbcat commented 4 years ago

Jon... OK, sorry, I found the thread on CF https://www.cruisersforum.com/forums/f134/oesenc-4-0-2-crash-on-linux-231067.html#post3085892 Will investigate.

jongough commented 4 years ago

Here is a screenshot from Linux that shows the same issue as on the Pi, but with a detail level of 5. Moving around by mouse dragging can cause OCPN, not Linux, to crash: [Screenshot_2020-03-14_15-05-51]. I 'think' on Linux it may have something to do with the texture cache, but I am not sure. It is MUCH easier to create the issue on the Pi.

bdbcat commented 4 years ago

Jon... I cannot reproduce, yet. Can you run OCPN under ddd or gdb, and get a backtrace at the crash point?

jongough commented 4 years ago

Dave, I ran under gdb and had to do quite a bit of panning around with the mouse, including zooming in and out, to get it to occur.

Screenshot when the crash occurred: [Screenshot_2020-03-15_07-38-33]
Backtrace of the crash: [gdb crash in oesenc.txt]

jongough commented 4 years ago

After the crash I ended the gdb session and started another one. I got the crash when the oesenc charts were being loaded.

Screenshot: [Screenshot_2020-03-15_07-48-17]
gdb backtrace: [gdb crash in oesenc 1.txt]

jongough commented 4 years ago

Also, I just noticed that when scrolling around, the gdb window shows a single-line message, "vklsmd: 0", repeating, sometimes with blank lines between. So something is putting out a message.

bdbcat commented 4 years ago

Jon... I have looked into the rPi3 OpenGL question. After lots of testing and experimentation, I have reluctantly concluded that this is a transient bug in the OpenGL driver for rPi3.

I looked closely at memory utilization. The OpenGL driver on rPi3 shares memory between the GPU and the CPU. The current driver does this dynamically, so the specification of the memory "split" is ignored after initial allocations.

What is happening when the rPi3 slows down is this: when memory gets "tight", the memory map gets dynamically "squished" and optimized, presumably in a separate background CPU thread. Meanwhile, the OCPN memory allocations and draw commands to the GPU continue in the foreground. The slowdown is likely due to serialization of access to the shared memory. Further, there seems to be some race condition where the GPU occasionally simply ignores a draw command, without an error report, while this memory "squish" is happening in the background. Thus the "grey" areas, which are drawn in the default no-object color.

Our only workaround is to reduce the memory footprint of OCPN, in order to reduce the impact of memory juggling between the CPU and the GPU. One way we do that is to reduce the "vector chart detail level" to 0, or even less than 0 for some chart sets. That is the purpose of that control: to accommodate various platform performance capabilities. Another way is to reduce the pixel size of the attached display. I note that this problem does not present on the "official" 7" touch screen display.

Ultimately, I think we are asking too much of this little 1GB ARM processor with super dense ENCs and large displays. I seriously recommend that, if you are navigating actively with these dense ENCs, you upgrade to an rPi4 with extended memory. My tests have shown a remarkable increase in speed and stability on rPi4.

Sorry for the results... Dave
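
To give a rough sense of why display size matters here, the following Python sketch compares the size of a single full-screen colour buffer for the displays mentioned in this thread. It assumes 32-bit colour (4 bytes per pixel), which is an assumption rather than something confirmed for this driver, and it ignores double buffering, textures, and vertex data, so actual GPU memory use will be higher. The function name `framebuffer_bytes` is hypothetical.

```python
def framebuffer_bytes(width, height, bytes_per_pixel=4):
    """Size in bytes of one full-screen buffer (assumes 4 bytes per pixel)."""
    return width * height * bytes_per_pixel


# Resolutions mentioned in this thread; 800x480 is the official 7" touch screen.
DISPLAYS = {
    'official 7" touch screen': (800, 480),
    "1024x768 monitor": (1024, 768),
    "1600x1200 monitor": (1600, 1200),
}

if __name__ == "__main__":
    for name, (width, height) in DISPLAYS.items():
        size_mib = framebuffer_bytes(width, height) / 2**20
        print(f"{name}: {size_mib:.2f} MiB per buffer")
```

Under these assumptions a 1600x1200 buffer is five times the size of an 800x480 one, so every full-screen surface the driver has to keep or move in shared memory grows by the same factor.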

jongough commented 4 years ago

Dave, something changed with the latest version of oeSENC, as before applying this change I had no problems with the rPi3B. It worked fine and could be used with detail level '2'. It is this latest version that seems to use more memory and that started to cause the problem. Is there a reason for the apparent growth in GPU memory requirements? Is there a way to restrict the amount of memory oeSENC can use?

The backtrace I provided and the newer screenshots are from a Linux system. It is much more difficult to create the issue on the Linux system, but it does happen, and Sod's law says it will happen when you can least afford it.

I only have our two navigation computers with these charts installed, so I cannot run it on my test systems to try and get better information. From what you say I will have a look and see if memory is getting tight on the rPi at the time the issue occurs.

bdbcat commented 4 years ago

Jon... Do you know what version you had installed before this change? I've looked at the changelogs, and the last relevant change to the GL code was more than a year ago.

bdbcat commented 4 years ago

Jon... Also, did you always have the same monitor attached to the rPi3?

bdbcat commented 4 years ago

Jon... I have arranged with o-charts to add another set of oeSENC AUS charts to your account, so that you may test on other systems. You should see them in your "My oesenc charts" page on o-charts, and may install them in the normal way. This should allow us to get more information on the question. Dave

jongough commented 4 years ago

> Jon... Also, did you always have the same monitor attached to the rPi3?

Yes, the monitor has not changed. It is the old 1024x768 square format.

jongough commented 4 years ago

> Jon... Do you know what version you had installed before this change? I've looked at the changelogs, and the last relevant change to the GL code was more than a year ago.

I have attached two log files, one from the previous version of oeSENC and one from the current version. Hope this helps.

[opencpn-new version.log] [opencpn-previous version.log]

bdbcat commented 4 years ago

Jon... The older version uses S63 charts only, so not exactly apples-to-apples. Did that older S63 version exhibit any similar display problems?

jongough commented 4 years ago

I had both s63 and oeSENC charts installed. I have now removed the s63 charts when cleaning up, as I no longer have a license to use them and hadn't used them since I started using the oeSENC charts. However, the log files I have given you do have libs63_pi.so installed as well as liboesenc_pi.so.

New oeSENC:
API Version detected: 111
PlugIn Version detected: 400

Old oeSENC:
API Version detected: 111
PlugIn Version detected: 204

bdbcat commented 4 years ago

OK. In that logfile I did not see any oeSENC charts loaded, so I wondered. Do you recall whether the oesenc charts were displayed OK with that plugin version?

bdbcat commented 4 years ago

Also, some new news here. I have access to the original S57 ENCs, from which the oesenc charts were built. And so I can run these charts without a plugin. And...they show the same problem on my rPI with a large-ish screen (1600 x 1200). Not so surprising, since oesenc and native S57 are displayed by OCPN using essentially identical code. More to think about....

jongough commented 4 years ago

I think the last download I did prior to the one in Feb 2020 was on 29/8/2019 and was 'mishkapi3a-AU-2019-17.zip'. This older download worked fine with the old plugin. It was when I wanted to use the new download 'mishkapi31-AU-2020-3.zip' that I was told I had to download the latest version of the oeSENC_pi, which I did. I then started to get the issue.

jongough commented 4 years ago

I have just done a quick test on the Pi 3B. I set the detail level to 5, as I don't have all the boat instruments attached (I may test with a replay of some data I have), and I can get the slowdown to occur. I was checking the memory in use while doing this, and it didn't change much at all. The Pi 3B, running OpenPlotter 1.02 and the latest prod build of OCPN with my normal plugins, has the following memory split as shown by 'top':

KiB Mem: 895028 total, 191228 Free, 378852 Used, 325944 buff/cache

When the slowdown occurs the CPU load, when it is reported, is <1% (shown by the CPU meter at the top right), when it should normally be around 15%. So in this case it does not appear to be running out of memory, although the 'GPU resetting' message is occurring in the log. Perhaps the memory shown by 'top' is not correct. Do you know a better way to display the memory in use dynamically, one that shows the split between the GPU and CPU?
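
One possible way to watch this, offered as a sketch rather than a confirmed answer: on kernels built with CMA (contiguous memory allocator) support, /proc/meminfo includes `CmaTotal` and `CmaFree` fields, and the open-source vc4 OpenGL driver is understood to allocate its buffers from that pool, so these two numbers may reflect GPU-side pressure that the 'top' summary does not show. Whether that holds on this particular OpenPlotter image is an assumption. The function names below are hypothetical, and the CMA values in `SAMPLE` are made-up placeholders, not measurements from the Pi in this thread.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field name: value in kB}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines without a "name: value" shape
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def cma_summary(text):
    """Return CMA total/free/used in kB, or None if the kernel reports no CMA fields."""
    info = parse_meminfo(text)
    total = info.get("CmaTotal")
    free = info.get("CmaFree")
    if total is None or free is None:
        return None
    return {"total_kb": total, "free_kb": free, "used_kb": total - free}


# Placeholder input: MemTotal/MemFree are from the 'top' line above,
# the two Cma* values are invented purely for illustration.
SAMPLE = """MemTotal:         895028 kB
MemFree:          191228 kB
CmaTotal:         262144 kB
CmaFree:          200000 kB
"""

if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as handle:
            text = handle.read()
    except OSError:
        text = SAMPLE  # fall back to the placeholder when not on Linux
    print(cma_summary(text))
```

Running this in a loop while panning and zooming would show whether the CMA pool is close to exhausted when the 'Resetting GPU' message appears. Separately, `vcgencmd get_mem arm` and `vcgencmd get_mem gpu` report the configured static split on a Raspberry Pi.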

jongough commented 4 years ago

> Jon... I have arranged with o-charts to add another set of oeSENC AUS charts to your account, so that you may test on other systems. You should see them in your "My oesenc charts" page on o-charts, and may install them in the normal way. This should allow us to get more information on the question. Dave

Hi Dave, Thanks for that. I am holding off activating this until I work out which is the most useful platform for testing. I don't want to cause issues by changing platforms and needing to reset the access.

bdbcat commented 4 years ago

Jon... "I was told I had to download the latest version of the oeSENC_pi,"

The changes to the plugin requiring an upgrade apply only to the o-charts shop interface. The actual chart management and display did not change. So, you should be able to run the previous plugin, with either chartset, as a comparison. I will try to do the same.

jongough commented 4 years ago

I can find previous versions in the ppa, but there appear to be no downloadable versions. Can you point me to where I can find the version 2.0.2 and 3.0.0 packages, as these are the last two that I would have had installed?

jongough commented 4 years ago

I have managed to install version 3.3.09 using the older name of the plugin but it still has the same issue. I cannot get back to the version I had before as it does not seem to be available in the xenial build stream. If I could find the 2.0.2 version that would work with xenial then I could test that out to see if the issue occurred then.

hreuver0183 commented 4 years ago

Is this the same issue? https://github.com/OpenCPN/OpenCPN/issues/1791 Crash under heavy load and memory usage?

With an RPI with about 500MB free from 1GB and with an arm64 with 800-1400MB free from 4GB?

bdbcat commented 4 years ago

Probably the same. I am investigating further.

rgleason commented 4 years ago

if [ -n "$BUILD_GTK3" ]; then
    sudo update-alternatives --set wx-config \
        /usr/lib/*-linux-*/wx/config/gtk3-unicode-3.0
fi

cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_INSTALL_PREFIX=/usr ..   <---LN29
make -sj2
make package

hreuver0183 commented 4 years ago

When I add some fprintf calls, most of the time it gets past this point:

// Line Simple Style
int s52plib::RenderLSLegacy( ObjRazRules *rzRules, Rules *rules, ViewPort *vp )
{
    fprintf(stderr,"RenderLSLegacy #1\n");
    if( !rzRules->obj->m_chart_context->chart )
        return RenderLSPlugIn( rzRules, rules, vp );

    // Must be cm93
    S52color *c;

It lands between the fprintf's below:

for( int iseg = 0; iseg < rzRules->obj->m_n_lsindex; iseg++ ) {
    int seg_index = iseg * 3;
    fprintf(stderr,"RenderLSLegacy #6a x\n");
    index_run = &rzRules->obj->m_lsindex_array[seg_index];

    // Get first connected node
    unsigned int inode = *index_run++;

    // Get the edge
    unsigned int enode = *index_run++;
    VE_Element *pedge = 0;
    fprintf(stderr,"RenderLSLegacy #6a xx\n");
    if(enode) pedge = (*ve_hash)[enode];

    // Get last connected node

It does not always reach the first fprintf, but it always fails in s52plib::RenderLSLegacy. Therefore I wonder if it is caused by a mismatch between memory being freed and memory subsequently being claimed. My guess is that in the above lines the memory use increases by a large amount (?).

I've never seen the "xx" line on stderr.

bdbcat commented 4 years ago

hreuver0183... Just to be absolutely clear: You are NOT using OpenGL, is that correct? Please confirm.

If so, then this is a different root cause. Jon is crashing due to OpenGL shared memory contention, we think. Dave

hreuver0183 commented 4 years ago

@bdbcat No OpenGL. No soft OpenGL.

I will stop replying here to avoid messing up this issue.

bdbcat commented 4 years ago

OK, I will address your issue separately. Thanks