LinuxCNC is super slow due to highlighting in the preview widget

knipknap commented 1 year ago

Here are the steps I follow to reproduce the issue:

1. Open a large-ish gcode file; 500K is large enough on my machine. Note that it opens relatively quickly. Good.
2. Drag the preview around with the mouse, but do not release the mouse button. Note that this works smoothly. Good.
3. Release the mouse button from dragging. Note that the UI freezes completely for a long time (in my case, 25 seconds on a 700K file, 10 minutes on a 3MB file).
4. Exactly when the UI finally recovers, a line in the preview is now highlighted.

In addition, the freeze is also triggered in other situations, such as a tool change in a running program. In general, the buttons are mostly unusable during any running gcode program.

This is what I expected to happen:

No freezes.

Information about my hardware and software:

LinuxCNC official ISO, 2.8.4-1-gb7824717b

But the problem exists in master, too.

Linux cnc 4.19.0-21-rt-amd64 #1 SMP PREEMPT RT Debian 4.19.249-2 (2022-06-30) x86_64 GNU/Linux

Distributor ID: Debian
Description:    Debian GNU/Linux 10 (buster)
Release:    10
Codename:   buster

$ cat /proc/cpuinfo | head -5
processor    : 0
vendor_id    : GenuineIntel
cpu family    : 6
model        : 122
model name    : Intel(R) Celeron(R) J4125 CPU @ 2.00GHz

8GB RAM

Other info

The problem disappears when commenting out the lines calling glCallList() in glcanon.py:

https://github.com/LinuxCNC/linuxcnc/blob/master/lib/python/rs274/glcanon.py#L590

After that, LinuxCNC runs super fast - I can work with a 30MB file with no problem, and the preview still works (just without highlighting, of course).

c-morley commented 1 year ago

What screen are you using? This is a known problem. Depending on the screen, you may be able to disable line highlighting.

knipknap commented 1 year ago

I'm using Gmoccapy, I found no option for disabling the highlighting.

c-morley commented 1 year ago

Ok. QtDragon has this option. Maybe this could be added to GMoccapy.

The actual problem is the search for the gcode line in megabytes of data. It's difficult to get Python programs to do two things at once. There may be advanced ways to do this that we haven't discovered yet.
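One generic way to keep a UI responsive during a slow search is to run the search on a worker thread and poll for the result from a timer or idle handler. The following is only a minimal sketch of that idea, not LinuxCNC code: `find_line` and `BackgroundSearch` are hypothetical names, and the search is a plain list scan standing in for the real one. The actual selection in glcanon.py goes through OpenGL calls, which usually have to run on the thread that owns the GL context, so this pattern may not transfer directly.

```python
from concurrent.futures import ThreadPoolExecutor


def find_line(lines, target):
    """Hypothetical slow search: 1-based number of the first matching line, or None."""
    for number, text in enumerate(lines, start=1):
        if text == target:
            return number
    return None


class BackgroundSearch:
    """Runs find_line() on a worker thread so the caller never blocks."""

    def __init__(self):
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._future = None

    def start(self, lines, target):
        # Hand the search to the worker thread and return immediately.
        self._future = self._pool.submit(find_line, lines, target)

    def poll(self):
        """Return (done, result). Intended to be called from a UI timer; never blocks."""
        if self._future is None or not self._future.done():
            return (False, None)
        return (True, self._future.result())

    def close(self):
        # Wait for any running search to finish and release the worker thread.
        self._pool.shutdown(wait=True)
```

A screen would call `start()` when a selection is made and call `poll()` from its periodic update, applying the highlight only once `done` is true.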

hansu commented 1 year ago

It's also possible with Gmoccapy, with the same commands for Axis. I just couldn't find it here having only a phone.

knipknap commented 1 year ago

Perhaps it is better to disable this function by default for now? This looks like quite a problem.

@hansu Perhaps you mean using "(AXIS,hide)" in gcode, but that removes the preview entirely. Since that also means you cannot see the path bounds anymore (making setting your file up to avoid workholding hard or even impossible in some cases) I think that is not a workable solution.

c-morley commented 1 year ago

Modifying glcanon in that way is not the right fix. You will break qtvcp's optional behavior.

Modify gremlin's select_fire() function for gladevcp based screens. Surely AXIS's select_fire() function would work for AXIS.

There are probably other ways too, just make sure you don't break one screen's function while fixing another :)

knipknap commented 1 year ago

Of course it isn't a fix, but I think it is better than breaking LinuxCNC for common cases.

Modifying select_fire() wouldn't fix it. glcanon.select() is also called in other situations, such as an auto-tool-change. Not sure where that is invoked, but there are many other triggers, leading to the UI being unusable during running jobs. I actually had a tool crash because of that, due to the pause button having a delay of 10 minutes(!).

c-morley commented 1 year ago

I didn't mention glcanon.select() - you are in the wrong layer of libraries IMHO. glcanon is imported into AXIS, GMoccapy, Gscreen, Gladevcp and Qtvcp - that is where I am suggesting the fix. It makes it ignore (only ignore) the mouse button selection that records the mouse position that starts the search.

c-morley commented 1 year ago

In Qtvcp I added code that makes it always ignore the selection when running a program in auto mode. I also added the option to completely ignore it if required (by a checkbutton in QtDragon's case).
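The behavior described here (ignore the selection while a program runs in auto mode, plus a user switch to ignore it entirely) can be sketched roughly as follows. This is an illustration under assumptions, not the code in qt5_graphics.py: `SelectionGuard`, `do_select` and `is_running_auto` are hypothetical names introduced for the example.

```python
class SelectionGuard:
    """Hypothetical screen-level guard around the expensive selection search."""

    def __init__(self, do_select, is_running_auto, enabled=True):
        self._do_select = do_select              # callable(x, y) that starts the search
        self._is_running_auto = is_running_auto  # callable() -> True while a program runs
        self.enabled = enabled                   # user option, e.g. bound to a checkbutton

    def select_fire(self, x, y):
        """Start the search only when allowed; return True if it was started."""
        if not self.enabled:
            return False  # highlighting switched off by the user
        if self._is_running_auto():
            return False  # never search while a program is running
        self._do_select(x, y)
        return True
```

Note that a guard at this level only covers mouse selections routed through it; any other caller of the underlying search would bypass it.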

knipknap commented 1 year ago

But shouldn't a workaround be implemented near the code that causes the problem? Shifting this responsibility one layer up would make sense if switching off highlighting were a permanently required feature. But the cause is an issue/bug, and switching off highlighting merely a workaround.

c-morley commented 1 year ago

For GMoccapy the select_fire() change would need to be added to gremlin.py, then code added to gmoccapy.py to select/deselect that code.

see qt5_graphics.py for the idea.

knipknap commented 1 year ago

Like I explained in my comment above, modifying select_fire() in gremlin.py doesn't fix it. I tried this first when I started profiling to find the cause, but found quickly that the problematic code is also invoked in other places.

c-morley commented 1 year ago

> But shouldn't a workaround be implemented near the code that causes the problem? Shifting this responsibility one layer up would make sense if switching off highlighting were a permanently required feature. But the cause is an issue/bug, and switching off highlighting merely a workaround.

Depends if you wish to allow different screens to have different behaviors. It's an issue for you because your program is huge, some people don't use huge programs.

c-morley commented 1 year ago

> Like I explained in my comment above, modifying select_fire() in gremlin.py doesn't fix it. I tried this first when I started profiling to find the cause, but found quickly that the problematic code is also invoked in other places.

Please explain more. Are you saying modifying select_fire doesn't change anything or enough? Do you have a file we can test with?

knipknap commented 1 year ago

> Depends if you wish to allow different screens to have different behaviors. It's an issue for you because your program is huge, some people don't use huge programs.

I completely agree that UIs should be able to choose to use the feature, but sometimes bugs don't allow for a perfect situation. What I am saying is that I would weigh "nice to have" highlighting against breaking LinuxCNC in what is IMO a very common scenario. Checking my last ten jobs, 4 of them have a file size >3MB. My next one is 30MB. And it's not even a complex part.

If your files are always smaller I'm curious what you are using LinuxCNC for, because if you use adaptive clearing operations you will almost certainly hit file sizes that currently just don't work anymore - just because of non-essential highlighting.

> Please explain more. Are you saying modifying select_fire doesn't change anything or enough? Do you have a file we can test with?

No, I am saying that the glcanon.select() function isn't just invoked from gremlin.select_fire(). It is also invoked in other places. So while changing select_fire() fixes the problem of dragging in the preview, it does not fix the other problems I have mentioned:

> such as an auto-tool-change. Not sure where that is invoked, but there are many other triggers, leading to the UI being unusable during running jobs. I actually had a tool crash because of the pause button having a delay of 10 minutes(!).

c-morley commented 1 year ago

It's not necessarily 'nice to have' - I believe it's used for run-from-line selection in some screens. Just because people may use linuxcnc differently doesn't make it invalid, e.g. lathe code is usually not MBs long.

Well, I would need specifics to dig into it, i.e. explain the auto-tool-change problem in as much detail as possible. The glcanon.select() function is only there to find a specific gcode line. I'm not sure, and cannot find, where it is called other than from mouse selection, but the code is complex.

We could fix it similarly to your suggestion - it means we have to audit all the screens to make sure we don't break behavior.

knipknap commented 1 year ago

Alright, IMO a workaround is better placed at the source than in every "third party", but I'll stop pushing ;-). I have a fix on my machine, so don't mind one way or the other.

andypugh commented 1 year ago

> I actually had a tool crash because of that, due to the pause button having a delay of 10 minutes(!).

Just as a general point, if at all possible, don't rely on user-space (UI, non-realtime) inputs for important functions.

A hardware switch connected to the "motion" pins ( https://linuxcnc.org/docs/stable/html/man/man9/motion.9.html ) will always work, with no more than 1ms delay. Unfortunately there isn't a pause pin there. There is one in halui ( https://linuxcnc.org/docs/stable/html/man/man1/halui.1.html ) but halui is also a user-space programme. Though it will generally be a lot less likely to be delayed by the system being busy than user-interface interaction.

c-morley commented 1 year ago

To be clear - I want the problem fixed as best we can. My point is the code is complex and interrelated. Your troubleshooting fix is an excellent find - but it's nuclear. I had a similar complaint with qtdragon and seem to have fixed it satisfactorily for a year (2?) or so. So we need to know if this is a new problem, a problem that wasn't fixed on the rest of the screens, or a situation that wasn't tested for. For instance, reloading the screen on very large programs probably takes somewhat longer - does auto probe tool do that - is that some of the problem? We also don't want to break other behavior if we don't have to.

You say that glcanon.select() is invoked in other places - do you know that for a fact (i.e. you can point to a line number in a file) or from the observation that code changes in gremlin didn't completely fix the problem? The more specific the information you can give, the easier it is to decide on the best solution.

And thank you for reporting your trouble.

knipknap commented 1 year ago

I debugged by putting a print() into gremlin.select_fire(), and noticed that it isn't always called when the issue occurs.

I also placed Python's traceback.print_stack() function in glcanon.select() and IIRC I found a call happening from some Logger class, but I didn't dig too deep because I didn't know what the purpose of most of the classes is.
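The debugging technique described here, dropping `traceback.print_stack()` into a function to see who calls it, looks roughly like this in isolation. The function names below are hypothetical stand-ins, not the real glcanon or gremlin code; only `traceback.print_stack()` itself is the standard-library call mentioned above.

```python
import sys
import traceback


def slow_select(x, y, log=sys.stderr):
    """Hypothetical stand-in for the function being investigated."""
    # Temporary instrumentation: write the current call stack to the log,
    # so every code path that reaches this function identifies itself.
    traceback.print_stack(file=log)
    return (x, y)


def from_mouse(log=sys.stderr):
    # One possible caller, e.g. a mouse-release handler.
    return slow_select(1, 2, log=log)


def from_logger(log=sys.stderr):
    # Another possible caller that might otherwise go unnoticed.
    return slow_select(3, 4, log=log)
```

Each call prints one stack trace, and the innermost frames name the calling function, which is how an unexpected caller shows up.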

Note that I did most of my debugging on LinuxCNC 2.8.4, not on master.

rmu75 commented 1 year ago

> I actually had a tool crash because of that, due to the pause button having a delay of 10 minutes(!).

> Just as a general point, if at all possible, don't rely on user-space (UI, non-realtime) inputs for important functions.

> A hardware switch connected to the "motion" pins ( https://linuxcnc.org/docs/stable/html/man/man9/motion.9.html ) will always work, with no more than 1ms delay. Unfortunately there isn't a pause pin there.

Isn't this what motion.feed-inhibit is for?

andypugh commented 1 year ago

> Isn't this what motion.feed-inhibit is for?

I don't think that it is exactly the same as "pause" though I guess the result is similar.

It might be wise to set both, but I have a feeling (would need to test to be sure) that pause latches on, whereas feed-inhibit would release as soon as the button was released.
