Peter-van-Tol / LiteX-CNC

Generic CNC firmware and driver for FPGA cards which are supported by LiteX
GNU General Public License v3.0

starting LinuxCNC fails with new driver #28

Closed OJthe123 closed 1 year ago

OJthe123 commented 1 year ago

Hi. I did "git pull" to get the updated files from 11-add-external....

Then I installed the new drivers with "litexcnc install_driver" and rebuilt the firmware.

Since then, LinuxCNC won't start anymore (report attached: linuxcnc-report.txt).

When I run halrun and load the driver, there are no errors:

loadrt litexcnc
loadrt litexcnc_eth connection_string="192.168.178.150"

Do I have to change something in the INI for the new driver version?

ozzyrob commented 1 year ago

This is the interesting bit, on line #80 of the report:

Waiting for component 'inihal' to become ready......................................
A configuration error is preventing LinuxCNC from starting.

It's really hard to debug without all the ini & hal files.

Peter-van-Tol commented 1 year ago

Thank you for your response @ozzyrob. It is indeed a configuration error, because the card does connect, buffers are allocated, and everything is cleaned up afterwards.

Please share your INI and HAL files.

OJthe123 commented 1 year ago

Sure. I should have attached the files...

The thing is, I didn't change anything. You said you only made a bug fix for the overflow in the stepgen and something for the RIP install? So that should have nothing to do with my HAL and INI files, right? Attached: Semse.hal.txt, Semse.ini.txt

Peter-van-Tol commented 1 year ago

Let me check one thing. Did you come from the main branch or did you update from branch 11?

If you updated from branch 11: yes, only the RIP install and the overflow have been changed.

If you come from the main branch: the config file should be updated and the firmware rebuilt.

Can you also share your config JSON?

At this moment I'm on holiday, as soon as I'm back I will build your config and investigate.

Also, if there are any logs available from LinuxCNC, please share them. I have a feeling a pin has been renamed somewhere in the process. (yes, that's on me)

OJthe123 commented 1 year ago

Yes, I came from 11-add-externals. I used the new JSON format and it worked fine. Only after pulling the latest update I can no longer start LinuxCNC.

ozzyrob commented 1 year ago

If you can describe your system (version of LinuxCNC, Debian version and kernel version) and your method of installing LinuxCNC, I can try to replicate. I would also need your JSON file and the following files mentioned in your INI file:

HALFILE = custom.hal
POSTGUI_HALFILE = postgui_call_list.hal
SHUTDOWN = shutdown.hal

If that is OK with you, Pete. I'm unemployed ATM and looking to keep my mind occupied.

OJthe123 commented 1 year ago

All the information should be in the report I attached in the first post above. I cloned the repo and installed with:

git clone --single-branch 11-external....
poetry install
poetry shell
pip3 install click
pip3 install yapps
litexcnc install_driver

The custom and other HAL files are empty. It is a very basic LCNC setup just for testing.

Peter-van-Tol commented 1 year ago

@ozzyrob: I'm completely okay with that if you can try to help others. On the other hand, I hope you're in a job again soon.

ozzyrob commented 1 year ago

Cheers Pete

Yeah sorry mate, you're right, was first thing in the morning down under. Just still need the json file so I can build the firmware and flash the fpga.

OJthe123 commented 1 year ago

Sorry. I am not at home till end of month. I used the example json. Just added the index pins to encoders. 3 pwm. 4 stepgens.

ozzyrob commented 1 year ago

I'm going to take a stab at this and suggest it could be a latency issue. Have you tried isolcpus=1 when booting?

The reason I think this is that I tried a simple sample config that uses the hal_speaker component and, as my kernel was non-real-time, I was getting these errors:

waiting for s.joints<0>, s.kinematics_type<0>
waiting for s.joints<0>, s.kinematics_type<0>
waiting for s.joints<0>, s.kinematics_type<0>
waiting for s.joints<0>, s.kinematics_type<0>
waiting for s.joints<0>, s.kinematics_type<0>
waiting for s.joints<0>, s.kinematics_type<0>
USRMOT: ERROR: command timeout

OJthe123 commented 1 year ago

It could be. But as I said, I did not change any other thing. I can start and use LinuxCNC with LiteX with no problems when I switch back to the "old" drivers

ozzyrob commented 1 year ago

Can confirm the main branch is OK, although I get "Apply time exceeded limits" with 2.9; at least LinuxCNC loads. I haven't tested 2.8 yet.

With 11-add-external on Buster with Linuxcnc 2.8 & Bookworm with Linuxcnc 2.9 sometimes I'm seeing errors as mentioned by OP, sometimes the system is just "freezing".

OJthe123 commented 1 year ago

When I am back home I will go through the components step by step to find which one causes my error.

ozzyrob commented 1 year ago

No probs mate, thanks for all your hard work.

Any thoughts on "Apply Time exceeds limits" on the main branch ?

Peter-van-Tol commented 1 year ago

The apply time exceeding limits can be due to:

I did not experience any lock-ups on my machines, though. I'm running on an RPi 4 using isolcpus=1,2,3 for best latency results, although isolcpus=2,3 also yields good results. Generally it is recommended (no source, this is off the top of my head) to isolate a pair.

My response time might be higher due to holiday. My apologies for any inconvenience.

ozzyrob commented 1 year ago

No rush mate, it's all sweet & cruisy Down Under, no need to apologise. Enjoy your holiday. To be fair I'm testing on Lenovo T530 with a Dual core i5 (it's my favourite, but shhh don't tell my other computers ;) ). The real machine will be a PC with a quad core i5. I just use the laptop as I sit in the living room rather than isolated somewhere else.

So I'll do a bit more testing Tomorrow.

OJthe123 commented 1 year ago

I also have plenty of time with this. My lathe doesn't need the speeds at which the overflow can occur, and I don't use a RIP install. I am fine with the version that works for me 😄👍

Peter-van-Tol commented 1 year ago

Back from my holiday and sadly: I can reproduce this bug. Starting my machine leads to freezing of LinuxCNC. Seems that some processes are not running any more.

Edit: I have to investigate it further. When disabling litexcnc completely, it still fails to start.

OJthe123 commented 1 year ago

In the LinuxCNC forum someone mentioned that this happens when he adds stepgens to the JSON config.

Peter-van-Tol commented 1 year ago

At this moment I'm thinking my installation is completely broken:

Reinstalled LinuxCNC to no avail. Today I'm going to format the image to see whether I can create a working config again....

ozzyrob commented 1 year ago

You think it's corrupting something in the actual Linuxcnc installation ?

OJthe123 commented 1 year ago

https://www.dropbox.com/s/h1v0j1btdzi96ia/VID_20230804_093236.mp4?dl=0

Hey. I think it is time to show the world that this project is not only bugs and feature requests 😄

This is with 11-external before the last update

ozzyrob commented 1 year ago

So 11-external was working ? Looking good. Have you got any more detail of the X axis ? I'm trying to come up with something simple for my Myford ML/S7 Frankenstein.

OJthe123 commented 1 year ago

https://www.dropbox.com/s/e3lnlcgugkhil5h/IMG_20230804_094837.jpg?dl=0

https://www.dropbox.com/s/4x0orttlawtoq6c/IMG_20230804_094831.jpg?dl=0

https://www.dropbox.com/s/m911beann6je7yi/IMG_20230804_094825.jpg?dl=0

It is a JMC iHSV57 180 W servo with a 1204 ball screw (KUS) and a self-made mounting plate.

ozzyrob commented 1 year ago

Cheers, nice solution.

ozzyrob commented 1 year ago

OK rolled back 11-add-external-extensions-to-litexcnc to:

commit ba57141686940a113f1d2394c17f069025eb3770
Author: Peter van Tol <petervantol@gmail.com>
Date:   Wed Jul 12 10:33:57 2023 +0200

    pip vs. pip3

Was able to get the config running, but on a quad-core Intel Core i5-3470 with 3 cores isolated I was still getting:

Litexcnc: Apply time exceeded limits.litexcnc: Apply time exceeded limits.
Apply time exceeding limits (too long): 69026366277, 69026365867, 69026405879

That was with only LinuxCNC being run from a terminal. What should the watchdog be set at?

Peter-van-Tol commented 1 year ago

@OJthe123 : how would you like the idea of creating a "show your machine" page in the documentation?

@ozzyrob : this is one of the problems which has been resolved between your rollback version and the current version. Because I'm experiencing the same problem (reinstalling atm), I hope the problem is fixed. Otherwise I'll make a patch for you.

OJthe123 commented 1 year ago

@Peter-van-Tol : Sure, no problem. What do you need?

I also have the "Apply time.." info. But I cannot say that has any impact on my machine... But what I noticed, is that the calculated(?) encoder.velocity is 25% higher than the actual servo speed. I scale it down with the position-scale to fix it at the moment. Could be calculation error, or really the servo speed is off. I have no other possibility to measure it
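
The position-scale workaround described above comes down to simple arithmetic. The sketch below assumes the usual LinuxCNC encoder convention, where the reported velocity is the raw count rate divided by position-scale; that convention is an assumption here, and `sketch_corrected_scale` is a hypothetical helper for illustration, not part of LiteX-CNC.

```c
/*
 * Hypothetical helper (not part of LiteX-CNC). Assumes the reported
 * velocity is (counts per second) / position_scale, as in the stock
 * LinuxCNC encoder component.
 *
 * Returns the position-scale that makes the reported velocity match
 * the actual one, or 0.0 when the inputs cannot be used.
 */
double sketch_corrected_scale(double current_scale,
                              double reported_velocity,
                              double actual_velocity)
{
    if (current_scale == 0.0 || actual_velocity == 0.0) {
        return 0.0;
    }
    /* A reading that is too high by a factor k needs a scale k times larger. */
    return current_scale * (reported_velocity / actual_velocity);
}
```

With a reading that is 25% too high (reported 1.25, actual 1.0) and a scale of 1000, the helper returns 1250. It does not say whether the offset comes from the velocity calculation or from the servo itself; that still needs an independent measurement.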

OJthe123 commented 1 year ago

Here are my machine files... semse.7z.zip

Peter-van-Tol commented 1 year ago

Spent today reinstalling LinuxCNC on my Raspberry Pi, but to no avail. Something has apparently changed and prevents the real-time components from starting (i.e. emcTrajInit failed)...

edit: installed the following versions:

Both give the same error on my RPi, how is that possible?

ozzyrob commented 1 year ago

Any luck yet ? Sounds like a real PITA. :(

OJthe123 commented 1 year ago

Just one more success story 😄

G76 Threading cycle also works

https://www.dropbox.com/scl/fi/gd01e3tdoovll69d235b6/VID_20230808_141143.mp4?rlkey=5lxysurznfe3msimlakpjrkz7&dl=0

Peter-van-Tol commented 1 year ago

@OJthe123 : Nice

In the meanwhile, my RPi is showing signs of life again, no errors when starting a simple configuration. Now going to install LitexCNC and re-build... The error was in the end PEBKAC (configuration error)

Peter-van-Tol commented 1 year ago

Recompiled everything and can reproduce the error with the following hal file

loadrt [KINS]KINEMATICS
loadrt [EMCMOT]EMCMOT servo_period_nsec=[EMCMOT]SERVO_PERIOD num_joints=[KINS]JOINTS

# Connection to the board
loadrt litexcnc extra_modules="toolerator" connections="eth:10.0.0.10"

# Assign to threads
# - LitexCNC
addf EMCO5.read                    servo-thread
# - MOTMOD
addf motion-command-handler        servo-thread
addf motion-controller             servo-thread
# - LitexCNC
# addf EMCO5.write                   servo-thread

The above hal file will run without errors. However, as soon as I enable the write function I get the error:

USRMOT: ERROR: command timeout
emcMotionInit: emcTrajInit failed
Waiting for component 'inihal' to become ready.

It boils down to something which must have changed in the write function which prevents the module from starting. Will further investigate where the write-function fails by shutting down all components and then re-enabling them one by one.

Edit 1: Found the culprit in litexcnc.c in the line:

static void litexcnc_write(void *void_litexcnc, long period) {
    litexcnc_t *litexcnc = void_litexcnc;

    // Check whether the write has been initialized AND the read and write functions
    // are in the recommended order (first read, then write). In the first loop
    // we don't write any data to the FPGA, but it is configured. This is required,
    // because the configuration requires the period to be known, which prevents the
    // configuration from being performed before the HAL-loop starts
    if (!litexcnc->write_loop_has_run) {
        // Check whether the read cycle has been run, if not, the order is not correct
        if (!litexcnc->read_loop_has_run) {
            LITEXCNC_WARN("Read and write functions in incorrect order. Recommended order is read first, then write.\n", litexcnc->fpga->name);
        }
        // Configure the FPGA and set flag that the write function has been done once
        litexcnc_config(void_litexcnc, period); // <== This line blocks the starting of the FPGA and the time-out
        litexcnc->write_loop_has_run = true;
        return;
    }

Edit 2: Found the misbehaving module: stepgen. While determining the best pick-off (and thus the best accuracy) it gets into an infinite loop in some cases...
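
For readers wondering how a search like this can hang: the sketch below is not the LiteX-CNC stepgen code. It only illustrates, under assumed limits, how a pick-off style search can be written so that it always terminates. The function name, the 0..32 shift range and the signed 32-bit register width are all hypothetical.

```c
#include <stdint.h>

/* Hypothetical limits, chosen only for this sketch. */
#define SKETCH_MIN_SHIFT 0
#define SKETCH_MAX_SHIFT 32

/*
 * Return the largest shift (number of fractional bits) for which
 * steps_per_clock * 2^shift still fits in a signed 32-bit register,
 * or -1 when no valid shift exists.
 */
int sketch_find_pick_off(double steps_per_clock)
{
    /*
     * Reject zero, negative, NaN and out-of-range inputs up front.
     * With an open-ended "keep shifting until it fits" loop, inputs
     * like these are the ones that may never satisfy the exit test.
     */
    if (!(steps_per_clock > 0.0) || steps_per_clock > (double)INT32_MAX) {
        return -1;
    }
    /* The loop counter is bounded, so termination is guaranteed. */
    for (int shift = SKETCH_MAX_SHIFT; shift >= SKETCH_MIN_SHIFT; shift--) {
        double scaled = steps_per_clock * (double)(1ULL << shift);
        if (scaled <= (double)INT32_MAX) {
            return shift;
        }
    }
    return -1;
}
```

The design point is that the search runs over a fixed range and reports failure with a sentinel value, so a bad configuration value turns into an error at start-up instead of a blocked real-time thread.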

ozzyrob commented 1 year ago

Was just going to mention that the code in litexcnc.c is the same in the commit ba57141686940a113f1d2394c17f069025eb3770. And that works apart from the apply time messages.

Peter-van-Tol commented 1 year ago

Took some effort, but have found the error. If you pull the latest version of the branch #11 your LinuxCNC should start up again.

ozzyrob commented 1 year ago

Fantastic will try in the morning

ozzyrob commented 1 year ago

Ok gave it a go, tried with the OP's configs.......damned if I could get rid of the following error. Latency is good, I can run a config using steppers with a 25us base thread on this machine. Ping times are good. Tried isolating various cores (4 core i5)

But after that whinge it does start up, just can't jog.

Peter-van-Tol commented 1 year ago

The following error should be gone with the latest commit. There was a difference between Python (firmware) and C (driver) in determining the pick-off. For slow movements this could be compensated by the PID or pos2vel. However, for faster speeds the difference became too big.

The current commit has been tested on my EMCO5, which shows no following error when trying 1500 mm/min whilst using pos2vel as translation between position and velocity.

Peter-van-Tol commented 1 year ago

Continuing from #29 ...

With the config and hal-files from @OJthe123 I can now re-produce the problem. The difference between my setup and his is mainly the scale. Now that has been sorted out, I can start debugging. Just want to close this issue in a proper manner...

I have suggested that the problem might lie in using the pin position-feedback instead of position-prediction. At this moment this seems to resolve the problem in my set-up, at least as far as the following error is concerned. However, with both PID and pos2vel the machine starts to oscillate when the jogging stops.

My observations are:

EDIT Not committed yet, but I got a rock-solid version of pos2vel working at this moment. It is more based on the way LinuxCNC stepgen behaves. Have the feeling that LinuxCNC is tuned on its own stepgen. Upcoming changes will be:

EDIT 2 Finished the re-write of litexcnc_stepgen.c. Tonight I will test this modification (it is a real big clean-up) with loads of enhancements. It does compile, but during the day no way to test it on my equipment.

OJthe123 commented 1 year ago

Awesome work! Do you think it is better to use the pos2vel / position-control in general? I do not really have a tuned PID setup. It is more just a P setup for loop back to the FPGA .

I finished 400 little parts today, which I turned on my lathe. No single problem with LiteX and Colorlite board

Peter-van-Tol commented 1 year ago

Just committed the changes 😃:

As for advice on position vs. PID: if your setup works, there is no need to change. However, the readability and maintainability of the HAL file will improve when using position control. The code below is a minimal example for a single axis, which is roughly 50% smaller than a solution with pid or pos2vel.

    # STEPGEN - X-AXIS
    ########################################################################
    # - Setup of timings
    setp [LITEXCNC](NAME).stepgen.00.position-scale   [JOINT_0]SCALE
    setp [LITEXCNC](NAME).stepgen.00.steplen          5000
    setp [LITEXCNC](NAME).stepgen.00.stepspace        5000
    setp [LITEXCNC](NAME).stepgen.00.dir-hold-time    10000
    setp [LITEXCNC](NAME).stepgen.00.dir-setup-time   10000
    setp [LITEXCNC](NAME).stepgen.00.max-velocity     [JOINT_0]MAX_VELOCITY
    setp [LITEXCNC](NAME).stepgen.00.max-acceleration [JOINT_0]STEPGEN_MAXACCEL
    # setp [LITEXCNC](NAME).stepgen.00.debug 1
    # - Connect velocity command
    net xpos_cmd joint.0.motor-pos-cmd => [LITEXCNC](NAME).stepgen.00.position-cmd
    net xpos_cmd joint.0.motor-pos-fb  <= [LITEXCNC](NAME).stepgen.00.position-prediction
    # - enable the drive
    net xenable joint.0.amp-enable-out => [LITEXCNC](NAME).stepgen.00.enable

I would really appreciate if you would test this latest version, so this issue can be closed as resolved.

OJthe123 commented 1 year ago

You rock, dude! Changed the drivers and built new firmware. Tested my setup with 3000 mm/min. No errors. Maybe it could go faster, but I did not want to kill my machine in case of a bug in the firmware or driver 😆

EDIT: just for those who will copy & paste, there are two typos. corrected below.

    # STEPGEN - X-AXIS
    ########################################################################
    # - Setup of timings
    setp [LITEXCNC](NAME).stepgen.00.position-scale   [JOINT_0]STEP_SCALE   # typo
    setp [LITEXCNC](NAME).stepgen.00.steplen          5000
    setp [LITEXCNC](NAME).stepgen.00.stepspace        5000
    setp [LITEXCNC](NAME).stepgen.00.dir-hold-time    10000
    setp [LITEXCNC](NAME).stepgen.00.dir-setup-time   10000
    setp [LITEXCNC](NAME).stepgen.00.max-velocity     [JOINT_0]MAX_VELOCITY
    setp [LITEXCNC](NAME).stepgen.00.max-acceleration [JOINT_0]STEPGEN_MAXACCEL
    # setp [LITEXCNC](NAME).stepgen.00.debug 1
    # - Connect velocity command
    net xpos_cmd joint.0.motor-pos-cmd => [LITEXCNC](NAME).stepgen.00.position-cmd
    net xpos_fb joint.0.motor-pos-fb  <= [LITEXCNC](NAME).stepgen.00.position-prediction   # typo
    # - enable the drive
    net xenable joint.0.amp-enable-out => [LITEXCNC](NAME).stepgen.00.enable

ozzyrob commented 1 year ago

Sorry guys for being a bit quiet, I've been playing with some 7c81 firmware on a Spartan 6 dev board. When I get the chance I'll set up my machine that I use for testing.

ozzyrob commented 1 year ago

Really happy, Pete. I owe you at least a beer. Had a play around: no following errors, running the sample code passed, jogging via keyboard passed, MDI passed. The only issue was on shutdown. Starting up again runs fine, but it still quits with the same message.

Shutting down and cleaning up LinuxCNC...
task: 48698 cycles, min=0.000007, max=0.097561, avg=0.009917, 0 latency excursions (> 10x expected cycle time of 0.010000s)
litexcnc/Semse: Watchdog timeout not set. Using default value 0 ns (3 times period).
litexcnc: LitexCNC etherbone driver unloaded
rtapi_app: caught signal 11 - dumping core
:0: exit value: 255
:0: rmmod failed, returned -1
Waited 3 seconds for master. giving up.
Note: Using POSIX realtime
motmod: not loaded
:0: exit value: 255
:0: rmmod failed, returned -1
Note: Using POSIX realtime
trivkins: not loaded
:0: exit value: 255
:0: rmmod failed, returned -1
Note: Using POSIX realtime
homemod: not loaded
:0: exit value: 255
:0: rmmod failed, returned -1
Note: Using POSIX realtime
tpmod: not loaded
:0: exit value: 255
:0: rmmod failed, returned -1
:0: unloadrt failed
Note: Using POSIX realtime

OJthe123 commented 1 year ago

Can confirm. When I run from a terminal I see the same output. I still can't see any effect on the machine.

Shutting down and cleaning up LinuxCNC...
Running HAL shutdown script
task: 603 cycles, min=0.000041, max=0.012258, avg=0.009716, 0 latency excursions (> 10x expected cycle time of 0.010000s)
mb2hal quit_signal DEBUG: signal [15] received
mb2hal quit_cleanup DEBUG: started
mb2hal quit_cleanup DEBUG: unloading HAL module [16] ret[0]
mb2hal quit_cleanup DEBUG: done OK
mb2hal main OK: going to exit!
litexcnc: LitexCNC etherbone driver unloaded 
rtapi_app: caught signal 11 - dumping core
free(): invalid pointer
<commandline>:0: exit value: 255
<commandline>:0: rmmod failed, returned -1
Waited 3 seconds for master.  giving up.
Note: Using POSIX realtime
motmod: not loaded
<commandline>:0: exit value: 255
<commandline>:0: rmmod failed, returned -1
Note: Using POSIX realtime
trivkins: not loaded
<commandline>:0: exit value: 255
<commandline>:0: rmmod failed, returned -1
<commandline>:0: unloadrt failed
Note: Using POSIX realtime
Peter-van-Tol commented 1 year ago

This error is due to an old loadrt statement in your hal-files. You have now:

loadrt litexcnc
loadrt litexcnc_eth connection_string="192.168.178.150"

This should be combined to the following single statement:

loadrt litexcnc connection_string="eth:192.168.178.150"

Why does this error emerge only now? Because the FPGA is reset to its safe state when LinuxCNC is unloaded, which means that litexcnc sends one last message to the FPGA. When the driver is loaded using two separate statements, the etherbone driver has already been unloaded (and its memory freed) by then. Writing to a closed device without allocated memory therefore leads to a core dump.
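
The failure mode described here, a final write through a transport that has already been freed, can be shown in isolation. The sketch below is not how the LiteX-CNC driver is written; every type and function name in it is hypothetical. It models a board that holds a pointer to its transport, clears that pointer on unload, and refuses the "safe state" write once the transport is gone.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for this sketch; not the LiteX-CNC data structures. */
typedef struct {
    unsigned char *buffer; /* memory owned by the transport driver */
    size_t size;
} sketch_transport_t;

typedef struct {
    sketch_transport_t *transport; /* NULL once the transport is unloaded */
} sketch_board_t;

/* Allocate a transport with a zeroed buffer; returns NULL on failure. */
sketch_transport_t *sketch_transport_load(size_t size)
{
    sketch_transport_t *t = malloc(sizeof(*t));
    if (t == NULL) {
        return NULL;
    }
    t->buffer = calloc(size, 1);
    if (t->buffer == NULL) {
        free(t);
        return NULL;
    }
    t->size = size;
    return t;
}

/* Free the transport and clear the pointer so later writes can detect it. */
void sketch_transport_unload(sketch_board_t *board)
{
    if (board == NULL || board->transport == NULL) {
        return;
    }
    free(board->transport->buffer);
    free(board->transport);
    board->transport = NULL;
}

/* Write the "safe state"; returns 0 on success, -1 if the transport is gone. */
int sketch_write_safe_state(sketch_board_t *board)
{
    if (board == NULL || board->transport == NULL) {
        return -1;
    }
    memset(board->transport->buffer, 0, board->transport->size);
    return 0;
}
```

Without the NULL check, the last call would write through a dangling pointer, which is the kind of access that ends in "caught signal 11". Loading everything through the single loadrt statement avoids the situation in the first place, because the transport is then not unloaded separately before the final write.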

I will close this issue, as the original problem has been solved. In another issue I will unpublish the litexcnc_eth component, so it cannot be inadvertently used as a stand-alone component.

Peter-van-Tol commented 1 year ago

@ozzyrob : for beer that would be then a VB please 🍻 ...

But to be honest: the beer would be on me. Thank you for your support, testing and time spent to make this possible and closing this issue.