LinuxCNC / linuxcnc

LinuxCNC controls CNC machines. It can drive milling machines, lathes, 3d printers, laser cutters, plasma cutters, robot arms, hexapods, and more.
http://linuxcnc.org/
GNU General Public License v2.0
1.76k stars 1.14k forks source link

command data safety between milltask and motion #2386

Closed devinor closed 1 year ago

devinor commented 1 year ago

The send commond code is as follows. In line 95, copy the struct into sharemem https://github.com/LinuxCNC/linuxcnc/blob/27546d2192c157a021ee4eed48e617aaabadc34d/src/emc/motion/usrmotintf.cc#L74-L117

The recive command code is as follows. Determine the data in the shared memory, if head is not equal tail, means has new command.

motion has just executed to 393 lines of code, because there is no new message, so head == tail, continue to execute, if at this time milltask executes to usrmotintf.cc::95 lines of code and assigns a value, will it cause an error in command.c::397 lines of code?

Motion may think that new data has come, and it is complete data, which will cause the following command to execute incorrectly

https://github.com/LinuxCNC/linuxcnc/blob/348b0d196d9fad662472589cc9882985732796d1/src/emc/motion/command.c#L381-L420

SebKuzminsky commented 1 year ago

The recive command code is as follows. Determine the data in the shared memory, if head is not equal tail, means has new command.

No, that's not what the head & tail mean. Each command has a unique head and tail, and they're the same. If head and tail are equal it means a single, complete command is in the emcmotCommand buffer. If Motion sees head and tail with different values, it means Motion checked while Task was busy updating the command. In this case Motion will not process the partial command, and will instead wait, and check back later to see if Task has finished copying it.

I think the communication between Task and Motion is safe, at least from the bug you're chasing here.

Here's how I think this code flows:

  1. Task calls usrmotWriteEmcmotCommand with a new emcmot_command_t in c.
  2. usrmotWriteEmcmotCommand gives the new command a unique head and matching tail, and a unique commandNum (usrmotintf.c lines 85-86).
  3. usrmotWriteEmcmotCommand copies the new command into the memory shared with Motion (usrmotintf.c line 95).
  4. usrmotWriteEmcmotCommand polls the status buffer (also shared with Motion), waiting for this new command's commandNum to be acknowledged by Motion (usrmotintf.c lines 100-102).

Meanwhile, Motion does this:

  1. Make sure there's a complete, consistent command in the shared buffer (command.c line 393).
  2. Make sure it's a new command, i.e. one we haven't handled and acknowledged already (command.c line 397).
  3. Begin but do not finish acknowledging the new command (command.c line 404).
  4. Handle the new command (command.c line 409 and onwards).
  5. Finish acknowledging the new command (command.c line 1911). This lets Task know Motion handled the command, and Task processes Motion's status for the command (usrmotintf.c lines 104-109).

So i think there's no bug here, because Task will only send one command, then wait for it to be acknowledged before sending another.

devinor commented 1 year ago

So i think there's no bug here, because Task will only send one command, then wait for it to be acknowledged before sending another.

I agree with you, Task only send one command, block and ensure the cmd is processed. but there maybe have bug.

Step1

In command.c line 393, if emcmotCommand->head equal emcmotCommand->tail, means has a complete cmd, will execute to line 397

Step2

In command.c line 397, check emcmotCommand->commandNum and emcmotStatus->commandNumEcho, The main purpose is to determine whether there are new command.

In follow case:

if Motion just wanted to execute step 2 (command.c line 397).

At the same time, Task may send new command to Motion, and execute *emcmotCommand = *c; will copy command into sharemem. In this case, emcmotCommand->commandNum may not equal emcmotStatus->commandNumEcho, but Task has not finished copying it. Because commandNum is at the beginning of the structure emcmotCommand, so it will be copied first.

if the size of emcmotCommand is very small, The copying process is very fast. may have no problem. if the size of emcmotCommand is very large, The copying process is slow, Motion maybe use incomplete command in next process.

Sorry, English is not my mother tongue, please bear with me

SebKuzminsky commented 1 year ago

At the same time, Task may send new command to Motion

Task will not send a new command until Motion reports that it is finished handling the current command.

Task's usrmotWriteEmcmotCommand sends a command and then blocks until Motion reports that it is finished handling the command and is ready for a new one. The blocking is handled by the "poll for receipt of command" loop: https://github.com/LinuxCNC/linuxcnc/blob/27546d2192c157a021ee4eed48e617aaabadc34d/src/emc/motion/usrmotintf.cc#L96-L113

Motion reports to Task that is done with a command by using the Status buffer, by copying the command's commandNum to the Status buffer's commandNumEcho. The status buffer is guaranteed consistent by a head/tail synchronization mechanism just like the command buffer.

So I think there is no bug here.

Thank you for caring about the correctness of LinuxCNC! If I'm misunderstanding the bug you're talking about, please educate me.

Sorry, English is not my mother tongue, please bear with me

Your English is totally fine, and certainly much better than my Chinese! :-)

devinor commented 1 year ago

Normally

Function emcmotCommandHandler is called periodically. If has no new command, emcmotCommand->head will equal to emcmotCommand->tail (command.c line 393). Then Motion will check new commmand (command.c line 397) If emcmotCommand->commandNum equal emcmotStatus->commandNumEcho, means no new command. The function will return and do nothing.

My doubts

When Task did not send a command, Motion has executed command.c line 393, and ready to execute command.c line 397.

In this time, If Task send a command, and execute *emcmotCommand = *c;, Motion maybe check emcmotCommand->commandNum is not euqal emcmotStatus->commandNumEcho, because the copying process has started but not finished.

Motion will mistakenly think there are new command, and will handle new command (command.c line 399), with the copying process not finished.

Your English is totally fine, and certainly much better than my Chinese! :-)

Thank you for your enthusiasm and patience

SebKuzminsky commented 1 year ago

Ah, I finally get what you're saying, I see the race condition you're describing now:

  1. Motion enters emcmotCommandHandler, executes line 393 and finds head == tail so it knows that emcmotCommand is not in the middle of an update.
  2. Task enters usrmotWriteEmcmotCommand and begins executing line 95 (*emcmotCommand = *c;) to copy a new command into emcmotCommand, but does not finish, so emcmotCommand is in an inconsistent, partially updated state.
  3. Motion continues executing emcmotCommandHandler and gets to line 397, where it checks emcmotCommand->commandNum and finds the new commandNum (though the rest of the struct is inconsistent, because it's still in the process of being copied by Task).
  4. Motion starts processing the inconsistent emcmotCommand, executing a buffer containing part of the previous command and part of the new command.

Yes, that's terrible!

Thank you for your patience in explaining the problem to me.

I think your proposed fix in #2412 does not fix the problem... I think it prevents Motion from ever processing any new commands.

Perhaps emcmotCommandHandler should start by making a copy of whatever is in emcmotCommand, right at the very top, and then use its unchanging internal copy for all consistensy checks and processing?

SebKuzminsky commented 1 year ago

Hmm, no, simply making a copy in Motion won't fix the race:

  1. Task starts copying the new command into emcmotCommand, does the head and the commandNum but not the rest yet.
  2. Motion starts copying the new command from emcmotCommand, does head and commandNum and everything else, exept for tail, so its copy is mostly the old command.
  3. Task finishes copying into emcmotCommand, finally updating tail.
  4. Motion finishes copying from emcmotCommand, and gets the now-updated tail, even though the middle of the buffer is old.

In this case Motion would think it had a valid snapshot, but it's really corrupt.

I think the goal of the original authors was to have a lockless communication mechanism, but hear me out here... What if:

  1. Task protects writing to emcmotCommand by locking a mutex.
  2. Motion protects reading from emcmotCommand by trylocking the mutex. If it fails it means the mutex is locked by Task, which means Task is busy updating emcmotCommand. In this case Motion returns, and checks again on the next invocation (just like the current head/tail code tries unsuccessfully to do).
devinor commented 1 year ago

I think your proposed fix in https://github.com/LinuxCNC/linuxcnc/pull/2412 does not fix the problem... I think it prevents Motion from ever processing any new commands.

I'm working full-time at the moment, I haven't looked at your solution for a while. Please allow me to introduce my solution. The code snippet below is my solution.

void emcmotCommandHandler(void *arg, long servo_period)
{
    // .......

    if (emcmotCommand->commandNum != emcmotStatus->commandNumEcho) {
      return;
    }

    /* check for split read */
    if (emcmotCommand->head != emcmotCommand->tail) {
    emcmotInternal->split++;
    return;         /* not really an error */
    }
    if (emcmotCommand->commandNum != emcmotStatus->commandNumEcho) {
    /* increment head count-- we'll be modifying emcmotStatus */
    emcmotStatus->head++;
    // handle new command
    ......
  }
}

Step1

if (emcmotCommand->commandNum != emcmotStatus->commandNumEcho) {
  return;
}

If emcmotCommand->commandNum equal emcmotStatus->commandNumEcho, means not has new command, emcmotCommandHandler don't need to continue.

If emcmotCommand->commandNum not equal emcmotStatus->commandNumEcho, means has new command(copy has started),

need to check the complete of the new command (check copy finished or not )

Step 2

if (emcmotCommand->head != emcmotCommand->tail) {

chekc the complete of the new command, if copy has been finished, ready to handle the command, if copy has not been finished, return, and check in next period

SebKuzminsky commented 1 year ago

Have you run your code? Does it work? The new check you added at the top seems backwards to me:

 if (emcmotCommand->commandNum != emcmotStatus->commandNumEcho) {
      return;
 }

If the command's commandNum is not the same as the status commandNumEcho, then return. So if a new command (which will have a new commandNum) comes in, it returns. Next time the check works the same way and it returns. It'll never process new commands.

I'm going to assume that you meant the opposite test: if the new command commandNum is equal to the status commandNumEcho (indicating it's a command we've already processed), then return. If the command numbers are different (indicating a new command), then continue with the function.

That seems like it should work.

devinor commented 1 year ago
void emcmotCommandHandler(void *arg, long servo_period)
{
    // .......

    if (emcmotCommand->commandNum != emcmotStatus->commandNumEcho) {
      return;
    }

    /* check for split read */
    if (emcmotCommand->head != emcmotCommand->tail) {
  emcmotInternal->split++;
  return;         /* not really an error */
    }
    if (emcmotCommand->commandNum != emcmotStatus->commandNumEcho) {
  /* increment head count-- we'll be modifying emcmotStatus */
  emcmotStatus->head++;
  // handle new command
  ......
  }
}

Sorry, I made a stupid mistake, sorry to waste your precious time.

code need to be modifyed as follow

     if (emcmotCommand->commandNum == emcmotStatus->commandNumEcho) {
       return;
     }
SebKuzminsky commented 1 year ago

I'm going to clean up the mutex branch and merge it to 2.9, per a hackfest committee decision.

SebKuzminsky commented 1 year ago

Fixed by #2479

devinor commented 1 year ago

Thanks. @SebKuzminsky 👍