cms-gem-daq-project / ctp7_modules

0 stars 13 forks source link

Possible Bug/Unexpected Behaviour: GBT Phase Scan Will Always Have "error" Key #84

Closed bdorney closed 4 years ago

bdorney commented 5 years ago

Brief summary of issue

I think the GBT phase scan:

https://github.com/cms-gem-daq-project/ctp7_modules/blob/78bfc744a7ba1cdb9c3d320bcbf5585c38553ffc/src/gbt.cpp#L62-L89

Will always insert the error key into the RPC response. We will be varying the GBT phase here, and this will cause communication loss on VFATs on the corresponding e-link. When trying to perform the last check (slow control on that VFAT) this will cause a bus error and the error key will be inserted into the RPC response:

https://github.com/cms-gem-daq-project/ctp7_modules/blob/78bfc744a7ba1cdb9c3d320bcbf5585c38553ffc/src/utils.cpp#L271-L275

In this case this would not be an error; but expected from the procedure.

@lpetre-ulb

Types of issue

Expected Behavior

The error key should not be inserted into the RPC response when running a GBT phase scan when trying to read the CFG_RUN register.

Current Behavior

The error key will be inserted into the RPC response when trying to read CFG_RUN register.

Possible Solution (for bugs)

Is it possible to:

  1. Before reading CFG_RUN register check if the error key exists in the RPC message,
  2. Repeat step #1 after reading the CFG_RUN register.
  3. If error key exists, remove it...(although not sure if possible).

Alternatively calling code of gbtPhaseScan() should ignore the presence of error key if the message contained by that key contains the string CFG_RUN (or any case variation there of...)?

Context (for feature requests)

Phase scans place error key in the RPC response and either motivate calling functions to ignore this key or cause problems

Your Environment

jsturdy commented 5 years ago

Are you saying that the first snippet will cause the error key to be set in the second snippet (because of setting an improper phase at some point in the scan)? The second snippet is not optimal currently and has been the subject of some discussion of improvement (e.g., exceptions), as it sets a not-guaranteed-to-be-invalid response word, and also doesn't set the error key.

I would propose we open one of the issues that wer discussed in #72 re exception handling, and try to find the optimal way forward. Once a solution on that front is found, the scan algorithm can be adapted to incorporate any changes and ensure that the result is always a good phase or and error message. Probably the best way will be to implement the insertion of the error keyword in the highest level possible, either the RPC callback itself, or the function that the RPC callback directly calls, but I think we'll need to ruminate on this properly first.

bdorney commented 5 years ago

Are you saying that the first snippet will cause the error key to be set in the second snippet (because of setting an improper phase at some point in the scan)?

Yes lines 65-67 will inevitably set a bad phase. Then the read at line 82 will in the present implementation cause the error key to be set (second snippet).

Once a solution on that front is found, the scan algorithm can be adapted to incorporate any changes and ensure that the result is always a good phase or and error message.

The idea is to see both the good and bad phases. So if a phase is bad (which will in current implementation generate a bus error) it should not cause the error key to be inserted at line 82. That's the desired outcome.

I would propose we open one of the issues that wer discussed in #72 re exception handling...I think we'll need to ruminate on this properly first.

Yes we should open an issue (@lpetre-ulb?) some iteration will of course be needed.

In the meantime in https://github.com/cms-gem-daq-project/xhal/pull/99/ I would suggest that lines check on the error key (that my comment here is referencing) be surpassed fro the short term.

mexanick commented 5 years ago

Setting an error key while writing or reading is not a problem itself. It does not throw the exception and does not break the RPC connection. The caller can read the error message and act in a different way. For this particular case you can either simply ignore it or assign some status variable to say "this phase is bad". Another possible workaround can be implementing new method remove_key in RPCMsg and then when an undesired error setting is expected, the error key can be removed.

jsturdy commented 5 years ago

OK, moved to #85, however, https://github.com/cms-gem-daq-project/ctp7_modules/blob/78bfc744a7ba1cdb9c3d320bcbf5585c38553ffc/src/utils.cpp#L272-L272 is commented, so I don't think that in the current implementation, there is an operational problem

lpetre-ulb commented 5 years ago

Indeed, the current implementation works (I regularly perform phase scans with the commits you referred to during ULB GEB&OH quality control). I admit that I have not checked the readReg source code, but since it is currently commented, the error key is not set. However, as this behavior might change in the future, I think it is good to keep this potential issue in mind.

For a possible solution, I'm not convinced that removing the a RPC key is ideal. Couldn't we add an overload of this function which would not return an error, but would return the number of words read (C-like read function). The current function would still exist and could maybe throw an exception. But it might be best to discuss about that in https://github.com/cms-gem-daq-project/ctp7_modules/issues/85.

lpetre-ulb commented 4 years ago

Won't fix. This is not an issue in the legacy code and the templated RPC code is going to use exceptions.