[cleanup] Common blocks related to VFAT sync checking

Brief summary of issue

As started in #160 and expanded in #162, this is another cleanup operation, related to the following types of block (updated for new framework, but functionally unchanged):

  uint32_t goodVFATs = vfatSyncCheck(ohN);
  uint32_t notmask = ~mask & 0xFFFFFF;
  // FIXME, rather than this block, override mask, but the warning message was already sent, how to ensure it's noticed?                                                                                                                                                                                                                                                     
  if ((notmask & goodVFATs) != notmask) {
    std::stringstream errmsg;
    errmsg << "One of the unmasked VFATs is not sync'd: "
           << "goodVFATs: 0x" << std::hex << std::setw(8) << std::setfill('0') << goodVFATs << std::dec
           << "\tnotmask: 0x" << std::hex << std::setw(8) << std::setfill('0') << notmask << std::dec;
    throw std::runtime_error(errmsg.str());
  }

I was looking to see if this entire check could be moved into an updated vfatSyncCheck(ohN, vfatMask), however, we will probably always want vfatSyncCheck to return the syncMask, even in the event that some VFATs are not sync'd. This means that we can't throw the exception from vfatSyncCheck since we need the return value (goodVFATs). If this is the case, then we still would have to put this check into every function that wants to make the promise that it operated on all specified VFATs...

The second place this type of check shows up is inside loops over the VFATs

  for (size_t vfatN=0; vfatN < oh::VFATS_PER_OH; ++vfatN) {
    // Check if vfat is masked                                                                                                                                                                                                                                                                                                                                               
    if (!((notmask >> vfatN) & 0x1)) {
      continue;
    }

    // Check if vfatN is sync'd                                                                                                                                                                                                                                                                                                                                              
    uint32_t goodVFATs = vfatSyncCheck(ohN);
    if (!((goodVFATs >> vfatN) & 0x1)) {
      std::stringstream errmsg;
      errmsg << "The requested VFAT is not sync'd: "
             << "goodVFATs: 0x"       << std::hex << std::setw(8) << std::setfill('0') << goodVFATs << std::dec
             << "\t requested VFAT: " << vfatN
             << "\tvfatMask: 0x"      << std::hex << std::setw(8) << std::setfill('0') << vfatMask << std::dec;
      throw std::runtime_error(errmsg.str());
    }

Here I would raise the question of whether the call to vfatSyncCheck needs to be inside the loop or not (though it may vary on a case-by-case basis, depending on what is happening inside the loop)

Types of issue

[x] Request for comment

I'm wondering if a similar function/functor couldn't be used for automatic VAT mask determination instead of using a vfatMask parameter. One of the current drawback is that a simple VFAT mask hardly applies to multi-link methods. It would have to be a list of mask since they can be different between links. A dense VFAT masks vector would also work, removing the need of giving an OH mask.

However, I think it should be possible to drop the VFAT masks parameter altogether without sacrificing the functionalities. If the VFATs are included into a scan only if both the SYNC_ERR_CNT is 0 and the CFG_RUN register is 1, we should be able to implement the current functionalities:

Unsync'd or unstable VFATs wouldn't be scanned.
Skipping a VFAT can be done by setting it out of run mode.

Since the number of scanned VFATs wouldn't be known in advance, the calling code would have to be resilient enough. I think it is already the case.

Regarding you explicit question, I believe the VFAT synchronization should checked once outside of the loop in the general case. If the synchronization is lost during the scan, an exception (i.e. failing read or write) would be thrown anyway.

@lpetre-ulb there's such function: https://github.com/cms-gem-daq-project/ctp7_modules/blob/c80e01207f21afd768758cc51e8d5994496a31df/src/vfat3.cpp#L34 and https://github.com/cms-gem-daq-project/ctp7_modules/blob/c80e01207f21afd768758cc51e8d5994496a31df/src/vfat3.cpp#L19-L32 The vfatmask parameter is required in case you want to force it (e.g. scan only certain chips instead of all of them, which may lead to much longer scan time or to code crash because of bad sync)

The vfatmask parameter is required in case you want to force it (e.g. scan only certain chips instead of all of them, which may lead to much longer scan time or to code crash because of bad sync)

@mexanick, indeed. However, this does not mean that automatic VFAT mask is impossible. My proposal was to implement a new RPC method like this one:

std::uint32_t getVFATMask::operator()(std::uint32_t ohMask) const;

which does the following:

Checks for bad VFAT synchronization. Who would want to scan an improperly/unstable VFAT? It will only make the scan fail...
Checks that the VFAT is in run mode. Scanning a VFAT out of run mode is useless most of (all?) the time. If one wants to skip a VFAT, it is a simple as setting it out of run mode. I also strongly believe that the RPC modules shouldn't turn on and off VFATs. It is currently done is some methods, but totally inconsistently.

With this proposal, the scans would be performed only on properly synchronized VFATs which are in run mode. If a script needs to force a VFAT mask, it could use a setVFATMask() method which would check the VFAT synchronization and then set them in run mode. In the case where it is impossible (bad synchronization), this method can throw. Note that this latter method is not required and can already be implemented with the ones which already exist.

I totally agree with your proposal of new function/functor. In some sort this functionality is already implemented as I pointed above (+ some stuff in calibration routines), but the implementation could be sparse and inconsistent. I.e. my point was that new development isn't required, but rather refactoring is needed. Considering "setting VFATs on/off" - can you elaborate why you're strongly against it? Indeed, scanning a turned off VFAT doesn't make sense, however ignoring in a scan (for whatever reason) a VFAT in run mode might be dangerous at times due to possibility of the signal mirroring/xtalk. It is possible that I'm wrong on this, but this has to be carefully considered.

I want to flesh out three scenarios:

production: expectation is that all VFATs should be in a good state <--> default action
lab: possibly missing actual VFATs, but all other VFATs expected to be in a good state
test: possibly missing actual VFATs, don't really care if all other VFATs are in a good state, work with what is available

Whatever solution we end up with will need to be able to handle (at least) these modes. So if we pass a VFAT mask as we currently do, this would apply to missing or known problem VFATs, but something else must control whether the automatic determination can override this by removing more VFATs (it should probably never add VFATs)

Considering "setting VFATs on/off" - can you elaborate why you're strongly against it? Indeed, scanning a turned off VFAT doesn't make sense, however ignoring in a scan (for whatever reason) a VFAT in run mode might be dangerous at times due to possibility of the signal mirroring/xtalk. It is possible that I'm wrong on this, but this has to be carefully considered.

I must admit that I focused on the software design more than on the ASIC itself. I also don't know if setting the VFATs out of run mode actually protect them. However, my proposal is to scan only, but all, the VFATs in run mode and without synchronization error (that should be implied). So, the proposal matches your requirement.

One of the issues with turning on and off the VFATs is that this must be done very carefully within the RPC modules. The VFATs should be in the state state at the end of a scan. Another issue is that related to scripts and/or procedures involving multiple scans. What would be the rationale for turning on and off the VFATs between each scan? Since it was shown that the noise level changes when you set in and off of run mode the VFATs, I think it is preferable to keep them in run mode during the trimming for example. Brian also observed unexpected behavior after setting a VFAT in run mode:

https://github.com/cms-gem-daq-project/ctp7_modules/blob/c80e01207f21afd768758cc51e8d5994496a31df/src/calibration_routines.cpp#L1213

So, I believe this should be the responsibility of the script and/or online software.

In fact, this is part of the bigger question of the front-end state management. For me, the RPC scan functions should modify as little as possible the front-end state and always return it configured as it was before the scan. In my opinion, it is the easiest and safest way to track the state of the front-end and keep it consistent between scans. A scan RPC method is not a configuration RPC method.

Therefore, the methods could be split into two sets: the ones which modify the front-end state (biasAllVFATs or findDACValue if it is ever implemented) and those which do not (scanDACs for example). All the scans being in the latter set.

I want to flesh out three scenarios:

production: expectation is that all VFATs should be in a good state <--> default action

[...]

Whatever solution we end up with will need to be able to handle (at least) these modes. So if we pass a VFAT mask as we currently do, this would apply to missing or known problem VFATs, but something else must control whether the automatic determination can override this by removing more VFATs (it should probably never add VFATs)

Okay, so follow-up question on the production scenario. What should happen if a VFAT is not in a good state? There is already one VFAT with the issue at P5; hopefully it will be resolved :crossed_fingers:. But, can it be guaranteed that this will never happen after commissioning? Should the behavior be best effort or crash the scan (as soon as possible)?

I vote for "best effort" - we don't want to loose a precious operations (and completely valid information we are getting from "good-behaving" parts of the front-end) because some (hopefully small) portion of the system went into a wrong state.

Okay, so follow-up question on the production scenario. What should happen if a VFAT is not in a good state? There is already one VFAT with the issue at P5; hopefully it will be resolved crossed_fingers. But, can it be guaranteed that this will never happen after commissioning? Should the behavior be best effort or crash the scan (as soon as possible)?

For production, we will be operating under the assumption that if we say a VFAT is working, then for all intents and purposes, the VFAT should work, and if there is a problem, an error condition should be raised. I.e., if we're configuring for global data taking and one of the VFATs is not in sync, either we take an automatic action to correct this (we should be considering this), or we go to an error state, such that an updated configuration could be made to mask this VFAT. We definitely want to avoid the situation where the SW has a problem with some HW and just carries on, ignoring that HW, even if that information finds its way into a log somewhere. It's critical that if we end up running in a "degraded" state, this is clear and obvious in both the SW as well as to shifter/expert.

These sorts of operational and functional requirements need to be considered at every step in the design process.

N.B. also during operations, we'll be getting periodic HardReset signals which will reconfigure the front end settings, so we may see things get recovered by this, if a problem develops during a run. Again, the whole chain of operation needs to be taken into account when determining how the SW should behave

For production, we will be operating under the assumption that if we say a VFAT is working, then for all intents and purposes, the VFAT should work, and if there is a problem, an error condition should be raised. I.e., if we're configuring for global data taking and one of the VFATs is not in sync, either we take an automatic action to correct this (we should be considering this), or we go to an error state, such that an updated configuration could be made to mask this VFAT. We definitely want to avoid the situation where the SW has a problem with some HW and just carries on, ignoring that HW, even if that information finds its way into a log somewhere. It's critical that if we end up running in a "degraded" state, this is clear and obvious in both the SW as well as to shifter/expert.

I fully agree on the automatic recovering. However, should the system go to an error state in case, for example, a single VFAT looses its synchronization? Ot should it be only notified to the shifter/expert while continuing to take data. One could consider to use a best effort strategy, while reporting any error to the relevant people and moving to a degraded/error state. I'm sure there are already many parameters other than VFAT synchronization to monitor during operation.

Automatically masking the VFATs at the RPC module level is maybe not the way to go in this case. However, I'm convinced that the calling software should be able to retrieve which VFAT had an issue. Then, it could mask the VFAT (if it is the appropriate action) and move to an error/degraded state while continuing to take data with the remaining healthy system.

These sorts of operational and functional requirements need to be considered at every step in the design process.

N.B. also during operations, we'll be getting periodic HardReset signals which will reconfigure the front end settings, so we may see things get recovered by this, if a problem develops during a run. Again, the whole chain of operation needs to be taken into account when determining how the SW should behave

Hence, my question. ;)

I also believe there is not a single good answer. For example, in the CTP7 firmware, a VFAT is automatically masked from the DAQ event builder if it is detected as out-of-sync. This avoids timeouts which reduce the bandwidth and possibly other damaging consequences (e.g. broken event building).

Regarding the implementation of your production vs test stand requirement, I guess that we cannot make the RPC modules stateful and that we should also avoid using any configuration file?

Therefore, the simplest way is probably to not perform any automatic VFAT mask determination in the RPC module (and maybe not even a synchronization check) and rely on the higher level software to implement the desired behavior.

There would be 3 possible parameter sets:

ohN, vfatN
ohN, vfatMask
A list of vfatMask for multi-link methods since the VFAT masks can be different.

I've excluded (ohMask, vfatN), because it does not make much sense to me.

cms-gem-daq-project / ctp7_modules

[cleanup] Common blocks related to VFAT sync checking #163

Brief summary of issue

Types of issue