cms-gem-daq-project / ctp7_modules

Discussion: Software Routine for Determining Proper SBIT Timing Parameters #61

Open bdorney opened 5 years ago

bdorney commented 5 years ago

Brief summary of issue

So we have seen that we have an issue with the sbit mapping in V3 electronics. This issue persists when using GEBv3c+OHv3c hardware:

While the situation with complete v3c hardware is improved, it is still not what we want. Additionally, this is just for the short detector, and we will also need a set of parameters for the long detector. Then for GE2/1 there will be 8 sets of parameters, and ME0 will contribute another set. So we need a software routine that can automatically determine the correct set of timing registers.

The registers of interest are:

According to @andrewpeck

  • {X} is the Optohybrid number, which is determined by the CTP7 fiber mapping.
  • {Y} is the VFAT number in software units.
  • {Z}, in the same convention that Tuomas explained, is TXD_{Z}. The firmware calls Z=0 s-bits 0-7 (corresponding to VFAT channels 0-15), Z=1 is s-bits 8-15 (corresponding to VFAT channels 16-31) and so on.

The 100-pin panasonic connector looks like:

[Image: geb-v3b-100pin]

The convention for the trigger unit that Tuomas has explained (@andrewpeck's email above) is shown as:

[Image: geb_trigger_layout_100_pin_panasonic]

So GEM_AMC.OH.OHX.FPGA.TRIG.TIMING.TAP_DELAY_VFATY_BITZ follows the hardware.
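As a minimal sketch of the naming convention described above, the register name can be built from the OH number, the VFAT number, and the pair index Z. The helper names `sbitToPair` and `tapDelayRegName` are hypothetical and do not exist in ctp7_modules; the only inputs taken from the discussion are the register pattern and the rule that pair Z carries s-bits 8Z to 8Z+7.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Index Z of the differential pair (TXD_{Z}) that carries a given s-bit:
// Z=0 carries s-bits 0-7, Z=1 carries s-bits 8-15, and so on.
inline uint32_t sbitToPair(uint32_t sbit)
{
    assert(sbit < 64);  // 8 pairs x 8 s-bits per VFAT
    return sbit / 8;
}

// Build the tap-delay register name for OH X, VFAT Y, pair Z
// (hypothetical helper, following the pattern quoted in the text).
inline std::string tapDelayRegName(uint32_t ohN, uint32_t vfatN, uint32_t pairZ)
{
    return "GEM_AMC.OH.OH" + std::to_string(ohN) +
           ".FPGA.TRIG.TIMING.TAP_DELAY_VFAT" + std::to_string(vfatN) +
           "_BIT" + std::to_string(pairZ);
}
```

For example, s-bit 8 of VFAT14 on OH2 would map to `tapDelayRegName(2, 14, sbitToPair(8))`, i.e. the `..._VFAT14_BIT1` register.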

We already have one tool that checks the mapping:

https://github.com/cms-gem-daq-project/ctp7_modules/blob/e1d9d0c52a9bd5ffae96e99706ffd12b6ba2809b/src/calibration_routines.cpp#L963

I would be against modifying this tool to try to correct the mapping, since if you modify the 4 registers above incorrectly you can affect not just the Z^th bit but all 8 SBITs, due to how the OH is expecting them. So what I would propose is the following procedure:

  1. When deploying new firmware or using new hardware for the first time, the checkSbitMappingWithCalPulseLocal() function should be used to check that the sbit mapping is correct (this is easily done with checkSbitMappingAndRate.py),
  2. Analyze this data with anaSBitMonitor.py; this produces a list of mis-mapped sbits (see example here),
  3. Use this list as input for some new function correctSBitMappingErrorsLocal(...),
  4. Apply corrected timing and inverted register settings determined from analysis of correctSBitMappingErrorsLocal(...),
  5. Check the mapping is now correct with another call of checkSbitMappingAndRate.py.

The correctSBitMappingErrorsLocal(...) would call checkSbitMappingWithCalPulseLocal(), with a small event count, after making modifications so this could eliminate step 5 above.

Types of issue

Expected Behavior

How I expect correctSBitMappingErrors(...) and correctSBitMappingErrorsLocal(...) to function. General flow is shown below.

Unless otherwise noted for the code that will be added to ctp7_modules this should be placed in calibration_routines.h and calibration_routines.cc.

The calling function on the DAQ Machine

This is a new development and I'm not sure if the calling function should be created in the legacy xhal branch, or if it should be placed in cmsgemos (in my eyes this is a calibration routine, so it doesn't really fit in a HwDevice; some input from @mexanick and @jsturdy would be appreciated here).

However, the general overview should be something like:

Input parameters should be:

Example table format is something like:

vfatN  vfatSBIT  SBIT_Size  N_Mismatches
14     8         0          3428
14     8         1          2326
14     8         2          989
14     8         3          489
14     8         4          257
14     63        0          366
16     16        7          25200
16     40        7          25200
17     0         7          25200
17     16        0          4

Here any sbit with N_Mismatches beyond 25k can be assumed to have an inverted polarity but this strongly depends on the event count used when checkSbitMappingAndRate.py was called.
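A minimal sketch of how such a table could be split into the two categories. The struct and function names are hypothetical, and the threshold is deliberately a parameter because, as noted above, the right value depends on the event count used in the scan.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One row of the mismatch table (field names follow the table header).
struct MismatchRow {
    uint32_t vfatN;
    uint32_t vfatSBIT;
    uint32_t sbitSize;
    uint32_t nMismatches;
};

struct Classified {
    std::vector<MismatchRow> inverted;  // candidates for a polarity inversion
    std::vector<MismatchRow> mistimed;  // candidates for a tap-delay correction
};

// Rows with more mismatches than the threshold are treated as inverted;
// everything else is treated as a timing problem.
inline Classified classify(const std::vector<MismatchRow>& rows,
                           uint32_t invertedThreshold)
{
    Classified out;
    for (const auto& row : rows) {
        assert(row.vfatSBIT < 64);
        if (row.nMismatches > invertedThreshold)
            out.inverted.push_back(row);
        else
            out.mistimed.push_back(row);
    }
    return out;
}
```

With a threshold of 25000, the three rows with 25200 mismatches in the table above land in `inverted` and the remaining rows in `mistimed`.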

Outline of correctSBitMappingErrors(...)

Here we are getting the information from the RPC request, and it falls in two categories:

For the first case (wrong timing) we will construct a std::map<std::string,std::vector<uint32_t> > from the input MappingVFATN keys:

  1. For OH X of interest, loop over all 24 VFATs,
  2. For each vfatN check if a key "MappingVFATN" exists in the rpc message,
  3. If this key exists it gets a std::vector of mis-mapped sbits from the get_word_array function,
  4. This vector is stored in a std::map where the key is "MappingVFATN" or just "VFATN" for simplicity

Similarly we should construct a second map (as above) from the "InvertedVFATN" keys.
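The map construction could be sketched as below. `FakeRequest` is a hypothetical stand-in for the RPC request object so that the example is self-contained; its `get_key_exists` method and the by-value `get_word_array` signature are assumptions about the RPC message interface, not confirmed API. `collectPerVFAT` is likewise a hypothetical helper name.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for the RPC request (assumed interface).
struct FakeRequest {
    std::map<std::string, std::vector<uint32_t>> arrays;
    bool get_key_exists(const std::string& key) const { return arrays.count(key) != 0; }
    std::vector<uint32_t> get_word_array(const std::string& key) const { return arrays.at(key); }
};

// Collect "<prefix>VFATN" arrays from the request into a map keyed by "VFATN".
// Use prefix "Mapping" for mis-timed sbits and "Inverted" for inverted ones.
template <typename Request>
std::map<std::string, std::vector<uint32_t>>
collectPerVFAT(const Request& req, const std::string& prefix, uint32_t nVFATs = 24)
{
    assert(!prefix.empty());
    std::map<std::string, std::vector<uint32_t>> out;
    for (uint32_t vfatN = 0; vfatN < nVFATs; ++vfatN) {
        const std::string vfatKey = "VFAT" + std::to_string(vfatN);
        if (!req.get_key_exists(prefix + vfatKey)) continue;  // nothing reported
        out[vfatKey] = req.get_word_array(prefix + vfatKey);
    }
    return out;
}
```

VFATs without a key in the request simply do not appear in the resulting map, so the caller can treat a missing entry as "no known problems".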

This should then get the vfatmask using:

https://github.com/cms-gem-daq-project/ctp7_modules/blob/e1d9d0c52a9bd5ffae96e99706ffd12b6ba2809b/src/amc.cpp#L42
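Once the mask is available, the VFATs to process could be enumerated as in this minimal sketch. It assumes that a set bit in the mask means "masked" (skip this VFAT); `unmaskedVFATs` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Return the VFAT positions whose mask bit is 0.
// Assumption: bit N set in vfatMask means VFAT N is masked.
inline std::vector<uint32_t> unmaskedVFATs(uint32_t vfatMask, uint32_t nVFATs = 24)
{
    assert(nVFATs <= 32);  // one mask bit per VFAT
    std::vector<uint32_t> out;
    for (uint32_t vfatN = 0; vfatN < nVFATs; ++vfatN) {
        if ((vfatMask >> vfatN) & 0x1) continue;  // masked, skip
        out.push_back(vfatN);
    }
    return out;
}
```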

It should then loop over all unmasked vfats and for each iteration it should call the local function and use the constructed maps as input. The local function correctSBitMappingErrorsLocal() should take the following input parameters:

The local function could then return for this vfatN a std::map<std::string, std::vector<uint32_t> > whose keys are:

std::map<std::string, uint32_t > map_sotTapDelay; //key here is `VFATN`, stores at most MAX_VFAT number of values (24 for GE1/1).
std::map<std::string, std::vector<uint32_t> > map_vfatTapDelay; //key here is `VFATN`, each vector has 8 elements
std::map<std::string, uint32_t> map_vfatInvert; //key here is `VFATN`, stores at most MAX_VFAT number of values (24 for GE1/1).

After everything is said and done there should be a read of SOT_INVERT and this should be placed in the RPC response as a data word.

Then the three final maps (map_sotTapDelay, map_vfatTapDelay, map_vfatInvert) should be looped over (they will all have the same keys so one loop is sufficient) and stored in the RPC response, e.g.:

for (int vfat = 0; vfat < 24; ++vfat) {
   // All three maps share the same `VFATN` keys
   std::string strVFAT = stdsprintf("VFAT%i",vfat);
   rsp.set_word(stdsprintf("SOT_TAP_DELAY_VFAT%i",vfat),map_sotTapDelay[strVFAT]);
   // One array per VFAT holding the 8 per-pair tap delays
   rsp.set_word_array(stdsprintf("TAP_DELAY_VFAT%i_BITS",vfat),map_vfatTapDelay[strVFAT]);
   rsp.set_word(stdsprintf("VFAT%i_TU_INVERT",vfat),map_vfatInvert[strVFAT]);
}

The function on the DAQ machine now has the correct configuration for this link.

Outline of correctSBitMappingErrorsLocal(...)

The local function will then be where the actual "meat" of the algorithm is done. This function should look like:

std::map<std::string,std::vector<uint32_t> > correctSBitMappingErrorsLocal(int ohN, int vfatN, std::vector<uint32_t> mismappedSBits, std::vector<uint32_t> invertedSBITs, bool correctMapping)

This function at the end should always read the following registers:

This could be done by having a dedicated RPC method in vfat3.h/vfat3.cc (for reading one VFAT) and optohybrid.h/optohybrid.cc (for reading all VFATs) and the one in vfat3.h should be called by correctSBitMappingErrorsLocal.

It should only try to correct the mapping if correctMapping is true.

First we should loop over those members of invertedSBITs and write the corresponding bits in VFATY_TU_INVERT. This should be done by:

  1. Determining which 24TU_TXD_P<N> and 24TU_TXD_N<N> pair the i^th element of invertedSBITs refers to using the convention @andrewpeck illustrates above.
  2. Then flip the bit in VFATY_TU_INVERT that corresponds to this 24TU_TXD_P<N> and 24TU_TXD_N<N> pair,
    • Note you need to track if a bit has already been flipped since there could be multiple elements in invertedSBITs that will share this pair and all need to be flipped, so once you flip the bit the first time, any other elements of invertedSBITs that correspond to this pair should not cause the bit to be flipped again
  3. Repeat steps 1 & 2 for all elements of invertedSBITs.
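The flip-once bookkeeping in the steps above can be sketched as follows. This assumes that bit Z of VFATY_TU_INVERT controls the pair carrying s-bits 8Z to 8Z+7; that bit layout is an assumption for illustration and should be checked against the firmware address table. `flipInvertedPairs` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Return the new VFATY_TU_INVERT value after flipping, exactly once, the bit
// of every pair referenced by invertedSBITs.
inline uint32_t flipInvertedPairs(uint32_t tuInvert,
                                  const std::vector<uint32_t>& invertedSBITs)
{
    std::set<uint32_t> flipped;  // pairs already handled
    for (uint32_t sbit : invertedSBITs) {
        assert(sbit < 64);
        const uint32_t pair = sbit / 8;
        // insert() reports whether the pair was new; skip repeats so that
        // two s-bits on the same pair do not undo each other.
        if (!flipped.insert(pair).second) continue;
        tuInvert ^= (1u << pair);
    }
    return tuInvert;
}
```

For instance, s-bits 16 and 17 share pair 2, so `flipInvertedPairs(0x00, {16, 17, 40})` toggles only bits 2 and 5.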

After this you should call:

https://github.com/cms-gem-daq-project/ctp7_modules/blob/e1d9d0c52a9bd5ffae96e99706ffd12b6ba2809b/src/calibration_routines.cpp#L946-L963

Care should be taken to construct the input arguments properly (see function documentation). Additionally you don't need a lot of events (nevts=10 is probably sufficient). Also using the calpulse in voltageStepPulse mode should be fine (e.g. useCurrentPulse = false). You then should analyze the outData container to see if any of the bits you flipped suffer from mis-mapping. To do this see this example:

For any new mismatches you find you should add these to mismappedSBits, e.g.:

mismappedSBits.push_back(outData[idxOfNewMisMatch]);

Now here is where the hard part is. For each element of mismappedSBits the delays should be such that:

Note I've suppressed the negative part of the pair. To ensure this you need to manipulate:

To accomplish this for the element of mismappedSBits. However other elements of mismappedSBits may share the same pair (e.g. 24TU_TXD_P<N> and 24TU_TXD_N<N> as the current element). So you should track which pairs you've already modified to prevent subsequent modification. Additionally, and more importantly, an element in mismappedSBits that is later on in the VFAT may be affected by your modification of an earlier bit. I would propose the following:

  1. Determine which differential pair the element of mismappedSBits comes from; this determines TAP_DELAY_VFATY_BITZ.
  2. Add 1 to this TAP_DELAY_VFATY_BITZ register (not sure of the size of this register, but you should stop at the max...),
  3. For all subsequent TAP_DELAY_VFATY_BITZ registers where Z_prime > Z also add 1.
  4. Call checkSbitMappingWithCalPulseLocal(...) with a low event count,
  5. Decode the outData and remove any element from mismappedSBits which is now correctly mapped, add any sbit that is now incorrectly mapped, and
  6. Repeat steps 1-5 until all sbits have the correct mapping.
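The loop proposed in steps 1-6 can be sketched as below. This is only an illustration of the proposal, not a validated algorithm: the mapping check is passed in as a callback standing in for a low-statistics call to checkSbitMappingWithCalPulseLocal(...), the maximum tap value is a parameter because the register size is not known here, and `tuneTapDelays` is a hypothetical name.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

using TapDelays = std::array<uint32_t, 8>;  // one delay per pair Z
// Stand-in for the cal-pulse check: returns the s-bits still mis-mapped.
using MappingCheck = std::function<std::vector<uint32_t>(const TapDelays&)>;

// Returns true once no mis-mapped s-bits remain, false if a delay hits maxTap.
inline bool tuneTapDelays(TapDelays& taps, std::vector<uint32_t> mismappedSBits,
                          const MappingCheck& check, uint32_t maxTap)
{
    while (!mismappedSBits.empty()) {
        assert(mismappedSBits.front() < 64);
        const uint32_t z = mismappedSBits.front() / 8;  // step 1: find pair
        if (taps[z] >= maxTap) return false;            // step 2: stop at max
        for (uint32_t zp = z; zp < taps.size(); ++zp)   // steps 2 & 3
            taps[zp] = std::min(taps[zp] + 1, maxTap);
        // Steps 4 & 5: re-check and take the fresh list, which both drops
        // fixed s-bits and picks up newly broken ones.
        mismappedSBits = check(taps);
    }
    return true;  // step 6: loop ended with everything mapped correctly
}
```

Every pass strictly increases at least one delay, so the loop always terminates, either with a clean mapping or by running into `maxTap`.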

Some input here from @andrewpeck is needed to see if the above makes sense, particularly steps 2 & 3. For Step 5 I would suggest to use the Erase-Remove Idiom; you can find examples on stackoverflow.
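For reference, a minimal sketch of the Erase-Remove Idiom applied to this case. `removeFixed` and `stillBad` are hypothetical names; `stillBad` stands for the list of s-bits that the latest check still reports as mis-mapped.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Drop from mismappedSBits every s-bit that is no longer reported as bad.
inline void removeFixed(std::vector<uint32_t>& mismappedSBits,
                        const std::vector<uint32_t>& stillBad)
{
    mismappedSBits.erase(
        // remove_if moves the elements to keep to the front and returns
        // the start of the leftover tail...
        std::remove_if(mismappedSBits.begin(), mismappedSBits.end(),
                       [&](uint32_t sbit) {
                           return std::find(stillBad.begin(), stillBad.end(), sbit)
                                  == stillBad.end();
                       }),
        // ...which erase then chops off in one call.
        mismappedSBits.end());
}
```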

Then afterward this function should read the following registers:

Store these in an std::map<std::string,std::vector<uint32_t> > and return it. The mapping should now be correct.

Current Behavior

You have to do the above by hand using gem_reg.py (bad).

Context (for feature requests)

The sbit mapping is wrong. We need a software solution to correct this for both GE1/1 and future upgrades (GE2/1 & ME0).

lpetre-ulb commented 5 years ago

May I suggest another solution for aligning the sbits ? I think we could find a firmware/software solution which is more robust (and faster) than a pure software routine.

Here is how I see things :

  1. Instead of using fixed delay taps, one could dynamically configure the delays in order to always sample the signal in the middle of the eye. We could implement a system similar to XAPP585, particularly per-bit deskew. It would be aligned in near realtime and could correct for voltage and temperature variations.

  2. Once we can reliably sample the time-multiplexed sbits, it would be possible to align all of them with a training phase. More specifically, configure the VFAT as follows : mask all channels except 0&1, 16&17, ... so the enabled sbits would be 0, 8, ... Set also the THR_ARM_DAC to a very low value (e.g. 0x1) in order to constantly measure noise. Therefore, on each sbit differential pair (and SOT also, I think), one would see 10000000. The signal is then aligned using a simple bitslip. This is also where we see if there is an inversion in the polarity and correct for it.

  3. Once the training phase is done, the VFAT can return to "normal" operating mode. The alignment can be continuously checked by looking at the SOT frame.
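The bitslip search in the training phase can be illustrated in software as below. In practice this logic would live in firmware; the sketch only shows the idea of rotating the sampled word until it matches the training pattern 10000000, trying the inverted word as well to detect a polarity swap. `Alignment`, `rotl8` and `findAlignment` are hypothetical names, and the rotation direction of a real bitslip is an assumption.

```cpp
#include <cassert>
#include <cstdint>

struct Alignment {
    bool found;         // true if some rotation matched the pattern
    uint32_t bitslips;  // number of one-bit rotations needed
    bool inverted;      // true if the match required inverting the word
};

// Rotate an 8-bit word left by n positions.
inline uint8_t rotl8(uint8_t w, uint32_t n)
{
    n &= 7;
    return static_cast<uint8_t>((w << n) | (w >> (8 - n)));
}

// Search for the rotation (and polarity) that maps the sampled word
// onto the expected training pattern.
inline Alignment findAlignment(uint8_t sampled, uint8_t expected = 0x80)
{
    // An all-zero or all-one pattern carries no alignment information.
    assert(expected != 0x00 && expected != 0xFF);
    for (uint32_t n = 0; n < 8; ++n) {
        const uint8_t slipped = rotl8(sampled, n);
        if (slipped == expected) return {true, n, false};
        if (static_cast<uint8_t>(~slipped) == expected) return {true, n, true};
    }
    return {false, 0, false};
}
```

A sampled word of 00010000 needs three slips, and 11101111 needs the same three slips plus a polarity inversion.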

I also think that such a solution can easily be ported to GE2/1 & ME0. In absence of an FPGA on the OH, step 1 should be done by the LpGBT (GBTX already phase aligns the data if I'm not mistaken). Step 2 should be done in the backend firmware.

As the correct sbit mapping is required for the new TDC with full granularity, I would not be able to reliably test the new TDC firmware before the sbit mapping issue is solved. I could try to implement the previously described solution next week.

andrewpeck commented 5 years ago

Hi Laurent,

Thanks for the feedback. I have some comments inline below:

On Thu, Oct 25, 2018 at 2:43 AM lpetre-ulb notifications@github.com wrote:

May I suggest another solution for aligning the sbits ? I think we could find a firmware/software solution which is more robust (and faster) than a pure software routine.

Here is how I see things :

  1. Instead of using fixed delay taps, one could dynamically configure the delays in order to always sample the signal in the middle of the eye. We could implement a system similar to XAPP585 https://www.xilinx.com/support/documentation/application_notes/xapp585-lvds-source-synch-serdes-clock-multiplication.pdf, particularly per-bit deskew. It would be aligned in near realtime and could correct for voltage and temperature variations.

The firmware right now uses dynamic configuration of the delays (based on https://www.xilinx.com/support/documentation/application_notes/xapp881_V6_4X_Asynch_OverSampling.pdf )

It centers the data inside the eye automatically, based on the SOT pulse which is received every clock. We wanted to phase length match the different S-bits coming from a VFAT so that in principle the same alignment state machine could be used for all 9 pairs coming from a single VFAT. The SOT would determine the timing and the corresponding S-bit traces would be automatically aligned to it because they have the same timing.

The phase alignment that was done on the PCBs, however, is not good, so there is some skew from channel to channel and we hoped to just correct that with fixed delays that simply align the S-bits coming from a single VFAT so that they are completely in sync with each other. Temperature drift and so on should affect all 9 pairs equally (at least within the tolerance of the very large 3.125ns eye).

So the process of timing in these delays should just need to be done once in the lab and we are over with it for the whole detector, and do not need special routines at the beginning of every hard reset. We did it already on v3a by hand and it worked well but nobody ever repeated the exercise on v3b and v3c where some positions have changed.

The belief underlying this is that there may be VFAT to VFAT variation, GEB to GEB variation, but that the variation within a single VFAT should be small and this mechanism just needs to keep them in phase +- a nanosecond or so (using 78ps tap delays) so there is actually a lot of slosh for things to be out of time. The big requirement of this system is that the different output channels of a single VFAT should be consistently timed in with each other when coming from the VFAT, which I really hope is true, and that the IODelays work more-or-less correctly within the slack acceptable by the sampling window (which they should, since they are calibrated by the chip).
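To put numbers on the slack mentioned above, here is a small sketch that converts a skew into tap counts using the 78ps tap size and the 3.125ns eye quoted in this comment. The constant and function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

constexpr double kTapPs = 78.0;    // tap delay step quoted above
constexpr double kEyePs = 3125.0;  // 3.125 ns eye quoted above

// Number of taps that best compensates a given skew (rounded to nearest).
inline uint32_t skewToTaps(double skewPs)
{
    assert(skewPs >= 0.0);
    return static_cast<uint32_t>(std::lround(skewPs / kTapPs));
}

// Number of whole taps that fit inside one eye.
inline uint32_t tapsPerEye()
{
    return static_cast<uint32_t>(kEyePs / kTapPs);
}
```

With these numbers one eye spans about 40 taps, and a 1ns skew corresponds to roughly 13 taps.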

Once we can reliably sample the time-multiplexed sbits, it would be possible to align all of them with a training phase. More specifically, configure the VFAT as follows : mask all channels except 0&1, 16&17, ... so the enabled sbits would be 0, 8, ... Set also the THR_ARM_DAC to a very low value (e.g. 0x1) in order to constantly measure noise. Therefore, on each sbit differential pair (and SOT also, I think), one would see 10000000. The signal is then aligned using a simple bitslip. This is also where we see if there is an inversion in the polarity and correct for it.

This is basically just what we are trying to do right now with the script that Brian described, except not as an automatic routine but just something to derive constants for the firmware. This is a possibility of course, to have automatic alignment using some calpulses but I wanted to try to get this working on the boards without co-dependent CTP7 firmware and software routines that I have no control over. It seemed to work fine but if we run into problems perhaps we reconsider whether something like this is needed.

Once the training phase is done, the VFAT can return to "normal" operating mode. The alignment can be continuously checked by looking at the SOT frame.

I also think that such a solution can easily be ported to GE2/1 & ME0. In absence of an FPGA on the OH, step 1 should be done by the LpGBT (GBTX already phase aligns the data if I'm not mistaken). Step 2 should be done in the backend firmware.

GBT does not phase align data. It has a similar fixed delay, and we do a hand-scan of phase values to find the window and then fuse a hard-coded sampling phase into the chip.

As the correct sbit mapping is required for the new TDC with full granularity, I would not be able to reliably test the new TDC firmware before the sbit mapping issue is solved. I could try to implement the previously described solution next week.

S-bit "mapping" only creates a rotation of the S-bits so that 01234567 becomes 70123456. The TDC just uses the OR of the entire VFAT, correct? In which case you should be able to proceed as is, right?
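A small sketch of why a pure rotation does not matter for an OR-based measurement: rotating the eight S-bits of a pair by one position changes which bit fires but not whether any bit fires. `rotateRight` and `vfatOr` are hypothetical helpers operating on a string of per-bit labels or '0'/'1' flags.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Rotate a sequence of S-bit labels to the right by n positions.
inline std::string rotateRight(std::string sbits, std::size_t n)
{
    assert(!sbits.empty());
    n %= sbits.size();
    // Rotating the reversed view to the left is a right rotation.
    std::rotate(sbits.rbegin(),
                sbits.rbegin() + static_cast<std::ptrdiff_t>(n),
                sbits.rend());
    return sbits;
}

// OR of a '0'/'1' hit pattern: true if any S-bit fired.
inline bool vfatOr(const std::string& pattern)
{
    return pattern.find('1') != std::string::npos;
}
```

Here `rotateRight("00010000", 1)` gives `"00001000"`, and `vfatOr` returns the same value for both patterns.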

Fyi, there are several unrelated problems that are sometimes referred to as "S-bit mapping" but many of them seem to perhaps be unrelated to the timing/mapping but do fall under the umbrella of problems with S-bits.

In the plots shown previously by Brian, for example, of GEB v3b:

None of these problems seem to be what I would expect from timing issues. Either the VFAT or the OH seems to just be broken in slots 16, 17, 22, 14, or bad solder joints, etc.. Polarity inversion could explain the issue on VFAT14.

On the GEB v3c, you can see perhaps a timing issue in VFAT18, VFAT8, VFAT0, VFAT11, VFAT3 but all of these problems would not be an issue if you are just using the OR of the VFAT for timing measurement

VFAT14 has something else very wrong that could not be explained by timing or inversion.

All the issues with calpulses showing up in the wrong VFAT also should have nothing to do with mapping or timing and could be indicative of something like crosstalk (which we know exists, since we see S-bits coming from disconnected VFATs).

We will be working on the timing question in the next few weeks and should have an idea soon how well things work, how consistent the parameters are across time, temperature, etc and hopefully should be able to fix some of these issues through firmware (but certainly not all of them).

Best wishes,

Andrew

lpetre-ulb commented 5 years ago

Hi Andrew,

Thank you very much for your very detailed reply.

GBT does not phase align data. It has a similar fixed delay, and we do a hand-scan of phase values to find the window and then fuse a hard-coded sampling phase into the chip.

I had seen the possibility for the GBTX to automatically choose the correct phase in some slides. By more carefully reading the manual, I see that this method is not resistant to SEUs. Too bad...

S-bit "mapping" only creates a rotation of the S-bits so that 01234567 becomes 70123456. The TDC just uses the OR of the entire VFAT, correct? In which case you should be able to proceed as is, right?

Indeed, the actual version of the TDC uses the OR of an entire VFAT. This is how we made the first measurement with the v3 electronics (see this elog). You can notice that we still have a lot of improvement to do, both on the setup and on the detector configuration.

However, the final aim is to measure the time resolution with the full Sbit granularity, that is by using the "Sbits word" coming from the "Sbits cluster packer". Modifying the TDC module is not difficult, but it is nearly impossible to test it without the proper Sbit mapping.

Regarding the rotation of the Sbits, are you sure that it is not possible that one Sbit is not correctly associated to the correct BX ? If you look at the histogram slide 7 of this presentation, it looks like there are three peaks. The leftmost one is roughly separated from the main one by 25ns, that is 1 BX. While working on the v2a with an old firmware which did not time align the Sbits, I observed a similar behavior. The fix (for the timing measurement) was to OR the Sbits on the VFAT2 itself and use only 1 Sbit transmission line.

On the GEB v3c, you can see perhaps a timing issue in VFAT18, VFAT8, VFAT0, VFAT11, VFAT3 but all of these problems would not be an issue if you are just using the OR of the VFAT for timing measurement

VFAT14 has something else very wrong that could not be explained by timing or inversion.

One remark about this plot; I don't know if it is written somewhere, but on this GEBv3c plot posted by Brian, the firmware uses the v3b taps configuration. More precisely, this is the first TDC firmware, based on version 3.1.2B. So that configuration has mixed hardware/firmware. It might explain a behavior different from that observed on other GEBv3c.

We will be working on the timing question in the next few weeks and should have an idea soon how well things work, how consistent the parameters are across time, temperature, etc and hopefully should be able to fix some of these issues through firmware (but certainly not all of them).

Let me know if I can help you in any way with this issue. I also think we received one long GEB at ULB.

Best regards, Laurent

bdorney commented 5 years ago

However, the final aim is to measure the time resolution with the full Sbit granularity, that is by using the "Sbits word" coming from the "Sbits cluster packer". Modifying the TDC module is not difficult, but it is nearly impossible to test it without the proper Sbit mapping.

This is both alarmist and also not true. You are able to see which sbits are mapped correctly using the checkSbitMappingAndRate.py tool. For those vfats that have mismapped sbits mask them from the trigger block in the OH using the instructions here. This enables you to make tests of your FW module seamlessly.

It's possible I didn't understand the conversation above due to ignorance. But it should be explicitly clear that we will not make design choices to the optohybrid firmware just to accommodate this TDC module. If you are interested in working on solving this sbit mismapping issue, which is a critical path issue for P5, please use the RPC module approach I've outlined above. Also since I think @andrewpeck has targeted this for his student you should discuss with him on how to contribute so we don't have two different people trying to solve the same problem (as that would be inefficient).

lpetre-ulb commented 5 years ago

However, the final aim is to measure the time resolution with the full Sbit granularity, that is by using the "Sbits word" coming from the "Sbits cluster packer". Modifying the TDC module is not difficult, but it is nearly impossible to test it without the proper Sbit mapping.

This is both alarmist and also not true. You are able to see which sbits are mapped correctly using the checkSbitMappingAndRate.py tool. For those vfats that have mismapped sbits mask them from the trigger block in the OH using the instructions here. This enables you to make tests of your FW module seamlessly.

Sure, I can mask the unused VFATs. However, there is only one position where it is possible to connect a VFAT on the GEM chamber at ULB and if the mapping of that VFAT is wrong, it won't help to mask it. And masking VFATs will not allow to test how the TDC behaves with the full detector : will slow controls sustain the acquisition rate ? won't the (little amount of) noise mask the signal since noise will be picked up from the full detector ? ...

It's possible I didn't understand the conversation above due to ignorance. But it should be explicitly clear that we will not make design choices to the optohybrid firmware just to accommodate this TDC module. If you are interested in working on solving this sbit mismapping issue, which is a critical path issue for P5, please use the RPC module approach I've outlined above. Also since I think @andrewpeck has targeted this for his student you should discuss with him on how to contribute so we don't have two different people trying to solve the same problem (as that would be inefficient).

Of course, it is not to accommodate the TDC module. The conversation above was about the Sbit mapping issue due to bad Sbit timing parameters in all generality. Yes, it is best if we collaborate on fixing the issue; that is the meaning of the last sentence of my previous post.

andrewpeck commented 5 years ago

Indeed, the actual version of the TDC uses the OR of an entire VFAT. This is how we made the first measurement with the v3 electronics (see this elog). You can notice that we still have a lot of improvement to do, both on the setup and on the detector configuration.

However, the final aim is to measure the time resolution with the full Sbit granularity, that is by using the "Sbits word" coming from the "Sbits cluster packer". Modifying the TDC module is not difficult, but it is nearly impossible to test it without the proper Sbit mapping.

Regarding the rotation of the Sbits, are you sure that it is not possible that one Sbit is not correctly associated to the correct BX ? If you look at the histogram slide 7 of this presentation, it looks like there are three peaks. The leftmost one is roughly separated from the main one by 25ns, that is 1 BX. While working on the v2a with an old firmware which did not time align the Sbits, I observed a similar behavior. The fix (for the timing measurement) was to OR the Sbits on the VFAT2 itself and use only 1 Sbit transmission line.

Yes, the timing is a whole separate issue that will need to be addressed as well.

The bx is determined by the alignment of the SoT relative to the 40MHz clock.

But right now the 40MHz clock phase is completely arbitrary, so depending on the phase the S-bits will end up split randomly into different bunches. We need to phase shift the 40MHz clock (done on the GBTx) to center the data so that the S-bits that are supposed to be synchronous fall in the same bx. Nobody has ever done this step (you are the first person besides me to even mention it... :( )
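A toy illustration of the effect, assuming the bunch crossing is simply the 25ns bin that an S-bit's arrival time falls into relative to the clock phase. This is a simplified picture for intuition only; `bxOf` and the example arrival times are hypothetical and not taken from measurements.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Bunch crossing assigned to an S-bit arriving at arrivalNs when the
// 40MHz clock edge sits at clockPhaseNs (25 ns per BX).
inline int64_t bxOf(double arrivalNs, double clockPhaseNs)
{
    return static_cast<int64_t>(std::floor((arrivalNs - clockPhaseNs) / 25.0));
}
```

Two S-bits from the same event arriving at 24ns and 26ns are split into BX 0 and BX 1 with a clock phase of 0ns, while shifting the phase to 12.5ns puts both in the same BX.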

One remark about this plot; I don't know if it is written somewhere, but on this GEBv3c plot posted by Brian, the firmware uses the v3b taps configuration. More precisely, this is the first TDC firmware, based on version 3.1.2B. So that configuration has mixed hardware/firmware. It might explain a behavior different from that observed on other GEBv3c.

Supposedly, based on the design files, the v3c and v3b should be the same, but this doesn't seem to be the case in reality :( So we need to figure out what it is supposed to be, but naively, to first order, they should be the same to the best of our knowledge, which is why Brian was using the v3b config on v3c electronics.

Our student is starting today with getting things set up. Hopefully it won't take very long to get some working config for the v3c.