Change DAC scan cut on chi2 to cut on chi2/ndof

AndrewLevin commented 4 years ago

The cut on the chisquared of the DAC scan fit before an exception is thrown is changed to a cut on the chisquared divided by number of degrees of freedom. The default value of the cut is set to 20.

Description

The cut on the value of chisquared of DAC vs ADC fits was too strict, and did not take into consideration that different DACs have different ranges.

Types of changes

[ ] Bug fix (non-breaking change which fixes an issue)
[x] New feature (non-breaking change which adds functionality)
[ ] Breaking change (fix or feature that would cause existing functionality to change)

Motivation and Context

Partially resolves issue https://github.com/cms-gem-daq-project/gem-plotting-tools/issues/262

How Has This Been Tested?

Yes, I tested that non of the DAC scan fits in the example provided by @lpetre-ulb have chisquared/ndof larger than 10, then I added a safety factor of around 100%.

Screenshots (if appropriate):

Checklist:

[x] My code follows the code style of this project.
[ ] My change requires a change to the documentation.
[ ] I have updated the documentation accordingly.
[ ] I have added tests to cover my changes.
[ ] All new and existing tests passed.

lpetre-ulb commented 4 years ago

Thank you for the prompt PR. Have you tested tested that this new cut is still stringent enough to catch the bad DAC scan fits? In your previous PR you are writing that bad DAC scan fits have a chi² around 1400. If I assume your were talking about the CFG_HYST DAC, it would mean a chi²/NDF just above 21.

AndrewLevin commented 4 years ago

It is CFG_HYST, and the chisquared/ndof = 1457.5165/58 = 25.13

I have also checked the only other case that we are aware of (https://mattermost.web.cern.ch/cms-gem-ops/pl/xxspysnb3bgzzjkfozr8dwpwdh), and found that the chisquared/ndof value is even larger.

I have just updated the cut value to 15. Have you found any good-looking fits that exceed this?

lpetre-ulb commented 4 years ago

It is CFG_HYST, and the chisquared/ndof = 1457.5165/58 = 25.13

Right... the fifth order polynomial...

I have also checked the only other case that we are aware of (https://mattermost.web.cern.ch/cms-gem-ops/pl/xxspysnb3bgzzjkfozr8dwpwdh), and found that the chisquared/ndof value is even larger.

I have just updated the cut value to 15. Have you found any good-looking fits that exceed this?

Not in the few examples I've seen. I'll have 6 setups where the VFATs are know to work ready tomorrow; I could probably test your new cut on all of them at once in the evening.

lpetre-ulb commented 4 years ago

So, I just launched the DAC scan analysis on 5 setups (or 120 VFATs). A large χ² is rightfully reported for a few fits:

Warning: large chisquare for VFAT14 of chamber TestStand10 (Shelf1,Slot4,OH4) DAC CFG_BIAS_CFD_DAC_2.
Warning: large chisquare for VFAT14 of chamber TestStand10 (Shelf1,Slot4,OH4) DAC CFG_BIAS_PRE_I_BIT.
Warning: large chisquare for VFAT14 of chamber TestStand10 (Shelf1,Slot4,OH4) DAC CFG_BIAS_PRE_I_BLCC.
Warning: large chisquare for VFAT14 of chamber TestStand10 (Shelf1,Slot4,OH4) DAC CFG_BIAS_PRE_I_BSF.
Warning: large chisquare for VFAT11 of chamber TestStand10 (Shelf1,Slot4,OH0) DAC CFG_BIAS_SD_I_BDIFF.
Warning: large chisquare for VFAT14 of chamber TestStand10 (Shelf1,Slot4,OH4) DAC CFG_BIAS_SD_I_BDIFF.
Warning: large chisquare for VFAT14 of chamber TestStand10 (Shelf1,Slot4,OH4) DAC CFG_BIAS_SD_I_BFCAS.
Warning: large chisquare for VFAT14 of chamber TestStand10 (Shelf1,Slot4,OH4) DAC CFG_BIAS_SD_I_BSF.
Warning: large chisquare for VFAT14 of chamber TestStand10 (Shelf1,Slot4,OH4) DAC CFG_BIAS_SH_I_BDIFF.
Warning: large chisquare for VFAT16 of chamber TestStand10 (Shelf1,Slot4,OH10) DAC CFG_BIAS_SH_I_BDIFF.
Warning: large chisquare for VFAT14 of chamber TestStand10 (Shelf1,Slot4,OH4) DAC CFG_BIAS_SH_I_BFCAS.
Warning: large chisquare for VFAT14 of chamber TestStand10 (Shelf1,Slot4,OH4) DAC CFG_CAL_DAC.
Warning: large chisquare for VFAT14 of chamber TestStand10 (Shelf1,Slot4,OH4) DAC CFG_HYST.
[Excluding THR_ARM_DAC and THR_ZCC_DAC.]

The VFAT (1,4,4,14) clearly has an issue with the DAC scans (great, now we know it is a bad VFAT!)
The VFAT (1,4,0,11) is close to be well fitted.
The VFAT (1,4,10,16) is not very well fitted.

Canvas_1

In such cases, restricting the fit range really helps. I'm not sure the current criteria should be relaxed to allow this kind of (bad) fit.

Please find the root files here.

For the script to succeed, I also had to patch anautilities.py.

Is it expected?

Also, the chamber name in the warning messages in always the same. I think that detName should be updated in the loop lines 310 and that there is a typo (get**D**ame) in line 351: https://github.com/cms-gem-daq-project/gem-plotting-tools/blob/f38674504fd96af320e881a50b04e8564b260345/utils/anautilities.py#L351

More generally, why do we scan the whole DAC range then fit them for regular operation? I agree this is useful for QC and debugging. However, this is a slow process which doesn't allow to find the right value for the IREF DAC. That DAC could also change with the total integrated dose. Shouldn't we (also ?) have an RPC method which find the DAC working point as described in the VFAT3 manual for the IREF DAC (i.e. changing the value around the default one)? This procedure would be faster, would avoid any kind of fit failures and would allow to properly scan the IREF (and eventually the SLVS driver DACs).

jsturdy commented 4 years ago

More generally, why do we scan the whole DAC range then fit them for regular operation? I agree this is useful for QC and debugging. However, this is a slow process which doesn't allow to find the right value for the IREF DAC. That DAC could also change with the total integrated dose. Shouldn't we (also ?) have an RPC method which find the DAC working point as described in the VFAT3 manual for the IREF DAC (i.e. changing the value around the default one)? This procedure would be faster, would avoid any kind of fit failures and would allow to properly scan the IREF (and eventually the SLVS driver DACs).

I think the motivation for the current system is that these DAC scans aren't going to be done regularly, rather, only during QC and eventually as a way to monitor the performance over time (TID), and in these cases I think having the full curve is useful. At the same time, I have no problem making a routine that can give more granular control over the scan itself...

AndrewLevin commented 4 years ago

Sorry, I just fixed the three bugs @lpetre-ulb found.

I am not sure I would call the DAC scan and analysis a slow process. It takes like 7 minutes, right?

I would propose that we use the TGraphError's built-in Eval function to determine the register setting and we use the fit just as a mechanism to flag possibly bad VFATs for a human to look at.

Regarding the IREF setting, maybe we should ask the VFAT developers about this. My understanding is that now they provide us with an IREF setting in a calibration file, just like the ADC0 slope and intercept, but I don't understand why this register is different from the other ones and why we cannot or should not scan it also.

jsturdy commented 4 years ago

Regarding the IREF setting, maybe we should ask the VFAT developers about this. My understanding is that now they provide us with an IREF setting in a calibration file, just like the ADC0 slope and intercept, but I don't understand why this register is different from the other ones and why we cannot or should not scan it also.

If I recall correctly, it is because it is possible to write values that would cause damage to the ASIC itself. So we decided to not scan IREF (rather than simply only scanning within the valid range) (from email)

Regarding the comments yesterday you stated that when performing a dac scan on CFG_IREF the register should only be scanned within 0x20 and 0x32 (inclusive).

AndrewLevin commented 4 years ago

Regarding the IREF setting, maybe we should ask the VFAT developers about this. My understanding is that now they provide us with an IREF setting in a calibration file, just like the ADC0 slope and intercept, but I don't understand why this register is different from the other ones and why we cannot or should not scan it also.

If I recall correctly, it is because it is possible to write values that would cause damage to the ASIC itself. So we decided to not scan IREF (rather than simply only scanning within the valid range) (from email)

Regarding the comments yesterday you stated that when performing a dac scan on CFG_IREF the register should only be scanned within 0x20 and 0x32 (inclusive).

OK, then it seems to me like we should just scan the register in that range, so it is still not clear to me why they provide us with a value for the IREF register in the calibration file.

AndrewLevin commented 4 years ago

Why was this closed?

jsturdy commented 4 years ago

oops, i must have hit some keyboard shortcut :-(

lpetre-ulb commented 4 years ago

Sorry, I just fixed the three bugs @lpetre-ulb found.

Thanks! Is there a reason why you didn't also update the for-loop line 311 (missing detName update)?

https://github.com/cms-gem-daq-project/gem-plotting-tools/blob/75bb009970592a4b9b8e9f01674ef5f5f450dfd6/utils/anautilities.py#L311-L332

I am not sure I would call the DAC scan and analysis a slow process. It takes like 7 minutes, right?

Something like that, maybe a bit less. However, it cannot be parallelized and the tracking data scan developments will not help.

I would propose that we use the TGraphError's built-in Eval function to determine the register setting and we use the fit just as a mechanism to flag possibly bad VFATs for a human to look at.

:+1: The error could even be smaller than with the current method since the useful DAC range is very linear and the data points are very close to each other. The fit might report some false positives, but that's better than false negatives.

Regarding the IREF setting, maybe we should ask the VFAT developers about this. My understanding is that now they provide us with an IREF setting in a calibration file, just like the ADC0 slope and intercept, but I don't understand why this register is different from the other ones and why we cannot or should not scan it also.

Regarding the IREF setting, maybe we should ask the VFAT developers about this. My understanding is that now they provide us with an IREF setting in a calibration file, just like the ADC0 slope and intercept, but I don't understand why this register is different from the other ones and why we cannot or should not scan it also.

If I recall correctly, it is because it is possible to write values that would cause damage to the ASIC itself. So we decided to not scan IREF (rather than simply only scanning within the valid range) (from email)

Regarding the comments yesterday you stated that when performing a dac scan on CFG_IREF the register should only be scanned within 0x20 and 0x32 (inclusive).

OK, then it seems to me like we should just scan the register in that range, so it is still not clear to me why they provide us with a value for the IREF register in the calibration file.

In addition to damaging the ASIC, you can also loose the communication with the VFAT. I'm also not sure why the IREF value is provided; probably because it is crucial to set it right. You can find the calibration procedure in the VFAT3 manual, page 42.

From my understanding, only the VREF_ADC value and ADC0 calibration constants are absolutely required (ADC1 is probably not usable on GE1/1 because the external reference voltage changes with the FEAST and VFAT position). We should be able to re-calibrate all other DACs. And, if the DACs suffer from the TID, there is no reason why the IREF DAC wouldn't be affected.

I think the motivation for the current system is that these DAC scans aren't going to be done regularly, rather, only during QC and eventually as a way to monitor the performance over time (TID), and in these cases I think having the full curve is useful. At the same time, I have no problem making a routine that can give more granular control over the scan itself...

Okay. So, complete DAC scans will be performed on a regular basis. The only issue is that some DACs cannot really be scanned.... Therefore, we may need to have two methods (or a single, more generic, method):

The first method would perform a DAC scan when possible; all analysis is done offline and the data can be used for performance monitoring.
The second method would find (and set) the DAC value which correspond to the nominal current/voltage for all DACs; analysis must be done online.

AndrewLevin commented 4 years ago

Sorry, the last detName bug is fixed now.

I thought the DAC scan is already parallelized.

So, I will make a separate pull request to enable the scan of IREF in a restricted range. But I don't see any information about setting IREF on page 42 of this link:

https://espace.cern.ch/cms-project-GEMElectronics/VFAT3/VFAT3%20User%20Manual%20v2.2.pdf

So, the only DAC that cannot be scanned is VREF_ADC, right?

lpetre-ulb commented 4 years ago

Sorry, the last detName bug is fixed now.

Thank you!

I thought the DAC scan is already parallelized.

Well, the RPC method is indeed "mutilink". But... the DAC scans are internally serialized for each unmasked OH. It cannot be be truly parallelized since the VFAT slow control transactions can only be done one by one. I personally think that these kind of "multilink" scans should be deleted when migrating to the new framework; the name is only misleading and does not bring anything.

So, I will make a separate pull request to enable the scan of IREF in a restricted range. But I don't see any information about setting IREF on page 42 of this link:

https://espace.cern.ch/cms-project-GEMElectronics/VFAT3/VFAT3%20User%20Manual%20v2.2.pdf

It's on page 42 of the manual, so page 60 of the PDF. The calibration procedure does not scan a restricted range, but iterates around the default DAC value until the nominal current is reached.

Since, this "scan" is not currently required for the QC steps, I would implement it directly in the new framework. The RPC method would take the DAC to be calibrated and the ADC calibration constants as inputs are return the found working point for that DAC. If it cannot be found, an exception can be thrown. Of course, the ranges should be restricted to keep the ASICs safe.

It is probably worth opening an issue in the ctp7_modules and have a discussion with the VFAT3 designers though.

So, the only DAC that cannot be scanned is VREF_ADC, right?

Yes. It could be scanned, but then the calibration constants for the ADC would be different and we don't have any possibility to re-calibrate it (requires external tools).

jsturdy commented 4 years ago

I thought the DAC scan is already parallelized.

Nothing that is done in software is truly "parallelized", simply because it requires reading/writing a specific register. What can probably be done, is a reoptimization of the loops for the "multi-link" mode, as it currently just loops over the links and performs the dacScanLocal function, in which there is a 1s sleep for every DAC... I've mentioned this to @bregnery, but for anyone who wants to look at how the loops are done, and figure out the optimization, that's something we should address in the post templated framework (not before)

So, I will make a separate pull request to enable the scan of IREF in a restricted range. But I don't see any information about setting IREF on page 42 of this link:

https://espace.cern.ch/cms-project-GEMElectronics/VFAT3/VFAT3%20User%20Manual%20v2.2.pdf

On page 42, the procedure is described, but it cannot be performed in situ in a closed chamber... There was also a promise from the VFAT3 team that the manual would be updated, but as yet it hasn't happened. We could ping Aamir or Franceso to see if this is still in the works, or if it has been done (though they may be waiting on the results from the irradiation studies)

So, the only DAC that cannot be scanned is VREF_ADC, right?

See page 43 for a description on how to tune ADC0 (probably this is the value that is provided in the DB)

AndrewLevin commented 4 years ago

Apparently there are two page 42s of the VFAT3 manual, one on page 44 of the pdf and one on page 60 of the pdf...

I think probably we can already merge this pull request since it is at least an improvement from what we have now, and continue the IREF discussion elsewhere, and I will make a separate pull request to use the TGraphError's evaluation.

lpetre-ulb commented 4 years ago

I thought the DAC scan is already parallelized.

Nothing that is done in software is truly "parallelized", simply because it requires reading/writing a specific register. What can probably be done, is a reoptimization of the loops for the "multi-link" mode, as it currently just loops over the links and performs the dacScanLocal function, in which there is a 1s sleep for every DAC... I've mentioned this to @bregnery, but for anyone who wants to look at how the loops are done, and figure out the optimization, that's something we should address in the post templated framework (not before)

Right, it should be possible to merge the 1s sleep, but I'm not convinced that it will be a real improvement. The scans are intrinsically much longer.

Also, the 1s sleep should probably be completely removed. It is only there because this RPC method switches the VFATs into run mode at the begin and switches them out of run mode at the end. I believe this should be done somewhere else; the RPC method shouldn't "turn on/off" the VFATs in an uncontrolled/unconsistent way.

Apparently there are two page 42s of the VFAT3 manual, one on page 44 of the pdf and one on page 60 of the pdf...

... that manual... we really need a new version...

I think probably we can already merge this pull request since it is at least an improvement from what we have now, and continue the IREF discussion elsewhere, and I will make a separate pull request to use the TGraphError's evaluation.

:+1:

jsturdy commented 4 years ago

may as well rebase while you're at it

cms-gem-daq-project / gem-plotting-tools