Closed artlbv closed 7 years ago
Ok, I have an idea related to the nearestHigher variable: If the nearest higher rho has a greater, but not greater equal comparison, then two nearby hits with the same rho will result in two separate cluster centers (seeds). If both hits are adjacent then both centers will be labeled border hits. Since the centers have the highest rho in the cluster, all hits will become halo hits. This would explain the case shown in these plots.
I'll check the definition of nearestHigher and post the conclusions here.
When I did image algorithm on MNIST dataset, I found "same rho" is a big issue in a similar but slightly different way.
If two hits satisfy the two conditions:
Image algorithm will make two separate seeds for the two hits (even though they are adjacent). In that case, the same as Artur said, these two will be flagged as border and halo, setting up a very high threshold for core that barely other hits can reach.
In order to avoid this, in the density calculation, I apply a gaussian kernel instead of direct summation.
I am putting on more piece of information for debug. The threshold of hexagon cell.
Could you clarify how you calculate rho with the Gaussian kernel? Is that implemented in the python code? This could explain the difference, i.e. that there is no missing layer in you plot.
By the way, it seems this is the part where nearestHigher is calculated: https://github.com/cms-sw/cmssw/blob/CMSSW_9_3_X/RecoLocalCalo/HGCalRecAlgos/src/HGCalImagingAlgo.cc#L273 It looks as the >= should avoid the same rho problem, but it requires a deeper look to confirm it. Maybe @clelange can comment on this?
in the paper of image algorithm on page 1495. They talked about two alternative kernels used to calculate rho: exponential kernel and gaussian kernel. It is simply define X(d-d_c) in rho calculation as exponential function or gaussian function instead of square function in equation (1)
Actually when doing CMSSW-python comparison in my plot, the kernel in python is the original square kernel as in equation (1) I used gaussian kernel only for clustering MNIST dataset.
By the way, it seems this is the part where nearestHigher is calculated: https://github.com/cms-sw/cmssw/blob/CMSSW_9_3_X/RecoLocalCalo/HGCalRecAlgos/src/HGCalImagingAlgo.cc#L273 It looks as the >= should avoid the same rho problem, but it requires a deeper look to confirm it. Maybe @clelange can comment on this?
Hi @artlbv, I thought you were already on it adding cout statements. If not, I can probably have a look tomorrow. Let me know.
@clelange, I've actually confused you as the author of these lines, but it appears to be Lindsey, so I'll have a look myself, no problem.
I believe I have finally understood this issue. The reason for the clusters to be completely halo is similar to the one I've described above, but is not due to some bug in the algorithm, but rather to the defined critical distance (2cm).
The distance between the seeds is 2.24 cm, which is slightly above the d_c of 2cm and, therefore, both seeds form independent clusters.
Since the two cluster centres each have hits from the other cluster nearby (<2cm), the seed hits become border hits.
Since the cluster centres are border hits, all other hits within the cluster also become border hits.
Therefore, the reason for this behaviour is that the critical distance is defined at 2cm, which does not necessarily cover all the closest hits (i.e. in the second "ring" around a hit).
Further, I'll try to estimate the rate of the present effect and then check it with an increased dc (~2.3).
Here is a figure to illustrate this case using the hexagons (the colours are only for illustration). I am assuming a 1cm hexagon size, i.e. for the >=200um sensors.
This figure is to illustrate the general "problem" of a 2cm delta_c.
As another consequence, rho is calculated from all the cells within the circle, except the red ones.
NB! this concerns the cells with 1cm2 size, in case of the central 100um sensors, where the cells are smaller (0.5cm2), the d_c coverage is larger in terms of cells.
HI Artur I think this makes a lot of sense. My understanding is on your first plot, the seeds are two orange circles which separates larger than dc, but the two boxes from neighboring clusterings are within dc of seeds and therefore make the seeds as border/halo hits. Is that right? I need to check how could the python ntuple-tool avoid such situations.
That makes me think that a sensible minimum d_c
should be based on the cell size and not simply on the subdetector...
I think we can use an exponential kernel to calculate the density rho. This will make the highest density sit on top of the most energetic cell. Then we can somehow avoid this case to some extent.
And making dc~2.24 would also put the highest density sit on the most energetic cell in the center. Thus avoid this problem to some extent.
Let me link here the slides from yesterday's presentation: indico
The only plot now shown here is the one illustrating the distances in the two different cell sizes. I calculated these numbers based on the cell areas, and they do agree fairly well with the values from the rechits in the samples.
Now, the problem with increasing the dc to 2.24 is that this might affect the smaller cells. In order to include 2 rings around the cell and not the third ring the dc has to satisfy: 2d < d_c < 2.65d
.
In numbers this means (based on the rechits' distance):
Unfortunately, these two ranges do not overlap, and therefore there is no way to define a consistent dc for both sensor types. I.e. choosing dc = 2.3 would cause a similar issue with ring 3 in the 100um cells.
In order to go on with this issue, I support Chris' suggestion to ignore the core/halo definitions for now and calculate the cluster position and energy based on all clustered rechits in vicinity of the max-energy hit.
@clelange, do you think this could be implemented already in the ImagingAlgo? One would need to add a step to find the max energy cell and then use the distance to it in the following.
Another suggestion: recycle the halo/core definition using the distance to the max energy hit. This would probably require the least modifications to the code: just adding a function that resets the isHalo
bit based on the distance to the max energy hit.
p.s. @ZihengChen I'm also curious why this did not show up in the standalone – maybe there's another discrepancy? I remember you saying that in some cases CMSSW did not even create a layer cluster, while the python code did. But I've not seen such cases of really missing layer clusters yet.
I said CMSSW did not even create a layer cluster, because in the ntuples of CMSSW, on the skipped layer, rechits_cluster2d
are all default -1
, which means rechits are not assigned to any cluster2d. Also the cluster2d collections is an empty list.
Rechits are from CMSSW ntuple, its color indicates the assigned 2dcluster of a rechit by CMSSW and are all -1, the default value.
Based these rechits, standalone makes layerclusters (black frames) while CMSSW does not (cluster2d is empty).
This seems indeed strange, since -1 actually means that the rechits were not even considered for 2d clustering, i.e. did not pass the selection threshold (I think it is only the sigmaNoise requirement). Could you please give the sample, run and event number for this event? And also the cluster id or coordinates.
Yes. that is right. If at least one hit is not -1 and get clustered, the layercluster collection will not be an empty. It will be great if you can have a look at this layer. Its information is the following
ntuple directory:
/eos/cms/store/cmst3/group/hgcal/CMG_studies/Production/_RelValSingleElectronPt35Extended_CMSSW_9_3_0_pre4-93X_upgrade2023_realistic_v0_2023D17noPU-v1_GEN-SIM-RECO/NTUP/
ntuple file:
_RelValSingleElectronPt35Extended_CMSSW_9_3_0_pre4-93X_upgrade2023_realistic_v0_2023D17noPU-v1_GEN-SIM-RECO_NTUP_1.root
event:
event number 2252, the 52th record in the ntuple.
missing layer:
the +10th layer, on the z>0 side of the detector.
Please let me know anything you see if you are looking at that layer.
This was what I saw.
Thanks @ZihengChen, I've found the event in the ntuple, but I'll need to rerun the ntupler to debug the clustering. Meanwhile, I've added the possibility to use halo hits (close to the axis) in the PCA and shower shape computation. The results seem to have improved slightly, but no dramatic changes. See #48.
I've put a summary and solution of this issue in these slides: https://indico.cern.ch/event/665253/contributions/2719409/attachments/1521785/2377702/hgc_halo_issue_extended.pdf
This is to illustrate the issue of sporadically missing layer clusters.
It was observed that good multiclusters can sometimes miss one (or several) layer clusters even close to the shower maximum.
This plot shows all rechits associated to any multicluster for a single 25 GeV photon. The color coding reflects the rechit_flag: 0 stands for core hits, while 2 are halo hits. It is clearly visible that in layer 9 most rechits are halo hits.
This plot shows the same event as the previous but only for layer 9. There are only 2 core hits.
In this plot the colour of the rechits corresponds to rechit energy. Most energy is deposited in the central hits.
In this plot the colour of the rechits corresponds to the associated 2d cluster index. There are multiple 2d clusters, while only 1 has core hits.
In this plot the colour of the rechits corresponds to the associated multi cluster index. The central hits (light blue colour) correspond to a multicluster with ~22 GeV, i.e. the one matching the photon. Since these hits are all halo hits, their energy (~6GeV) is also not taken into account in the total (~90GeV).
In fact, the proper hits are assigned to two adjacent layer clusters. The color reflects the 2d cluster index. The boxes show the cluster seeds, and they appear to be adjacent cells. This would mean, that both hits pass the delta_c cut, but at least one of them should have a delta of 1cm – the cell distance.
I'll try to find this behaviour in the cmssw execution and post an update later.