Missing layer clusters (halo hits)

artlbv commented 7 years ago

This is to illustrate the issue of sporadically missing layer clusters.

It was observed that good multiclusters can sometimes miss one (or several) layer clusters even close to the shower maximum.

This plot shows all rechits associated to any multicluster for a single 25 GeV photon. The color coding reflects the rechit_flag: 0 stands for core hits, while 2 are halo hits. It is clearly visible that in layer 9 most rechits are halo hits.
This plot shows the same event as the previous but only for layer 9. There are only 2 core hits.
In this plot the colour of the rechits corresponds to rechit energy. Most energy is deposited in the central hits.
In this plot the colour of the rechits corresponds to the associated 2d cluster index. There are multiple 2d clusters, while only 1 has core hits.
In this plot the colour of the rechits corresponds to the associated multi cluster index. The central hits (light blue colour) correspond to a multicluster with ~22 GeV, i.e. the one matching the photon. Since these hits are all halo hits, their energy (~6GeV) is also not taken into account in the total (~90GeV).
In fact, the proper hits are assigned to two adjacent layer clusters. The color reflects the 2d cluster index. The boxes show the cluster seeds, and they appear to be adjacent cells. This would mean, that both hits pass the delta_c cut, but at least one of them should have a delta of 1cm – the cell distance.

I'll try to find this behaviour in the cmssw execution and post an update later.

artlbv commented 7 years ago

Ok, I have an idea related to the nearestHigher variable: If the nearest higher rho has a greater, but not greater equal comparison, then two nearby hits with the same rho will result in two separate cluster centers (seeds). If both hits are adjacent then both centers will be labeled border hits. Since the centers have the highest rho in the cluster, all hits will become halo hits. This would explain the case shown in these plots.

I'll check the definition of nearestHigher and post the conclusions here.

ZihengChen commented 7 years ago

When I did image algorithm on MNIST dataset, I found "same rho" is a big issue in a similar but slightly different way.

If two hits satisfy the two conditions:

adjacent
equally the highest rho

Image algorithm will make two separate seeds for the two hits (even though they are adjacent). In that case, the same as Artur said, these two will be flagged as border and halo, setting up a very high threshold for core that barely other hits can reach.

In order to avoid this, in the density calculation, I apply a gaussian kernel instead of direct summation.

ZihengChen commented 7 years ago

I am putting on more piece of information for debug. The threshold of hexagon cell. skippinglayer

artlbv commented 7 years ago

Could you clarify how you calculate rho with the Gaussian kernel? Is that implemented in the python code? This could explain the difference, i.e. that there is no missing layer in you plot.

By the way, it seems this is the part where nearestHigher is calculated: https://github.com/cms-sw/cmssw/blob/CMSSW_9_3_X/RecoLocalCalo/HGCalRecAlgos/src/HGCalImagingAlgo.cc#L273 It looks as the >= should avoid the same rho problem, but it requires a deeper look to confirm it. Maybe @clelange can comment on this?

ZihengChen commented 7 years ago

in the paper of image algorithm on page 1495. They talked about two alternative kernels used to calculate rho: exponential kernel and gaussian kernel. It is simply define X(d-d_c) in rho calculation as exponential function or gaussian function instead of square function in equation (1)

ZihengChen commented 7 years ago

Actually when doing CMSSW-python comparison in my plot, the kernel in python is the original square kernel as in equation (1) I used gaussian kernel only for clustering MNIST dataset.

clelange commented 7 years ago

By the way, it seems this is the part where nearestHigher is calculated: https://github.com/cms-sw/cmssw/blob/CMSSW_9_3_X/RecoLocalCalo/HGCalRecAlgos/src/HGCalImagingAlgo.cc#L273 It looks as the >= should avoid the same rho problem, but it requires a deeper look to confirm it. Maybe @clelange can comment on this?

Hi @artlbv, I thought you were already on it adding cout statements. If not, I can probably have a look tomorrow. Let me know.

artlbv commented 7 years ago

@clelange, I've actually confused you as the author of these lines, but it appears to be Lindsey, so I'll have a look myself, no problem.

artlbv commented 7 years ago

I believe I have finally understood this issue. The reason for the clusters to be completely halo is similar to the one I've described above, but is not due to some bug in the algorithm, but rather to the defined critical distance (2cm).

First, in my last plot of the rechits, the cluster seeds are defined by the ntupler as the highest energy hits in the layer cluster, not the maximum rho ones. I'm attaching the plot with the correct cluster seeds indicated by orange circles.

The distance between the seeds is 2.24 cm, which is slightly above the d_c of 2cm and, therefore, both seeds form independent clusters.
Since the two cluster centres each have hits from the other cluster nearby (<2cm), the seed hits become border hits.
Since the cluster centres are border hits, all other hits within the cluster also become border hits.

Therefore, the reason for this behaviour is that the critical distance is defined at 2cm, which does not necessarily cover all the closest hits (i.e. in the second "ring" around a hit).

Further, I'll try to estimate the rate of the present effect and then check it with an increased dc (~2.3).

artlbv commented 7 years ago

Illustration of this case

Here is a figure to illustrate this case using the hexagons (the colours are only for illustration). I am assuming a 1cm hexagon size, i.e. for the >=200um sensors.

The yellow cell is the highest-energy cell, while the red and black hexagons mark the two hits with the highest rho – the cluster seeds.
The black circle shows the critical distance (d_c) of 2cm
The red arrow is the distance between the cluster seeds
Since the red arrow is longer than the d_c, both seeds become cluster centres The rest is described above.

Generic example

This figure is to illustrate the general "problem" of a 2cm delta_c.

The black circle shows the 2cm critical distance around the central (yellow) cell
The outer black hexagons have their centres within this 2cm radius
But the red hexagon centres are ~2.24 away from the central cell and therefore outside of the circle

As another consequence, rho is calculated from all the cells within the circle, except the red ones.

NB! this concerns the cells with 1cm2 size, in case of the central 100um sensors, where the cells are smaller (0.5cm2), the d_c coverage is larger in terms of cells.

ZihengChen commented 7 years ago

HI Artur I think this makes a lot of sense. My understanding is on your first plot, the seeds are two orange circles which separates larger than dc, but the two boxes from neighboring clusterings are within dc of seeds and therefore make the seeds as border/halo hits. Is that right? I need to check how could the python ntuple-tool avoid such situations.

clelange commented 7 years ago

That makes me think that a sensible minimum d_c should be based on the cell size and not simply on the subdetector...

ZihengChen commented 7 years ago

I think we can use an exponential kernel to calculate the density rho. This will make the highest density sit on top of the most energetic cell. Then we can somehow avoid this case to some extent.

ZihengChen commented 7 years ago

And making dc~2.24 would also put the highest density sit on the most energetic cell in the center. Thus avoid this problem to some extent.

artlbv commented 7 years ago

Let me link here the slides from yesterday's presentation: indico

The only plot now shown here is the one illustrating the distances in the two different cell sizes. I calculated these numbers based on the cell areas, and they do agree fairly well with the values from the rechits in the samples.

Now, the problem with increasing the dc to 2.24 is that this might affect the smaller cells. In order to include 2 rings around the cell and not the third ring the dc has to satisfy: 2d < d_c < 2.65d. In numbers this means (based on the rechits' distance):

0.5cm2 cells: 1.65 < dc < 2.18
1cm2 cells: 2.25 < dc < 2.97

Unfortunately, these two ranges do not overlap, and therefore there is no way to define a consistent dc for both sensor types. I.e. choosing dc = 2.3 would cause a similar issue with ring 3 in the 100um cells.

In order to go on with this issue, I support Chris' suggestion to ignore the core/halo definitions for now and calculate the cluster position and energy based on all clustered rechits in vicinity of the max-energy hit.

@clelange, do you think this could be implemented already in the ImagingAlgo? One would need to add a step to find the max energy cell and then use the distance to it in the following. Another suggestion: recycle the halo/core definition using the distance to the max energy hit. This would probably require the least modifications to the code: just adding a function that resets the isHalo bit based on the distance to the max energy hit.

artlbv commented 7 years ago

p.s. @ZihengChen I'm also curious why this did not show up in the standalone – maybe there's another discrepancy? I remember you saying that in some cases CMSSW did not even create a layer cluster, while the python code did. But I've not seen such cases of really missing layer clusters yet.

ZihengChen commented 7 years ago

I said CMSSW did not even create a layer cluster, because in the ntuples of CMSSW, on the skipped layer, rechits_cluster2d are all default -1, which means rechits are not assigned to any cluster2d. Also the cluster2d collections is an empty list.

ZihengChen commented 7 years ago

cluster2d_layer10_same

Rechits are from CMSSW ntuple, its color indicates the assigned 2dcluster of a rechit by CMSSW and are all -1, the default value.

Based these rechits, standalone makes layerclusters (black frames) while CMSSW does not (cluster2d is empty).

artlbv commented 7 years ago

This seems indeed strange, since -1 actually means that the rechits were not even considered for 2d clustering, i.e. did not pass the selection threshold (I think it is only the sigmaNoise requirement). Could you please give the sample, run and event number for this event? And also the cluster id or coordinates.

ZihengChen commented 7 years ago

Yes. that is right. If at least one hit is not -1 and get clustered, the layercluster collection will not be an empty. It will be great if you can have a look at this layer. Its information is the following

ntuple directory:

/eos/cms/store/cmst3/group/hgcal/CMG_studies/Production/_RelValSingleElectronPt35Extended_CMSSW_9_3_0_pre4-93X_upgrade2023_realistic_v0_2023D17noPU-v1_GEN-SIM-RECO/NTUP/

ntuple file:

_RelValSingleElectronPt35Extended_CMSSW_9_3_0_pre4-93X_upgrade2023_realistic_v0_2023D17noPU-v1_GEN-SIM-RECO_NTUP_1.root

event:

event number 2252, the 52th record in the ntuple.

missing layer:

the +10th layer, on the z>0 side of the detector.

Please let me know anything you see if you are looking at that layer.

ZihengChen commented 7 years ago

density This was what I saw.

artlbv commented 7 years ago

Thanks @ZihengChen, I've found the event in the ntuple, but I'll need to rerun the ntupler to debug the clustering. Meanwhile, I've added the possibility to use halo hits (close to the axis) in the PCA and shower shape computation. The results seem to have improved slightly, but no dramatic changes. See #48.

artlbv commented 7 years ago

I've put a summary and solution of this issue in these slides: https://indico.cern.ch/event/665253/contributions/2719409/attachments/1521785/2377702/hgc_halo_issue_extended.pdf

CMS-HGCAL / reco-ntuples

Missing layer clusters (halo hits) #43

Illustration of this case

Generic example