CMS-HGCAL / reco-ntuples

Home of the Ntuplizer for the HGCAL reconstruction software studies
6 stars 36 forks source link

Missing layer clusters (halo hits) #43

Closed artlbv closed 7 years ago

artlbv commented 7 years ago

This is to illustrate the issue of sporadically missing layer clusters.

It was observed that good multiclusters can sometimes miss one (or several) layer clusters even close to the shower maximum.

I'll try to find this behaviour in the cmssw execution and post an update later.

artlbv commented 7 years ago

Ok, I have an idea related to the nearestHigher variable: If the nearest higher rho has a greater, but not greater equal comparison, then two nearby hits with the same rho will result in two separate cluster centers (seeds). If both hits are adjacent then both centers will be labeled border hits. Since the centers have the highest rho in the cluster, all hits will become halo hits. This would explain the case shown in these plots.

I'll check the definition of nearestHigher and post the conclusions here.

ZihengChen commented 7 years ago

When I did image algorithm on MNIST dataset, I found "same rho" is a big issue in a similar but slightly different way.

If two hits satisfy the two conditions:

Image algorithm will make two separate seeds for the two hits (even though they are adjacent). In that case, the same as Artur said, these two will be flagged as border and halo, setting up a very high threshold for core that barely other hits can reach.

In order to avoid this, in the density calculation, I apply a gaussian kernel instead of direct summation.

ZihengChen commented 7 years ago

I am putting on more piece of information for debug. The threshold of hexagon cell. skippinglayer

artlbv commented 7 years ago

Could you clarify how you calculate rho with the Gaussian kernel? Is that implemented in the python code? This could explain the difference, i.e. that there is no missing layer in you plot.

By the way, it seems this is the part where nearestHigher is calculated: https://github.com/cms-sw/cmssw/blob/CMSSW_9_3_X/RecoLocalCalo/HGCalRecAlgos/src/HGCalImagingAlgo.cc#L273 It looks as the >= should avoid the same rho problem, but it requires a deeper look to confirm it. Maybe @clelange can comment on this?

ZihengChen commented 7 years ago

in the paper of image algorithm on page 1495. They talked about two alternative kernels used to calculate rho: exponential kernel and gaussian kernel. It is simply define X(d-d_c) in rho calculation as exponential function or gaussian function instead of square function in equation (1)

screen shot 2017-08-31 at 9 38 16 pm screen shot 2017-08-31 at 9 37 49 pm
ZihengChen commented 7 years ago

Actually when doing CMSSW-python comparison in my plot, the kernel in python is the original square kernel as in equation (1) I used gaussian kernel only for clustering MNIST dataset.

clelange commented 7 years ago

By the way, it seems this is the part where nearestHigher is calculated: https://github.com/cms-sw/cmssw/blob/CMSSW_9_3_X/RecoLocalCalo/HGCalRecAlgos/src/HGCalImagingAlgo.cc#L273 It looks as the >= should avoid the same rho problem, but it requires a deeper look to confirm it. Maybe @clelange can comment on this?

Hi @artlbv, I thought you were already on it adding cout statements. If not, I can probably have a look tomorrow. Let me know.

artlbv commented 7 years ago

@clelange, I've actually confused you as the author of these lines, but it appears to be Lindsey, so I'll have a look myself, no problem.

artlbv commented 7 years ago

I believe I have finally understood this issue. The reason for the clusters to be completely halo is similar to the one I've described above, but is not due to some bug in the algorithm, but rather to the defined critical distance (2cm).

image

Therefore, the reason for this behaviour is that the critical distance is defined at 2cm, which does not necessarily cover all the closest hits (i.e. in the second "ring" around a hit).

Further, I'll try to estimate the rate of the present effect and then check it with an increased dc (~2.3).

artlbv commented 7 years ago

Illustration of this case

Here is a figure to illustrate this case using the hexagons (the colours are only for illustration). I am assuming a 1cm hexagon size, i.e. for the >=200um sensors.

image

Generic example

This figure is to illustrate the general "problem" of a 2cm delta_c.

As another consequence, rho is calculated from all the cells within the circle, except the red ones.

image

NB! this concerns the cells with 1cm2 size, in case of the central 100um sensors, where the cells are smaller (0.5cm2), the d_c coverage is larger in terms of cells.

ZihengChen commented 7 years ago

HI Artur I think this makes a lot of sense. My understanding is on your first plot, the seeds are two orange circles which separates larger than dc, but the two boxes from neighboring clusterings are within dc of seeds and therefore make the seeds as border/halo hits. Is that right? I need to check how could the python ntuple-tool avoid such situations.

clelange commented 7 years ago

That makes me think that a sensible minimum d_c should be based on the cell size and not simply on the subdetector...

ZihengChen commented 7 years ago

I think we can use an exponential kernel to calculate the density rho. This will make the highest density sit on top of the most energetic cell. Then we can somehow avoid this case to some extent.

ZihengChen commented 7 years ago

And making dc~2.24 would also put the highest density sit on the most energetic cell in the center. Thus avoid this problem to some extent.

artlbv commented 7 years ago

Let me link here the slides from yesterday's presentation: indico

The only plot now shown here is the one illustrating the distances in the two different cell sizes. I calculated these numbers based on the cell areas, and they do agree fairly well with the values from the rechits in the samples. image

Now, the problem with increasing the dc to 2.24 is that this might affect the smaller cells. In order to include 2 rings around the cell and not the third ring the dc has to satisfy: 2d < d_c < 2.65d. In numbers this means (based on the rechits' distance):

Unfortunately, these two ranges do not overlap, and therefore there is no way to define a consistent dc for both sensor types. I.e. choosing dc = 2.3 would cause a similar issue with ring 3 in the 100um cells.

In order to go on with this issue, I support Chris' suggestion to ignore the core/halo definitions for now and calculate the cluster position and energy based on all clustered rechits in vicinity of the max-energy hit.

@clelange, do you think this could be implemented already in the ImagingAlgo? One would need to add a step to find the max energy cell and then use the distance to it in the following. Another suggestion: recycle the halo/core definition using the distance to the max energy hit. This would probably require the least modifications to the code: just adding a function that resets the isHalo bit based on the distance to the max energy hit.

artlbv commented 7 years ago

p.s. @ZihengChen I'm also curious why this did not show up in the standalone – maybe there's another discrepancy? I remember you saying that in some cases CMSSW did not even create a layer cluster, while the python code did. But I've not seen such cases of really missing layer clusters yet.

ZihengChen commented 7 years ago

I said CMSSW did not even create a layer cluster, because in the ntuples of CMSSW, on the skipped layer, rechits_cluster2d are all default -1, which means rechits are not assigned to any cluster2d. Also the cluster2d collections is an empty list.

ZihengChen commented 7 years ago

cluster2d_layer10_same

Rechits are from CMSSW ntuple, its color indicates the assigned 2dcluster of a rechit by CMSSW and are all -1, the default value.

Based these rechits, standalone makes layerclusters (black frames) while CMSSW does not (cluster2d is empty).

artlbv commented 7 years ago

This seems indeed strange, since -1 actually means that the rechits were not even considered for 2d clustering, i.e. did not pass the selection threshold (I think it is only the sigmaNoise requirement). Could you please give the sample, run and event number for this event? And also the cluster id or coordinates.

ZihengChen commented 7 years ago

Yes. that is right. If at least one hit is not -1 and get clustered, the layercluster collection will not be an empty. It will be great if you can have a look at this layer. Its information is the following

ntuple directory:

/eos/cms/store/cmst3/group/hgcal/CMG_studies/Production/_RelValSingleElectronPt35Extended_CMSSW_9_3_0_pre4-93X_upgrade2023_realistic_v0_2023D17noPU-v1_GEN-SIM-RECO/NTUP/

ntuple file:

_RelValSingleElectronPt35Extended_CMSSW_9_3_0_pre4-93X_upgrade2023_realistic_v0_2023D17noPU-v1_GEN-SIM-RECO_NTUP_1.root

event:

event number 2252, the 52th record in the ntuple.

missing layer:

the +10th layer, on the z>0 side of the detector.

Please let me know anything you see if you are looking at that layer.

ZihengChen commented 7 years ago

density This was what I saw.

artlbv commented 7 years ago

Thanks @ZihengChen, I've found the event in the ntuple, but I'll need to rerun the ntupler to debug the clustering. Meanwhile, I've added the possibility to use halo hits (close to the axis) in the PCA and shower shape computation. The results seem to have improved slightly, but no dramatic changes. See #48.

artlbv commented 7 years ago

I've put a summary and solution of this issue in these slides: https://indico.cern.ch/event/665253/contributions/2719409/attachments/1521785/2377702/hgc_halo_issue_extended.pdf