XENON1T / pax

The XENON1T raw data processor [deprecated]
BSD 3-Clause "New" or "Revised" License

First look at Kr83m with PAX - Improvement of clustering needed #158

Closed: alexfieg closed this issue 9 years ago

alexfieg commented 9 years ago

Hey,

I hope opening an issue for this is the right way to start a discussion; if not, please move it or tell me where to go ;) With a lot of help from Bart (and Dominik from MPIK) I could start looking into samples of Kr83m data taken at Xe100 in April this year. The data discussed here is part of xe100_150429_0515, which I analyzed a bit with XeRawDP and wanted to compare with PAX. Just as a short reminder: Kr83m decays in 2(!) steps and therefore produces 2 S1 signals, with the intermediate state having a half-life of ~154 ns. The resolution, i.e. the smallest separation of these two signals that can still be resolved, is set by the peakfinder and ultimately by the DAQ itself. XeRawDP resolves two S1 signals down to roughly 200 ns, and I want to help PAX perform better than that. (I know this is a special analysis, not so much related to the dark matter search, but it gives a good handle on the S1 treatment as a first step.)
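
Just to quantify why this resolution matters so much, here is a quick back-of-the-envelope sketch (my own estimate, not a measurement; it only assumes a purely exponential decay of the intermediate state with the ~154 ns half-life quoted above):

    # Fraction of Kr83m decays whose two S1s are closer than a given separation,
    # assuming an exponential decay of the intermediate state (half-life ~154 ns).
    half_life = 154.0  # ns

    def fraction_below(separation_ns):
        return 1.0 - 2.0 ** (-separation_ns / half_life)

    for cut in (200.0, 500.0):
        print("dt < %.0f ns: %.0f%% of decays" % (cut, 100 * fraction_below(cut)))
    # -> roughly 59% of decays fall below 200 ns and 90% below 500 ns

So a peakfinder that only resolves separations above 500 ns keeps only roughly 10% of the Kr83m sample, while a 200 ns resolution keeps about 40%.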

OK, let's come to the method and results. I processed 305 .XED files of one dataset with XED.ini and used the HDF5 output. First of all I defined a second set of fields to distinguish the different S1s by their area (main_peaks & secondary_peaks):

    # Context: peaks_per_event and events come from the HDF5 output loaded earlier;
    # this block runs once per peak type pt (here pt = b's1'), and ignore_fields
    # lists the peak fields I don't copy into the event array.
    import numpy as np
    from numpy.lib.recfunctions import append_fields
    from tqdm import tqdm

    main_peaks = []        # largest peak of this type in each event
    secondary_peaks = []   # second-largest peak of this type in each event
    for pks in tqdm(peaks_per_event, desc='Selecting %ss' % pt.decode()):
        peaks_of_this_type = pks[pks['type'] == pt]
        # largest-area peak (assumes every event has at least two peaks of this type)
        main_peaks.append(peaks_of_this_type[np.argmax(peaks_of_this_type['area'])])
        # remove the largest peak, then take the largest of what remains
        peaks_of_this_type = peaks_of_this_type[peaks_of_this_type['area'] < peaks_of_this_type['area'].max()]
        secondary_peaks.append(peaks_of_this_type[np.argmax(peaks_of_this_type['area'])])

    main_peaks = np.array(main_peaks)
    assert len(main_peaks) == len(peaks_per_event)
    secondary_peaks = np.array(secondary_peaks)
    assert len(secondary_peaks) == len(peaks_per_event)

    # copy the main peak's properties into fields like 's1_area', 's1_hit_time_mean', ...
    for fn in main_peaks.dtype.names:
        if fn in ignore_fields:
            continue
        events = append_fields(events,
                               "%s_%s" % (pt.decode(), fn),
                               main_peaks[fn])

    # ... and the secondary peak's properties into 's1_1_area', 's1_1_hit_time_mean', ...
    for fn in secondary_peaks.dtype.names:
        if fn in ignore_fields:
            continue
        events = append_fields(events,
                               "%s_1_%s" % (pt.decode(), fn),
                               secondary_peaks[fn])

Following this I defined the time separation of these two peaks:

events = append_fields(events, 'sep_time', events['s1_1_hit_time_mean'] - events['s1_hit_time_mean'])

If I demand that the second s1 follows the first s1 by applying a simple cut

(events['s1_hit_time_mean'] < events['s1_1_hit_time_mean']) 

the following distribution results for 55,000 of the originally 305,000 events:

[Figure s1dt: distribution of the S1 separation time]
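
(For reference, a minimal sketch of how such a distribution can be histogrammed from the fields defined above; the binning and range here are arbitrary illustration values:)

    # Sketch: histogram the S1 separation time for events where the secondary S1
    # comes after the main S1 (binning/range chosen only for illustration).
    import matplotlib.pyplot as plt

    ordered = events[events['s1_hit_time_mean'] < events['s1_1_hit_time_mean']]
    plt.hist(ordered['sep_time'], bins=100, range=(0, 3000))
    plt.xlabel('S1 separation time [ns]')
    plt.ylabel('Events')
    plt.show()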

As you can see, the separation is quite bad: it only starts at about 500 ns (*). Therefore I checked a couple of waveforms and found, for example, this one:

[Figure s1_separation_pax500ns_zoom_4: zoom on a waveform with two separate S1s]

Here you have a clear Kr83m event with two separate S1 signals, but PAX identifies them as only one big S1. To me it looks like there is a 500 ns window in which PAX clusters hits into an S1 and has problems recognizing the drop in between, but you can probably tell me more about how this works. As far as I can see, the clustering is done before identification and is somehow optimized for S2s, which could cause these problems? (I will have a deeper look into the .ini file later.) Additionally, I also found some peaks tagged "unknown" that are most likely two S1 signals which were merged into one but exceeded the 500 ns window, e.g.

[Figure s1_unknown_missinterpre_pax: merged double S1 classified as "unknown"]

So this basically has the same cause as the misidentification above.

Taking this into account, I think the clustering and identification for S1s works much like that for S2s and is therefore not specific (narrow) enough, at least for a Kr83m analysis. Of course, for the dark matter search a second S1 is not expected in this time window, but I think it is worth optimizing the identification and classification over a broad energy range, from background to high-energy S1s, to keep the acceptance high. Especially given the very good noise suppression of PAX compared to XeRawDP for Run14, I think it is worth investing (at least my) time to investigate and improve PAX for this kind of search as well, and then draw conclusions on the whole picture. If you don't mind I would keep working on this.

Summarizing: for an S1-based analysis the clustering does not seem to work well yet, but I think developing a smart, adaptive approach is worth the effort. My final question is whether I did something wrong (e.g. chose the wrong .ini), or whether this is simply something that still needs some work?

(*) In the first plot one may notice a small population close to 0 ns, before the gap up to 500 ns. My current guess is that this is caused by a "veto S1", but I will investigate this further and open another discussion if needed.

JelleAalbers commented 9 years ago

Hi Alex,

Thanks for your thorough report on this issue. Indeed it looks like the clustering is performing poorly on these kinds of peaks. We have not spent much time optimizing the clustering, let alone the classification -- partly because we don't want to over-optimize on Xenon100, partly because we have simple algorithms that work OK for testing purposes, but also because we just haven't gotten around to it. So there's definitely an opportunity here if you want to get involved with pax development!

The clustering algorithm that's currently the default in pax (GapSize) searches for large gaps between hits. The relevant settings are in Xenon100.ini:

[Cluster.GapSize]
# If there is a gap between hits larger than this, it will make a new cluster
large_gap_threshold = 450 * ns

# If the area in a cluster is larger than this, it is certainly not a single electron, so we can use...
transition_point = 50  #pe

# ... a tighter gap threshold:
small_gap_threshold = 100 * ns

It starts with the 450 ns threshold and, as it adds more hits to a cluster (moving forward in time), at some point switches to the lower 100 ns threshold. Chris has a nice idea to make this shift gradual (see #129), but we haven't gotten around to coding it yet.
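
To make that concrete, here is a minimal sketch of the gap-size idea as described above (my own simplified version, not pax's actual implementation; hits are taken to be (time, area) pairs already sorted by time):

    def gap_size_clusters(hits, large_gap=450.0, small_gap=100.0, transition_point=50.0):
        """Group (time_ns, area_pe) hits, sorted by time, into clusters.

        A new cluster starts whenever the gap to the previous hit exceeds the
        current threshold; the threshold tightens from large_gap to small_gap
        once the area collected so far in the cluster exceeds transition_point.
        """
        clusters, current, current_area, last_time = [], [], 0.0, None
        for time, area in hits:
            threshold = small_gap if current_area > transition_point else large_gap
            if last_time is not None and time - last_time > threshold:
                clusters.append(current)
                current, current_area = [], 0.0
            current.append((time, area))
            current_area += area
            last_time = time
        if current:
            clusters.append(current)
        return clusters

With these defaults, two small S1s whose hits are less than 450 ns apart would land in the same cluster, which would fit the ~500 ns floor in your separation plot.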

I think there are a few things you/we can do, in order of increasing difficulty:

1) Change the settings, in particular lower the gap threshold. You could eventually make an ini file specifically for the Kr83m analysis. However, if we do this, S2s will more likely get chopped into pieces, as you point out. We have some infrastructure to find an optimal threshold based on simulated waveforms (PeakFinderTest in XeAnalysisScripts), and it may well be that the current threshold is simply poorly balanced. It may also be that even a nicely balanced threshold is still quite poor...

2) Try MeanShift clustering. Chris made a mean-shift clustering algorithm which may do a lot better on your S1s. You can activate it by changing Cluster.GapSize to Cluster.MeanShift in the plugin list. It's a bit slow though, and has problems at very high energies / very few PMTs (as it treats hits as points instead of intervals).

3) Develop a new clustering algorithm. Chris and I were thinking of using either kernel density estimation or, like Xerawdp, looking for local minima in the sum waveform (but we would use the hits-only sum waveform: the black trace in the waveform plot); a rough sketch of what that could look like follows after this list. The former may have the same problem as MeanShift at high energies; the latter would mean you can no longer redo the clustering during partial reprocessing: you'd need the raw data.

4) Think deeply about a way to join clustering and classification. We've been too scared to try this, maybe for good reasons. It would complicate the processing quite a bit (pre-classification? pre-clustering?), but since S1s and S2s are quite different beasts it could help us out a lot. Keep in mind, though, that there are also S1s and S2s close together, weird cases like clusters of S2s, clusters of S1s (e.g. your Kr analysis!), and unknown things as well... so it's not trivial. One way might be to enhance pax's classification to label all and only 2-S1 clusters as 'double_s1', then hand those to a specialized declustering plugin.
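
Here is the sketch referred to in option 3: a simple local-minimum split (illustration only; the ratio cut is a made-up tuning parameter, and waveform stands for the hits-only sum waveform of one cluster):

    import numpy as np

    def find_split_index(waveform, min_ratio=2.0):
        """Index of the most significant interior valley: the sample where both
        sides still contain a maximum at least min_ratio times higher than the
        valley itself. Returns None if there is no such dip."""
        w = np.asarray(waveform, dtype=float)
        if len(w) < 3:
            return None
        suffix_max = np.maximum.accumulate(w[::-1])[::-1]    # max of w[i:] for every i
        best_i, best_ratio, left_max = None, min_ratio, w[0]
        for i in range(1, len(w) - 1):
            left_max = max(left_max, w[i - 1])               # max of w[:i]
            ratio = min(left_max, suffix_max[i + 1]) / max(w[i], 1e-9)
            if ratio > best_ratio:
                best_i, best_ratio = i, ratio
        return best_i  # split the cluster's hits at this sample, or keep it whole

How deep a dip has to be before we split would of course have to be tuned, e.g. against the waveform simulator, rather than set by a hard-coded ratio.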

Jelle

tunnell commented 9 years ago

How do you want the clustering to work? One key thing to know is that we now have a waveform simulator that measures how well we split S1s, S2s, and S1/S2 pairs. In XeRawDP it wasn't known how the S1 finding performed as a function of time difference. We can optimize how this clustering works and know exactly how it performs for acceptance estimation. The philosophical issue we're facing is that S1s near one another look like an S2. We can either call the subpieces S1s as well, call the whole thing an S2, or call it both and try to sort it out later. The "unknown" category appears when it's hard for us to decide whether something is an S1 or an S2. This is all tunable though.

alexfieg commented 9 years ago

Thanks guys for your answers, comments, information and suggestions.

I think the Kr83m data is a nice basis for comparison with XeRawDP and also a performance test for S1s and S2s (since two fast S2s can also follow each other). I will also have a look into the waveform simulator (nice tool!). Of course a dedicated config for Kr83m makes sense for this particular analysis, but with Jelle's suggestions I can maybe work towards more general methods and statements for the whole picture by looking deeper into the clustering and identification. It's a lot of fun stuff to do, at least ;) I think I can close this here and start a new discussion once I have something to work with.

tunnell commented 9 years ago

We should probably just have a call at some point
