FaroutYLq opened 5 months ago
If it is indeed something tricky happening in combine, shouldn't the problematic runs' `peaklets` and `lone_hits` always trigger trouble? Let's try erasing the runs that failed down to `peaklets` + `lone_hits`, and then reprocess on RCC.
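For concreteness, here is a minimal sketch of what "erasing down to `peaklets` + `lone_hits`" could look like on a directory-style storage. The processed-data path and the exact list of data types to drop are assumptions for illustration, not the actual cleanup tooling:

```python
# Sketch only: remove the stored data *above* peaklets/lone_hits for a failed
# run, so that everything downstream is rebuilt on the next pass.
# PROCESSED_DIR and DTYPES_TO_DROP are assumptions for illustration.
import glob
import os
import shutil

PROCESSED_DIR = "/dali/lgrandi/xenonnt/processed"
DTYPES_TO_DROP = [
    "peaklet_classification", "merged_s2s", "peaks",
    "peak_basics", "events", "event_basics", "event_info",
]

def erase_above_peaklets(run_id):
    for dtype in DTYPES_TO_DROP:
        for path in glob.glob(os.path.join(PROCESSED_DIR, f"{run_id}-{dtype}-*")):
            print(f"Removing {path}")
            shutil.rmtree(path)

erase_above_peaklets("049374")
# Afterwards, reprocess the run (e.g. on RCC) starting from peaklets + lone_hits.
```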
If we merge the processing up to `events` into combine, we will solve the length mismatch problem for good. Additionally, such merging will benefit us in the sense that we waste less time transferring data between sites and searching for nodes. The downside will be losing the flexibility of computing peaks/events only. I would suggest we add a `super_events` workflow instead of overwriting the existing combine workflow.
Test: some outputs from combine jobs keep getting the same error; even after erasing everything above `peaklets` and reprocessing on dali, it keeps failing. Example run 049374, using `/scratch/midway2/yuanlq/corruption_museum/`. It means something tricky indeed happened in combine. The error:
`ValueError: Cannot merge chunks with different number of items: [[049374.peaklets: 1670689659sec 188225560 ns - 1670689702sec 137898520 ns, 83481 items, 8.2 MB/s], [049374.peaklet_classification: 1670689659sec 188225560 ns - 1670689702sec 137898520 ns, 83482 items, 0.0 MB/s]] Python script failed with exit code 1`
Edit: it turns out an old `peaklet_classification` was causing the trouble:
yuanlq@dali003:/dali/lgrandi/xudc/scratch-midway2/bk/bk-0602_del/nton/Make/logs$ ls -lh /gpfs3/cap/.snapshots/weekly-2024-06-02.04h07/dali/lgrandi/xenonnt/processed/049374-peaklet_classification-p3m6pr2fhz
total 2.5K
-rw-rwxr--+ 1 yuanlq yuanlq 92M Feb 4 23:57 peaklet_classification-p3m6pr2fhz-000000
-rw-rwxr--+ 1 yuanlq yuanlq 93M Feb 4 23:57 peaklet_classification-p3m6pr2fhz-000001
-rw-rwxr--+ 1 yuanlq yuanlq 93M Feb 4 23:57 peaklet_classification-p3m6pr2fhz-000002
-rw-rwxr--+ 1 yuanlq yuanlq 52M Feb 4 23:57 peaklet_classification-p3m6pr2fhz-000003
-rw-rwxr--+ 1 yuanlq yuanlq 6.1K Feb 4 23:57 peaklet_classification-p3m6pr2fhz-metadata.json
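A quick way to catch this kind of stale dependency before combine crashes is to compare the per-chunk item counts recorded in the strax metadata JSONs. This is only a sketch: it assumes the metadata layout (a `"chunks"` list with an `"n"` item count per chunk), and the paths in the usage note are hypothetical placeholders for the run above:

```python
import json

def chunk_counts(metadata_path):
    """Per-chunk item counts from a strax metadata JSON (assumes 'chunks' -> 'n')."""
    with open(metadata_path) as f:
        md = json.load(f)
    return [chunk["n"] for chunk in md["chunks"]]

def report_mismatches(path_a, path_b):
    a_counts, b_counts = chunk_counts(path_a), chunk_counts(path_b)
    if len(a_counts) != len(b_counts):
        print(f"Different number of chunks: {len(a_counts)} vs {len(b_counts)}")
    for i, (a, b) in enumerate(zip(a_counts, b_counts)):
        if a != b:
            print(f"Chunk {i}: {a} vs {b} items")

# Hypothetical usage for run 049374 (replace <hash> with the real lineage hashes):
# report_mismatches(
#     ".../049374-peaklets-<hash>/peaklets-<hash>-metadata.json",
#     ".../049374-peaklet_classification-p3m6pr2fhz/"
#     "peaklet_classification-p3m6pr2fhz-metadata.json",
# )
```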
Now we want to download `raw_records`, process the same run from scratch, and compare the `peaklets`.
We did some tests on run 049374. Starting from `raw_records`, the total length of `peaklets` is 37572935 when processing on OSG, while on DaLI it is 37572938. Both can get `("peaklets", "peaklet_classification")` without problem. The peaklets missing from OSG show no pattern in their timing relative to chunk boundaries. We see lots of truncation at the end, but no other waveform feature. Are they near a DAQ veto?
Example
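A minimal sketch of how such a comparison can be done, assuming two local copies of the processed run; the context function and output folders are assumptions:

```python
import numpy as np
import straxen

# Two contexts pointing at the two copies of the run (paths are placeholders)
st_dali = straxen.contexts.xenonnt_online(output_folder="/path/to/dali/copy")
st_osg = straxen.contexts.xenonnt_online(output_folder="/path/to/osg/copy")

pk_dali = st_dali.get_array("049374", "peaklets")
pk_osg = st_osg.get_array("049374", "peaklets")
print(len(pk_dali), len(pk_osg))  # 37572938 vs 37572935 in the test above

# Peaklets present on DaLI but missing on OSG, identified by their start time
missing_times = np.setdiff1d(pk_dali["time"], pk_osg["time"])
missing = pk_dali[np.isin(pk_dali["time"], missing_times)]

# Inspect the missing peaklets; their distance to chunk boundaries would come
# from the peaklets metadata (omitted here)
print(missing[["time", "length", "dt"]])
```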
Two things need to be understood:
More investigation shows that peak splitting behaves differently on different machines; we suspect a floating point issue. Example here.
Processed on DaLI: max_goodness_of_split
[0.3362855 0.6615487 0.19245173 0.33158863 0.7406279 0.79679215
0.80455685 0. 0. 0. 0.5374566 0.59021497
0.40632284 0.53420705 0. 0.3954491 0. 0.8004727
0.46013355 0.79623634 0.48184547 0.4689006 ]
Processed on OSG: max_goodness_of_split
[0.4330755 0.18698996 0.3486709 0.7406338 0.79679585 0.8045561
0. 0. 0. 0.5374318 0.5903075 0.40630853
0.53418773 0. 0.3954804 0. 0.8004847 0.46006623
0.7962359 0.48178786 0.46880245]
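To make the scale of the disagreement concrete, here is a small sketch comparing the two printouts above. The alignment offset (one extra split value on DaLI near the start) is my reading of the numbers, not something established elsewhere in this thread:

```python
import numpy as np

dali = np.array([0.3362855, 0.6615487, 0.19245173, 0.33158863, 0.7406279,
                 0.79679215, 0.80455685, 0., 0., 0., 0.5374566, 0.59021497,
                 0.40632284, 0.53420705, 0., 0.3954491, 0., 0.8004727,
                 0.46013355, 0.79623634, 0.48184547, 0.4689006])
osg = np.array([0.4330755, 0.18698996, 0.3486709, 0.7406338, 0.79679585,
                0.8045561, 0., 0., 0., 0.5374318, 0.5903075, 0.40630853,
                0.53418773, 0., 0.3954804, 0., 0.8004847, 0.46006623,
                0.7962359, 0.48178786, 0.46880245])

print(len(dali), len(osg))  # 22 vs 21: the two sites do not even agree on the count

# Once the differing entries at the start are skipped, the tails line up and only
# differ at the ~1e-4 level or better, consistent with a floating-point effect.
print(np.max(np.abs(dali[4:] - osg[3:])))  # ~1e-4
```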
Maybe this architecture requirement is still not enough? This part of peak splitting has too much numba magic and might be vulnerable, especially the `nogil` parts. We expect the single-threaded processor to help if this is the crux. However, keep in mind that we already require a single CPU core when processing on OSG. We might also want to check whether there is some hyper-threading effect in strax.
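As a first test of the threading hypothesis, one could pin everything to a single thread and reprocess the same run on both sites. This is only a sketch: the environment variables are standard numba/OpenMP/MKL knobs, and the context choice plus `max_workers=1` are assumptions about how we would run it:

```python
import os

# Must be set before numba/numpy are imported anywhere in the process
os.environ["NUMBA_NUM_THREADS"] = "1"
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"

import straxen

st = straxen.contexts.xenonnt_online()  # assumed context
peaklets = st.get_array("049374", "peaklets", max_workers=1)
print(len(peaklets))  # compare this count between DaLI and OSG
```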
Given the machine dependence, the following scenario will trigger the issue:
It is known that somehow with the current structure, in SR1 there is a ~10% chance that plugins computed at the peaklets level (like `veto_interval` and `peaklets`) end up having different lengths when processing (`peaks`, `peaklet_classification`) and (`event_info`, `veto_proximity`). We suspect something tricky happens when combining. To make these runs fail immediately when issues happen, even before we upload their combined peaklets, we want to add a load test in the combine job after computing (a sketch is below):
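A minimal sketch of such a load test; the helper itself is hypothetical, and the data types checked are just the ones discussed above:

```python
def check_peaklet_level_lengths(st, run_id):
    """Fail the combine job if peaklet-level outputs disagree in length."""
    n_pk = len(st.get_array(run_id, "peaklets"))
    n_pc = len(st.get_array(run_id, "peaklet_classification"))
    if n_pk != n_pc:
        raise ValueError(
            f"Run {run_id}: peaklets has {n_pk} entries but "
            f"peaklet_classification has {n_pc}; refusing to upload.")

# In the combine job, right after computing and before any upload:
# check_peaklet_level_lengths(st, "049374")
```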
Once the check fails, nothing will be uploaded.