FaroutYLq opened 5 months ago
If it is indeed something tricky happening in combine, shouldn't the problematic runs' `peaklets` and `lone_hits` always trigger trouble? Let's try erasing the runs that failed down to `peaklets` + `lone_hits`, and then reprocess on RCC.
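For concreteness, here is a minimal sketch of what "erasing down to `peaklets` + `lone_hits`" could look like on a directory-style storage. The processed-data path and the exact list of data types to drop are assumptions for illustration, not the actual cleanup tooling:

```python
# Sketch only: remove the stored data *above* peaklets/lone_hits for a failed
# run, so that everything downstream is rebuilt on the next pass.
# PROCESSED_DIR and DTYPES_TO_DROP are assumptions for illustration.
import glob
import os
import shutil

PROCESSED_DIR = "/dali/lgrandi/xenonnt/processed"
DTYPES_TO_DROP = [
    "peaklet_classification", "merged_s2s", "peaks",
    "peak_basics", "events", "event_basics", "event_info",
]

def erase_above_peaklets(run_id):
    for dtype in DTYPES_TO_DROP:
        for path in glob.glob(os.path.join(PROCESSED_DIR, f"{run_id}-{dtype}-*")):
            print(f"Removing {path}")
            shutil.rmtree(path)

erase_above_peaklets("049374")
# Afterwards, reprocess the run (e.g. on RCC) starting from peaklets + lone_hits.
```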
If we merge the processing up to `events` into combine, we will solve the length mismatch problem for good. Additionally, such merging will benefit us in the sense that we waste less time transferring data between sites and searching for nodes. The downside will be losing the flexibility of computing peaks/events only. I would suggest we add a `super_events` workflow instead of overwriting the existing combine workflow.
Test: some outputs from combine jobs keep getting the same error; even after erasing everything above `peaklets` and reprocessing on dali, it keeps failing. Example run 049374, using `/scratch/midway2/yuanlq/corruption_museum/`. It means something tricky indeed happened in combine. The error:
`ValueError: Cannot merge chunks with different number of items: [[049374.peaklets: 1670689659sec 188225560 ns - 1670689702sec 137898520 ns, 83481 items, 8.2 MB/s], [049374.peaklet_classification: 1670689659sec 188225560 ns - 1670689702sec 137898520 ns, 83482 items, 0.0 MB/s]] Python script failed with exit code 1`
Edit: it turns out an old `peaklet_classification` was causing the trouble:
yuanlq@dali003:/dali/lgrandi/xudc/scratch-midway2/bk/bk-0602_del/nton/Make/logs$ ls -lh /gpfs3/cap/.snapshots/weekly-2024-06-02.04h07/dali/lgrandi/xenonnt/processed/049374-peaklet_classification-p3m6pr2fhz
total 2.5K
-rw-rwxr--+ 1 yuanlq yuanlq 92M Feb 4 23:57 peaklet_classification-p3m6pr2fhz-000000
-rw-rwxr--+ 1 yuanlq yuanlq 93M Feb 4 23:57 peaklet_classification-p3m6pr2fhz-000001
-rw-rwxr--+ 1 yuanlq yuanlq 93M Feb 4 23:57 peaklet_classification-p3m6pr2fhz-000002
-rw-rwxr--+ 1 yuanlq yuanlq 52M Feb 4 23:57 peaklet_classification-p3m6pr2fhz-000003
-rw-rwxr--+ 1 yuanlq yuanlq 6.1K Feb 4 23:57 peaklet_classification-p3m6pr2fhz-metadata.json
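A quick way to catch this kind of stale dependency before combine crashes is to compare the per-chunk item counts recorded in the strax metadata JSONs. This is only a sketch: it assumes the metadata layout (a `"chunks"` list with an `"n"` item count per chunk), and the paths in the usage note are hypothetical placeholders for the run above:

```python
import json

def chunk_counts(metadata_path):
    """Per-chunk item counts from a strax metadata JSON (assumes 'chunks' -> 'n')."""
    with open(metadata_path) as f:
        md = json.load(f)
    return [chunk["n"] for chunk in md["chunks"]]

def report_mismatches(path_a, path_b):
    a_counts, b_counts = chunk_counts(path_a), chunk_counts(path_b)
    if len(a_counts) != len(b_counts):
        print(f"Different number of chunks: {len(a_counts)} vs {len(b_counts)}")
    for i, (a, b) in enumerate(zip(a_counts, b_counts)):
        if a != b:
            print(f"Chunk {i}: {a} vs {b} items")

# Hypothetical usage for run 049374 (replace <hash> with the real lineage hashes):
# report_mismatches(
#     ".../049374-peaklets-<hash>/peaklets-<hash>-metadata.json",
#     ".../049374-peaklet_classification-p3m6pr2fhz/"
#     "peaklet_classification-p3m6pr2fhz-metadata.json",
# )
```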
Now we want to download `raw_records`, process the same run from scratch, and compare the `peaklets`.
We did some tests on run 049374. Starting from `raw_records`, the total length of `peaklets` is 37572935 when processing on OSG, while on DaLI it is 37572938. Both can get `("peaklets", "peaklet_classification")` without problem. The peaklets missing from OSG show no pattern in their timing relative to chunk boundaries. We see lots of truncation at the end, but no other waveform feature. Are they near a DAQ veto?
Example
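A minimal sketch of how such a comparison can be done, assuming two local copies of the processed run; the context function and output folders are assumptions:

```python
import numpy as np
import straxen

# Two contexts pointing at the two copies of the run (paths are placeholders)
st_dali = straxen.contexts.xenonnt_online(output_folder="/path/to/dali/copy")
st_osg = straxen.contexts.xenonnt_online(output_folder="/path/to/osg/copy")

pk_dali = st_dali.get_array("049374", "peaklets")
pk_osg = st_osg.get_array("049374", "peaklets")
print(len(pk_dali), len(pk_osg))  # 37572938 vs 37572935 in the test above

# Peaklets present on DaLI but missing on OSG, identified by their start time
missing_times = np.setdiff1d(pk_dali["time"], pk_osg["time"])
missing = pk_dali[np.isin(pk_dali["time"], missing_times)]

# Inspect the missing peaklets; their distance to chunk boundaries would come
# from the peaklets metadata (omitted here)
print(missing[["time", "length", "dt"]])
```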
Two things need to be understood:
More investigation shows that peak splitting behaves differently on different machines; we suspect a floating point issue. Example here.
Processed on DaLI: max_goodness_of_split
[0.3362855 0.6615487 0.19245173 0.33158863 0.7406279 0.79679215
0.80455685 0. 0. 0. 0.5374566 0.59021497
0.40632284 0.53420705 0. 0.3954491 0. 0.8004727
0.46013355 0.79623634 0.48184547 0.4689006 ]
Processed on OSG: max_goodness_of_split
[0.4330755 0.18698996 0.3486709 0.7406338 0.79679585 0.8045561
0. 0. 0. 0.5374318 0.5903075 0.40630853
0.53418773 0. 0.3954804 0. 0.8004847 0.46006623
0.7962359 0.48178786 0.46880245]
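To make the scale of the disagreement concrete, here is a small sketch comparing the two printouts above. The alignment offset (one extra split value on DaLI near the start) is my reading of the numbers, not something established elsewhere in this thread:

```python
import numpy as np

dali = np.array([0.3362855, 0.6615487, 0.19245173, 0.33158863, 0.7406279,
                 0.79679215, 0.80455685, 0., 0., 0., 0.5374566, 0.59021497,
                 0.40632284, 0.53420705, 0., 0.3954491, 0., 0.8004727,
                 0.46013355, 0.79623634, 0.48184547, 0.4689006])
osg = np.array([0.4330755, 0.18698996, 0.3486709, 0.7406338, 0.79679585,
                0.8045561, 0., 0., 0., 0.5374318, 0.5903075, 0.40630853,
                0.53418773, 0., 0.3954804, 0., 0.8004847, 0.46006623,
                0.7962359, 0.48178786, 0.46880245])

print(len(dali), len(osg))  # 22 vs 21: the two sites do not even agree on the count

# Once the differing entries at the start are skipped, the tails line up and only
# differ at the ~1e-4 level or better, consistent with a floating-point effect.
print(np.max(np.abs(dali[4:] - osg[3:])))  # ~1e-4
```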
Maybe this architecture requirement is still not enough? This part of peak splitting has too much numba magic and might be vulnerable, especially the `nogil` parts. We expect the single-threaded processor to help if this is the crux. However, keep in mind that we already require a single CPU core when processing on OSG. We might also want to check whether there is some hyper-threading effect in strax.
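As a first test of the threading hypothesis, one could pin everything to a single thread and reprocess the same run on both sites. This is only a sketch: the environment variables are standard numba/OpenMP/MKL knobs, and the context choice plus `max_workers=1` are assumptions about how we would run it:

```python
import os

# Must be set before numba/numpy are imported anywhere in the process
os.environ["NUMBA_NUM_THREADS"] = "1"
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"

import straxen

st = straxen.contexts.xenonnt_online()  # assumed context
peaklets = st.get_array("049374", "peaklets", max_workers=1)
print(len(peaklets))  # compare this count between DaLI and OSG
```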
Given the machine dependence, the following scenario will trigger the issue:
It is known that somehow with the current structure, in SR1 there is a ~10% chance that plugins computed at the peaklets level (like `veto_interval` and `peaklets`) end up having different lengths when processing (`peaks`, `peaklet_classification`) and (`event_info`, `veto_proximity`). We suspect something tricky happens when combining. To make these runs fail immediately when issues happen, even before we upload their combined peaklets, we want to add a load test in the combine job after computing (a sketch is below):
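A minimal sketch of such a load test; the helper itself is hypothetical, and the data types checked are just the ones discussed above:

```python
def check_peaklet_level_lengths(st, run_id):
    """Fail the combine job if peaklet-level outputs disagree in length."""
    n_pk = len(st.get_array(run_id, "peaklets"))
    n_pc = len(st.get_array(run_id, "peaklet_classification"))
    if n_pk != n_pc:
        raise ValueError(
            f"Run {run_id}: peaklets has {n_pk} entries but "
            f"peaklet_classification has {n_pc}; refusing to upload.")

# In the combine job, right after computing and before any upload:
# check_peaklet_level_lengths(st, "049374")
```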
Once the check fails, nothing will be uploaded.