XENONnT / outsource

Job submission of reprocessing

Want to merge functionality of `events` into `combine` #150

Open FaroutYLq opened 5 months ago

FaroutYLq commented 5 months ago

It is known that, with the current structure, in SR1 there is a ~10% chance that plugins computed at the peaklets level (like `veto_interval` and `peaklets`) end up with different lengths when processing (`peaks`, `peaklet_classification`) and (`event_info`, `veto_proximity`).

We suspect something tricky happens when combining. To make these runs fail immediately when issues occur, even before we upload their combined peaklets, we want to run a load test in the combine job after computing:

```python
st.get_array(run, ("peaks", "peak_basics", "peak_positions"))
st.get_array(run, ("event_info", "cut_daq_veto"))
```

If either load fails, nothing will be uploaded.
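The proposed fail-fast control flow could look roughly like the sketch below. The helper name `load_test` and the `BrokenContext` stand-in are hypothetical, purely for illustration; in the real combine job `st` would be a configured strax/straxen context.

```python
# Hypothetical sketch of the fail-fast load test proposed above.
def load_test(st, run_id, target_groups):
    """Try to load each target group; re-raise on the first failure so
    the combine job aborts before uploading anything."""
    for targets in target_groups:
        try:
            st.get_array(run_id, targets)
        except Exception:
            print(f"Load test failed for {targets}; aborting upload.")
            raise

# Dummy stand-in for a strax context, just to show the control flow:
class BrokenContext:
    def get_array(self, run_id, targets):
        if "event_info" in targets:
            raise ValueError("Cannot merge chunks with different number of items")
        return []

failed = False
try:
    load_test(BrokenContext(), "049374",
              [("peaks", "peak_basics", "peak_positions"),
               ("event_info", "cut_daq_veto")])
except ValueError:
    failed = True  # in the real job, nothing would be uploaded here
print("upload aborted:", failed)
```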

FaroutYLq commented 5 months ago

If it is indeed something tricky happening in combine, shouldn't the problematic runs' peaklets and lone_hits always trigger trouble? Let's try erasing the failed runs down to peaklets+lone_hits, and then reprocess on RCC.

FaroutYLq commented 5 months ago

Additionally, such merging will benefit us in the sense that less time is wasted transferring between sites and searching for nodes. The downside is losing the flexibility of computing peaks/events only. I would suggest we add a `super_events` workflow instead of overwriting the existing `combine` workflow.

FaroutYLq commented 5 months ago

Test: some outputs from combine jobs keep hitting the same error; even after erasing everything above peaklets and reprocessing on dali, it keeps failing. Example: run 049374, using /scratch/midway2/yuanlq/corruption_museum/. This means something tricky indeed happened in combine. The error:

```
ValueError: Cannot merge chunks with different number of items: [[049374.peaklets: 1670689659sec 188225560 ns - 1670689702sec 137898520 ns, 83481 items, 8.2 MB/s], [049374.peaklet_classification: 1670689659sec 188225560 ns - 1670689702sec 137898520 ns, 83482 items, 0.0 MB/s]]
Python script failed with exit code 1
```
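For context, the failure mode can be illustrated in miniature: same-kind data types are merged row by row, so the row counts must match exactly, and 83481 vs 83482 items cannot be aligned. This is a toy sketch, not the actual strax merging code.

```python
import numpy as np

# Toy illustration of the error above (not the actual strax code):
# merging same-kind data requires identical row counts.
peaklets = np.zeros(83481, dtype=[("time", np.int64)])
classification = np.zeros(83482, dtype=[("type", np.int8)])

def merge_same_kind(a, b):
    if len(a) != len(b):
        raise ValueError(
            f"Cannot merge chunks with different number of items: "
            f"{len(a)} vs {len(b)}")
    # (real strax also checks chunk time ranges and dtypes before merging)

try:
    merge_same_kind(peaklets, classification)
except ValueError as e:
    print(e)
```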

Edit: it turns out an old `peaklet_classification` was causing the trouble:

```
yuanlq@dali003:/dali/lgrandi/xudc/scratch-midway2/bk/bk-0602_del/nton/Make/logs$ ls -lh /gpfs3/cap/.snapshots/weekly-2024-06-02.04h07/dali/lgrandi/xenonnt/processed/049374-peaklet_classification-p3m6pr2fhz
total 2.5K
-rw-rwxr--+ 1 yuanlq yuanlq  92M Feb  4 23:57 peaklet_classification-p3m6pr2fhz-000000
-rw-rwxr--+ 1 yuanlq yuanlq  93M Feb  4 23:57 peaklet_classification-p3m6pr2fhz-000001
-rw-rwxr--+ 1 yuanlq yuanlq  93M Feb  4 23:57 peaklet_classification-p3m6pr2fhz-000002
-rw-rwxr--+ 1 yuanlq yuanlq  52M Feb  4 23:57 peaklet_classification-p3m6pr2fhz-000003
-rw-rwxr--+ 1 yuanlq yuanlq 6.1K Feb  4 23:57 peaklet_classification-p3m6pr2fhz-metadata.json
```

FaroutYLq commented 5 months ago

Now we want to download the raw_records, process the same run from scratch, and compare the peaklets.

FaroutYLq commented 5 months ago

We did some tests on 049374. Starting from raw_records, the total length of peaklets when processing on OSG is 37572935, while on dali it is 37572938. Both can get ("peaklets", "peaklet_classification") without problems. The peaklets missing from OSG show no pattern in their timing relative to chunk boundaries. We see lots of truncation at the end, but no other waveform feature. Are they near the DAQ veto?
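One way to isolate the 3 missing peaklets is to key each peaklet on its (time, endtime) interval and diff the two outputs. The helper below is a hypothetical sketch; the field names follow the strax interval convention, and the toy arrays are made up for illustration.

```python
import numpy as np

# Hypothetical sketch: find intervals present in one site's output but
# absent from the other's, keyed on (time, endtime).
def missing_intervals(reference, other):
    ref = {(int(r["time"]), int(r["endtime"])) for r in reference}
    oth = {(int(r["time"]), int(r["endtime"])) for r in other}
    return sorted(ref - oth)

# Toy data standing in for the dali and OSG peaklets:
dtype = [("time", np.int64), ("endtime", np.int64)]
dali_peaklets = np.array([(0, 10), (20, 30), (40, 50)], dtype=dtype)
osg_peaklets = np.array([(0, 10), (40, 50)], dtype=dtype)
print(missing_intervals(dali_peaklets, osg_peaklets))  # -> [(20, 30)]
```

The timing of each missing interval relative to its chunk boundaries can then be checked directly, as done above.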

FaroutYLq commented 5 months ago

(Example waveform images attached.)

FaroutYLq commented 5 months ago

Two things need to be understood:

FaroutYLq commented 5 months ago

More investigation shows that peak splitting performs differently on different machines, suspected to be a floating-point issue. Example here.

Processed on DaLI: `max_goodness_of_split`

```
[0.3362855  0.6615487  0.19245173 0.33158863 0.7406279  0.79679215
 0.80455685 0.         0.         0.         0.5374566  0.59021497
 0.40632284 0.53420705 0.         0.3954491  0.         0.8004727
 0.46013355 0.79623634 0.48184547 0.4689006 ]
```

Processed on OSG: `max_goodness_of_split`

```
[0.4330755  0.18698996 0.3486709  0.7406338  0.79679585 0.8045561
 0.         0.         0.         0.5374318  0.5903075  0.40630853
 0.53418773 0.         0.3954804  0.         0.8004847  0.46006623
 0.7962359  0.48178786 0.46880245]
```

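The suspected mechanism can be demonstrated in isolation: float32 arithmetic is not associative, so a different evaluation order on another machine can shift results in the last few bits, the same scale of drift seen in `max_goodness_of_split` above. This is an illustration of the general effect, not the actual splitting code.

```python
import numpy as np

# float32 is not associative: reordering the same operations
# changes the result.
x = np.float32(1e8)
y = np.float32(1.0)
a = (x + y) - x  # the 1.0 is absorbed by the large intermediate value
b = (x - x) + y  # reordering keeps the small term
print(a, b)  # -> 0.0 1.0
```

If the split-goodness computation crosses such a threshold differently on two machines, a peak splits on one site and not the other, changing the peaklet count downstream.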

FaroutYLq commented 5 months ago

Maybe this architecture requirement is still not enough? This part of peak splitting involves a lot of numba magic and might be vulnerable, especially the `nogil` sections. We expect the single-threaded processor to help if this is the crux. However, keep in mind that we already require a single CPU core when processing on OSG. We might also want to check whether there is some hyper-threading effect in strax.
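One cheap check along these lines: pin every threading layer to one thread before the processing stack is imported, to test whether threading (rather than core count alone) drives the machine dependence. These environment variables are the standard knobs for numba/OpenMP/BLAS; whether setting them removes the discrepancy is exactly what would need testing.

```python
import os

# Pin all common threading layers to a single thread. This must run
# before numba/numpy-heavy modules are imported to take full effect.
pinned = ("NUMBA_NUM_THREADS", "OMP_NUM_THREADS",
          "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS")
for var in pinned:
    os.environ[var] = "1"
print({var: os.environ[var] for var in pinned})
```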

FaroutYLq commented 5 months ago

Given the machine dependence, the following scenario will trigger the issue:

FaroutYLq commented 5 months ago

More details in the tests