cta-observatory / ctapipe

Low-level data processing pipeline software for CTAO or similar arrays of Imaging Atmospheric Cherenkov Telescopes
https://ctapipe.readthedocs.org
BSD 3-Clause "New" or "Revised" License
65 stars 269 forks source link

Check uniqueness of input paths and obs_ids in merge tool #2611

Closed maxnoe closed 2 months ago

maxnoe commented 2 months ago

This is a very simple check to partially address #2610

I will add the check for obs_id in another PR to fully address #2610

maxnoe commented 2 months ago

The only real unit tests testing the merging thoroughly is merging the only two large files we have: the gamma and the proton training dl2 files.

However, due to the simulations re-using run ids for the different particles, the tool now correctly complains about duplicated obs_ids.

I am not sure how to address this, I think we should keep the test but adding an option to disable the uniqueness check just to test the that the merging works on this invalid input also seems strange.

I could modify the file on disk to have different obs_ids...

maxnoe commented 2 months ago

I could modify the file on disk to have different obs_ids...

This was pretty simple, so I opted for this approach

kosack commented 2 months ago

What happens if we do want to make a file with multiple particles types mixed in it, and they have overlapping obs_ids? I guess that requires a larger change since we assume obs_ids are unique, right? It's maybe not a very common use case, but I guess we should at least mention this in the docs of the merge tool, e.g. that we do not allow mixing of the same obs_id.

Or we just fix this in the OB/SB data model where the obs_ids are in the future supposed to have more digits pre-pended to make them more unique (north, south, simulation), so maybe in that case, we could consider defining the simulation pre-id to be something like (simulation_pre_id + particle_id)*X+obs_id so that all particles have a unique obs_id.

maxnoe commented 2 months ago

What happens if we do want to make a file with multiple particles types mixed in it, and they have overlapping obs_ids?

This is something that is not supported now in any case: the data model assumes that we have unique obs_id / event_id combinations and on this assumptions many things are build (TableLoader, StereoPrediction, etc.).

This is why we added the override_obs_id option to SimTelEventSource, so that we can change the obs_ids at the start to something unique.

kosack commented 2 months ago

This is why we added the override_obs_id option to SimTelEventSource, so that we can change the obs_ids at the start to something unique.

Yes, that's clear. So nothing to do in ctapipe, but the question is if we should update the way obs_ids are re-assigned for simulations so that each particle species is unique. That would have to just be a convention, not a software fix.

maxnoe commented 2 months ago

@GernotMaier or @orelgueta might confirm, but I think we talked about it and prod6 already has unique obs_ids, at least per site / pointing position

orelgueta commented 2 months ago

@GernotMaier or @orelgueta might confirm, but I think we talked about it and prod6 already has unique obs_ids, at least per site / pointing position

Unique obs_ids are only per site/pointing/particle. So we reuse the run numbers also for different primaries in the same pointing/site.

If I remember correctly our discussions, we realised that anyway you will have to assign a unique obs_id after the fact, so there is no reason for us to change the behaviour (which is done manually when launching the jobs so quite prone to error).

maxnoe commented 2 months ago

Ok, I don't really like adding more meaning to obs_ids... but looks like this might be the best option for now.

Including particle id automatically is easy, site probably as well. Pointing position maybe not so much.

At least for now, all our tools that need multiple species as input are designed to read from multiple files anyway.