Cta dl1 support

LukasNickel commented 4 years ago

So, this is kinda huge and also not really finished. Try it out, find bugs, whatever. There are some things that we could consider refactoring, like moving some code from the apply scripts into functions in the IO part (writing predictions, for example). The IO code is somewhat messy, because you need to merge several dataframes to get all of the information, and chunkwise reading is a bit tricky.

An overview of the status:

The ML scripts seem to work, both training and applying. Predictions are saved in a new /dl2 group. I have not checked the resulting predictions!
Internally, the same flat pandas DataFrames are used; that's where the merges come into play.
I'm still working on writing out files in the dl1 format for the cuts and train/test split scripts. Cuts seem to work without chunks. There is a bug when using chunks that leads to different event counts afterwards. Train/test still needs to be implemented.
In general, chunkwise reading is somewhat inefficient, because you still need to read the complete trigger, monitoring, ... tables.
The new configuration key data_format (options simple and CTA) replaces most of the event_key, telescope_event_key, has_multiple_telescopes, coordinate_transformation stuff. If there is a use case for e.g. flat DataFrames with CTA data, please report it and we can bring the coordinate_transformation key back.
Support for Kai's data format with `runs`, `array_events`, and `telescope_events` tables is removed. Only flat dataframes or CTA dl1 files are supported. Unit tests and examples are adapted accordingly.
I have tested this with simtel-files processed with the stage1-tool. Might need to cross check for the LST1 files and real observations with multiple pointings.
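
To illustrate the merges mentioned above, here is a minimal sketch of how a per-telescope parameter table could be flattened against an array-level trigger table. It is not the code from this PR: the function name `flatten_events` and the columns `hillas_intensity` and `time` are made up for the example, and only plain pandas is used. It also shows why chunking is awkward: the parameter table could be read in chunks, but the trigger table has to be available in full for every chunk.

```python
import pandas as pd


def flatten_events(parameters: pd.DataFrame, trigger: pd.DataFrame) -> pd.DataFrame:
    """Attach array-level trigger columns to every telescope-event row.

    Hypothetical helper for illustration only, not part of aict-tools.
    """
    return parameters.merge(
        trigger,
        on=["obs_id", "event_id"],
        how="left",                # keep every telescope event, in input order
        validate="many_to_one",    # each array event must appear once in trigger
    )


# Toy tables standing in for one telescope parameter chunk and the full trigger table
parameters = pd.DataFrame(
    {
        "obs_id": [1, 1, 1],
        "event_id": [10, 10, 11],
        "tel_id": [1, 2, 1],
        "hillas_intensity": [120.0, 95.0, 300.0],
    }
)
trigger = pd.DataFrame(
    {
        "obs_id": [1, 1],
        "event_id": [10, 11],
        "time": [0.5, 1.5],
    }
)

flat = flatten_events(parameters, trigger)
print(flat)
```

The `validate` argument makes pandas raise if the trigger table unexpectedly contains duplicate keys, which is one way to catch event-count mismatches early.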

LukasNickel commented 4 years ago

I'm still learning to work with pytables/h5py, so maybe some things can be implemented more efficiently. split_data and apply_cuts should work now, but there is no support for chunkwise reading. For the apply scripts it should work, but for now I would advise avoiding it altogether.

Tests are not failing, I will test the results next.

LukasNickel commented 4 years ago

Addressed most comments and fixed some bugs.

ToDo:

The IO code can probably be split into a read_cta and a read_simple function.
Chunks are not really tested anywhere. This will come once everything else is done. We might also want a merge_cta_files script or accept a list of files for training.
Events are read one table at a time. This means that n_events might only contain events of telescope 1 or similar.
apply_cuts should work properly on pure dl1 files. Events in the dl2 group might remain, though, even if the dl1 event was removed. Right now this only matters if you want to apply the models first and use cuts afterwards, although I can't find a reason to do that.
I haven't tested split_data thoroughly yet.
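
As a rough illustration of the n_events point above: if each telescope has its own table, the number of array events is the number of unique (obs_id, event_id) pairs across all of them, not the row count of any single table. The helper below is a hypothetical sketch on plain pandas DataFrames, not code from this PR.

```python
import pandas as pd


def count_array_events(tables):
    """Count unique (obs_id, event_id) pairs over several per-telescope tables.

    Hypothetical helper for illustration only.
    """
    keys = pd.concat(
        [table[["obs_id", "event_id"]] for table in tables],
        ignore_index=True,
    )
    return len(keys.drop_duplicates())


# Two toy telescope tables that share one array event (obs_id=1, event_id=10)
tel_1 = pd.DataFrame({"obs_id": [1, 1], "event_id": [10, 11]})
tel_2 = pd.DataFrame({"obs_id": [1, 1], "event_id": [10, 12]})

n_array_events = count_array_events([tel_1, tel_2])
print(n_array_events)
```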

LukasNickel commented 4 years ago

Apart from the apply_cuts issue everything should now be functional with cta files.

maxnoe commented 3 years ago

@LukasNickel Would be nice if we could have it for next week. Can you resolve conflicts? I will make a review.

LukasNickel commented 3 years ago

Addressed most things and also fixed some minor bugs, most notably missing user attributes. ToDo:

Add alt/az predictions? dl1_to_dl2 script? Maybe a follow up PR
Infer units from the table attributes (at least for CTA files; we will probably still keep the config keys for arbitrary files).
Hillas fov lon/lat requires additional work regarding the transformations (at least an if version > X: transform to offset frame). Most likely also for a new PR, as it's not even in the ctapipe master.
Figure out if we can ditch the datamodel_version. It's a bit weird because the config objects contain the columns to be read and are not themselves associated with an actual file.
As noted above: Stereo parameters (Probably a follow up PR)

maxnoe commented 3 years ago

I tried this locally and had a look at the code again. Some more comments:

The apply cuts script is very slow and inefficient on the CTA data, especially the set used to check for surviving events. Checking each and every obs id / event id individually is expensive, even when using a set.
Apply cuts fails when the output file already exists. It should overwrite the output file.
Loading data is also very slow. So slow that it didn't finish for the merged dl1 LST file and I had to kill it. It was stuck inside the pandas merge of the df with the pointing table. This merge shouldn't even be necessary, since you need to interpolate the pointing, not merge it.
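
A minimal sketch of what the first and last points could look like, assuming plain pandas/numpy and toy tables. The function names and the `time`, `altitude` and `azimuth` column names are assumptions for the example, not the names used in the PR or in the CTA files. The survival check is done as one vectorised merge instead of a per-event lookup, and the pointing is interpolated onto the event times with `np.interp` rather than merged.

```python
import numpy as np
import pandas as pd

KEYS = ["obs_id", "event_id"]


def surviving_mask(events: pd.DataFrame, survivors: pd.DataFrame) -> np.ndarray:
    """Boolean mask over `events` rows whose (obs_id, event_id) is in `survivors`."""
    merged = events[KEYS].merge(
        survivors[KEYS].drop_duplicates(),
        on=KEYS,
        how="left",       # keeps the row order of `events`
        indicator=True,   # adds a `_merge` column: "both" or "left_only"
    )
    return merged["_merge"].eq("both").to_numpy()


def interpolate_pointing(event_time, pointing: pd.DataFrame):
    """Linearly interpolate pointing (angles in radians) onto event times."""
    pointing = pointing.sort_values("time")
    t = pointing["time"].to_numpy()
    alt = np.interp(event_time, t, pointing["altitude"].to_numpy())
    # Unwrap azimuth first so that interpolation does not jump at 0 / 2*pi
    az = np.interp(event_time, t, np.unwrap(pointing["azimuth"].to_numpy()))
    return alt, az % (2 * np.pi)


# Toy data
events = pd.DataFrame({"obs_id": [1, 1, 2], "event_id": [10, 11, 10]})
survivors = pd.DataFrame({"obs_id": [1, 2], "event_id": [11, 10]})
mask = surviving_mask(events, survivors)

pointing = pd.DataFrame(
    {"time": [0.0, 10.0], "altitude": [1.0, 1.2], "azimuth": [0.1, 0.3]}
)
alt, az = interpolate_pointing(np.array([5.0]), pointing)
print(mask, alt, az)
```

Note that `np.interp` clamps to the first or last pointing value for event times outside the monitored range, so real data would need a check for that.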

LukasNickel commented 3 years ago

The test is now failing because the dropping of columns before applying the model does not work in the case of the apply dxdy script.

fact-project / aict-tools

Cta dl1 support #142