Define API for DataLoader output

aribrill commented 5 years ago

With the addition of multiple telescope/camera types and the need to include more types of auxiliary information, the API for DataLoader should be defined more explicitly. There are three kinds of image data to return: 1) single telescope, 2) array of one telescope type with triggers, and 3) array of multiple telescope types with triggers. The practical difference between case 2 and case 3 restricted to a single telescope type is that the name of the data array can be generic in case 2 (telescope_data) but not in case 3. The removal of case 2 in #74 is the cause of the difficulty noted in #81, and it should be restored.

In addition, there are two kinds of auxiliary data: telescope-based (anything included in the ImageExtractor Array Info table, in practice just positions) and event-based (everything included in the ImageExtractor Event Info table). Note that any information used as event-based auxiliary data could be used as a label and vice-versa.

The telescope positions are currently normalized in DataLoader, but this is a processing step that may not always be desired. This functionality should be moved to DataProcessor and made a configuration option.
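
As a rough illustration of what an optional normalization step in DataProcessor could look like, here is a minimal sketch. The function name, the `enabled` flag, and the normalization scheme itself (centering on the array mean and scaling by the largest offset per axis) are all assumptions made for this example, not the current DataLoader behavior.

```python
import numpy as np


def normalize_positions(positions, enabled=True):
    """Optionally normalize telescope positions (hypothetical helper).

    positions: array-like of shape (num_tels, 3) holding x, y, z coordinates.
    enabled: stands in for a configuration option; when False the
        positions are returned unchanged.
    """
    positions = np.asarray(positions, dtype=np.float32)
    if not enabled:
        return positions
    # Center each axis on the mean position of the array.
    centered = positions - positions.mean(axis=0)
    # Scale each axis by its largest absolute offset (assumed scheme).
    scale = np.abs(centered).max(axis=0)
    # Avoid dividing by zero on axes where every telescope is identical.
    scale[scale == 0] = 1.0
    return centered / scale
```

With something like this, turning normalization off would only require setting the flag in the configuration rather than editing DataLoader.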

Then, the following configuration parameters in Data:Loading need to be _added_, changed, or ~~removed~~ (the parameter names are open to change):

_example_type_: single_tel, array_one_tel_type, array_multiple_tel_types
_selected_tel_types_: tel_type for single_tel or array_one_tel_type, or list of tel_types for array_multiple_tel_types
telescope_auxiliary_data: None, positions
event_auxiliary_data: None or a list of desired auxiliary data
labels: None or a list of desired labels
~~merge_tel_types~~: confusing and unnecessary to support

All combinations are valid and should be supported except for single tel examples with telescope auxiliary data. At least the following event auxiliary data / labels should be supported: particle id (aka gamma_hadron_label), MC energy (for signal efficiency #58 and energy reconstruction #67), event zenith and azimuth (angular reconstruction #67).
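
The validity rules above could be checked with something like the following sketch. It assumes the Data:Loading section has already been parsed into a plain dict and that the keys are spelled `example_type`, `selected_tel_types`, and `telescope_auxiliary_data`; the function name is hypothetical and the parameter names remain open to change.

```python
EXAMPLE_TYPES = ("single_tel", "array_one_tel_type", "array_multiple_tel_types")


def validate_loading_config(config):
    """Check a Data:Loading dict against the proposed rules (hypothetical)."""
    example_type = config["example_type"]
    if example_type not in EXAMPLE_TYPES:
        raise ValueError("Unknown example_type: {}".format(example_type))

    selected = config["selected_tel_types"]
    if example_type == "array_multiple_tel_types":
        # Multiple telescope types must be given as a non-empty list.
        if not isinstance(selected, (list, tuple)) or not selected:
            raise ValueError("selected_tel_types must be a non-empty list")
    elif not isinstance(selected, str):
        # single_tel and array_one_tel_type take exactly one tel_type.
        raise ValueError("selected_tel_types must be a single tel_type")

    tel_aux = config.get("telescope_auxiliary_data")
    if tel_aux not in (None, "positions"):
        raise ValueError("telescope_auxiliary_data must be None or 'positions'")
    # The one unsupported combination named in the proposal.
    if example_type == "single_tel" and tel_aux is not None:
        raise ValueError(
            "single_tel examples cannot include telescope auxiliary data")

    return config
```

Failing early like this would keep the unsupported combination from surfacing later as a shape mismatch inside a model.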

DataLoader should return a structure describing the examples, mapping the index of each output array to its name, shape (from DataProcessor), dtype, and output type (data or label). This structure should be included in the metadata, where it would subsume some current parameters such as image_shapes and total_aux_params. It would be used both by models to parse the data arrays and by run_model.py to construct the input_fn, eliminating the need to access DataLoader attributes directly (#46). It may also make sense to eliminate the DataLoader.get_auxiliary_data() method, since it would be somewhat ambiguous, and replace it with a method to explicitly return the telescope positions.
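
To make the proposed structure concrete, here is one possible shape for it, as a sketch only. `OutputSpec` and `split_example` are hypothetical names, and the array names and shapes used in the usage note are placeholders rather than values taken from DataProcessor.

```python
from typing import NamedTuple, Tuple


class OutputSpec(NamedTuple):
    """Describes one output array; its list position is the output index."""
    name: str
    shape: Tuple[int, ...]
    dtype: str
    output_type: str  # either "data" or "label"


def split_example(example, description):
    """Split one example into (features, labels) dicts keyed by name.

    example: sequence of arrays in the order produced by DataLoader.
    description: list of OutputSpec, one per output array.
    """
    features, labels = {}, {}
    for index, spec in enumerate(description):
        target = features if spec.output_type == "data" else labels
        target[spec.name] = example[index]
    return features, labels
```

For example, a description such as `[OutputSpec("telescope_data", (4, 108, 108, 1), "float32", "data"), OutputSpec("gamma_hadron_label", (), "int64", "label")]` would let run_model.py build the input_fn and let a model look up its inputs by name, without reading DataLoader attributes directly.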

aribrill commented 5 years ago

I propose we discuss and work on this during/after next week's workshop.

aribrill commented 5 years ago

This issue is currently being worked on by developing a reader in Dl1DataHandler and once it's ready, migrating to it.

ABHIJIT-13 commented 5 years ago

Hello, I am interested in contributing to the regression project under GSoC 2019. I believe my skill set matches the required one. I was unable to get in contact with the mentors, as no one in the Gitter lobby knows the Slack room for this project. Any help would be appreciated. Thank you.

ctlearn-project / ctlearn

Define API for DataLoader output #82