add support for event-only datasets

dkrako commented 2 months ago

Description of the problem

There are a lot of datasets published that only include event data and no raw gaze samples.

It would be great if the dataset library could also support these.

Description of a solution

The DatasetDefinition should probably extended with a new attribute that signifies the type of data to expect.

My first proposal would be datatype: Literal['raw', 'events']. Let's brainstorm if we can find a better name.

Additionally, the Dataset.load() method must be changed accordingly:

https://github.com/aeye-lab/pymovements/blob/cb9ef9571c5b24f7609928d18efc3ae2520c1d03/src/pymovements/dataset/dataset.py#L77-L135

Currently, it always runs Dataset.load_gaze_files(), but this should then be dependent on DatasetDefinition.datatype. Also, events should be set to events: bool | None = None and later assume a boolean value dependent on DatasetDefinition.datatype

I propose the following signature:

    def load(
            self,
            *,
            gaze: bool | None = None,
            events: bool | None = None,
            preprocessed: bool = False,
            subset: dict[str, float | int | str | list[float | int | str]] | None = None,
            events_dirname: str | None = None,
            preprocessed_dirname: str | None = None,
            extension: str = 'feather',
    ) -> Dataset:

This would be backwards compatible:

if dataset.definition.datatype == 'raw' and gaze is None:
    gaze = True
else:
    gaze = False

if dataset.definition.datatype == 'events' and events is None:
    events = True
else:
    events = False

Minimum acceptance criteria

[ ] add argument to DatasetDefinition to indicate type of dataset
[ ] adjust Dataset.load() default values to decide loading gaze or events during runtime

dkrako commented 2 months ago

maybe instead of defining the dataset type via datatype, it would make more sense to define properties like has_gaze_files and has_event_files (and even has_stimulus_files)? Maybe this could be more flexible in the future for further enhancements (think of has_eeg_files).

SiQube commented 2 months ago

following discussion point:

we already have events in the load definition:

 def load(
        self,
        events: bool = False,
        preprocessed: bool = False,
        subset: dict[str, float | int | str | list[float | int | str]] | None = None,
        events_dirname: str | None = None,
        preprocessed_dirname: str | None = None,
        extension: str = 'feather',
 ) -> Dataset:

should I change those to has_gaze_files, has_eventsfiles, has... ?

dkrako commented 2 months ago

I think the signature of Dataset.load() is a bit different. While DatasetDefinition.has_event_files indicates if there are event files in the downloaded dataset to begin with, the events argument in Dataset.load() signifies if event data should be loaded (these can be events detected by pymovements and then saved with Dataset.save()).

I wouldn't rename the load() arguments (for now), as we don't need that breaking change. We can discuss if we want to rename/deprecate the argument in a followup PR.

I would propose that the events argument can assume three values:

True: try to load event files
False don't load any event files
None or 'auto': if DatasetDefinition.has_event_files is True, load event files, if False don't load any event files

aeye-lab / pymovements