aeye-lab / pymovements

A python package for processing eye movement data
https://pymovements.readthedocs.io
MIT License
57 stars 11 forks source link

add a gaze.from_file() facade #714

Open dkrako opened 2 months ago

dkrako commented 2 months ago

Description of the problem

Depending on the type of the file we need to call either gaze.from_asc(), gaze.from_csv() or gaze.from_ipc(). For future programmatic pipeline definitions, ideally written in a more general filetype independent way, a single function would be much nicer to use instead.

Description of a solution

The difficulty lies in resolving the completely different signatures, but we could just refactor dataset.dataset_files.load_gaze_file() into gaze.from_file() and reuse the DatasetDefinition. This would be sufficient for the ongoing work on general pipelines and if desired this can be refined with additional arguments in a follow up work.

One additional difficulty lies in resolving the if preprocessed branch in dataset.dataset_files.load_gaze_file().

I would propose to use a more descriptive argument instead of preprocessed, for example auto_nest: bool, and put the additional nesting at the end of from_file() regardless of the filetype.

Proposed signature:

def from_file(
        filepath: Path,
        *,
        definition: DatasetDefinition,
        meta: dict[str, Any] | None  = None,
        read_kwargs: dict[str, Any] | None = None,
        auto_nest: bool = False,
) -> GazeDataFrame:

But then I see that passing the column name attributes from the DatasetDefinition also wouldn't work.

So basically we need to find a way to get rid of the if preprocessed branch of code: https://github.com/aeye-lab/pymovements/blob/cb9ef9571c5b24f7609928d18efc3ae2520c1d03/src/pymovements/dataset/dataset_files.py#L270-L330

A good alternative could be to move the if preprocessed branch to load_gaze_files() instead of having it in load_gaze_file(). Maybe that's the cleanest way for now, as it doesn't break any behavior and lends us some time on how we handle the nesting issue in cached preprocessed csv files. In the long run the issue could be resolved further by adding auto column detection during initialization of a GazeDataFrame.

Minimum acceptance criteria

dkrako commented 2 months ago

715 should be resolved first, and it would solve the auto nest issues and remove a large part of the if preprocessed branch.