aeye-lab / pymovements

A Python package for processing eye movement data
https://pymovements.readthedocs.io
MIT License

test: Add integration tests for public datasets #591

Closed dkrako closed 9 months ago

dkrako commented 9 months ago

Description

A first version to try out downloading and processing public datasets

This should fail until #517 is fixed

We have to find a solution so that the integration tests run only rarely. (What about running them only when publishing new releases? A hypothetical trigger is sketched below.)
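For illustration, a dedicated GitHub Actions workflow could be triggered only on release events. This is a minimal, hypothetical sketch, not part of this PR; the workflow name, Python version, and steps are placeholders:

```yaml
# .github/workflows/integration.yml (hypothetical sketch)
name: integration-tests
on:
  release:
    types: [published]  # run integration tests only when a release is published
jobs:
  integration:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: '3.9'
      - run: pip install tox
      - run: tox -e integration
```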

For now I would just add --ignore=tests/integration under [tool.pytest.ini_options] in pyproject.toml, because we definitely do not want to preprocess all datasets on every commit; this would take forever (even if they are cached).
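A minimal sketch of that pyproject.toml entry, assuming pytest's standard addopts mechanism (if [tool.pytest.ini_options] already sets addopts, the flag would be appended to the existing value):

```toml
[tool.pytest.ini_options]
addopts = "--ignore=tests/integration"
```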

dkrako commented 9 months ago

The GitHub runners are failing the integration tests. I see three potential reasons:

  1. Dataset.load() loads the complete dataset into memory; if the runners have too little working memory, they will fail.
  2. Downloading all datasets could exceed the available disk space.
  3. Running the tests could simply take too long.

Point 1 would be solved once we implement batched preprocessing: instead of keeping all dataset files in memory, we would process N files at a time, so only N files need to be held in memory at any point (see the sketch below).
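A rough sketch of the batching idea. The helpers load_gaze_file() and preprocess() and the list gaze_file_paths are hypothetical stand-ins for the actual pymovements routines, not the current Dataset API:

```python
from itertools import islice


def iter_batches(paths, batch_size):
    """Yield lists of at most batch_size items from paths."""
    iterator = iter(paths)
    while batch := list(islice(iterator, batch_size)):
        yield batch


# Demo with stand-in helpers; in pymovements these would be the real
# file-loading and preprocessing routines.
def load_gaze_file(path):
    return f'frame({path})'


def preprocess(frame):
    print('processing', frame)


gaze_file_paths = [f'subject_{i:02d}.csv' for i in range(20)]
for batch in iter_batches(gaze_file_paths, batch_size=8):
    frames = [load_gaze_file(path) for path in batch]  # only N files in memory
    for frame in frames:
        preprocess(frame)
    del frames  # release the batch before loading the next one
```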

If it's point 2 or 3, we have a problem.

I'm running the integration tests locally on our DGX; let's see what output we get.

dkrako commented 9 months ago

We have one expected failure on GazeBase and one unexpected failure on SB-Sat.

Output

=============================================================================================== test session starts ===============================================================================================
platform linux -- Python 3.9.12, pytest-6.2.5, py-1.11.0, pluggy-1.0.0
rootdir: /home/krakowczyk/workspace/pymovements, configfile: pyproject.toml
plugins: anyio-3.7.1, dash-2.11.1, lazy-fixture-0.6.3, hydra-core-1.3.2, cov-3.0.0
collected 8 items                                                                                                                                                                                                 

tests/integration/public_dataset_processing_test.py .F...F..                                                                                                                                                [100%]

==================================================================================================== FAILURES =====================================================================================================
____________________________________________________________________________________ test_public_dataset_processing[GazeBase] _____________________________________________________________________________________

dataset_name = 'GazeBase', tmp_path = PosixPath('/tmp/pytest-of-krakowczyk/pytest-0/test_public_dataset_processing1')

    @pytest.mark.parametrize(
        'dataset_name',
        list(pm.dataset.DatasetLibrary.definitions.keys()),
    )
    def test_public_dataset_processing(dataset_name, tmp_path):
        # Initialize dataset.
        dataset_path = tmp_path / dataset_name
        dataset = pm.Dataset(dataset_name, path=dataset_path)

        # Download and load in dataset.
        dataset.download()
        dataset.load()

        # Do some basic transformations.
        if 'pixel' in dataset.gaze[0].columns:
            dataset.pix2deg()
>       dataset.pos2vel()

dataset    = <pymovements.dataset.dataset.Dataset object at 0x7f205c77dd30>
dataset_name = 'GazeBase'
dataset_path = PosixPath('/tmp/pytest-of-krakowczyk/pytest-0/test_public_dataset_processing1/GazeBase')
tmp_path   = PosixPath('/tmp/pytest-of-krakowczyk/pytest-0/test_public_dataset_processing1')

tests/integration/public_dataset_processing_test.py:42: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <pymovements.dataset.dataset.Dataset object at 0x7f205c77dd30>, method = 'fivepoint', verbose = True, kwargs = {}

    def pos2vel(
            self,
            method: str = 'fivepoint',
            *,
            verbose: bool = True,
            **kwargs: Any,
    ) -> Dataset:
        """Compute gaze velocites in dva/s from dva coordinates.

        This method requires a properly initialized :py:attr:`~.Dataset.experiment` attribute.

        After success, the gaze dataframe is extended by the resulting velocity columns.

        Parameters
        ----------
        method : str
            Computation method. See :func:`~transforms.pos2vel()` for details, default: 'fivepoint'.
        verbose : bool
            If True, show progress of computation.
        **kwargs
            Additional keyword arguments to be passed to the :func:`~transforms.pos2vel()` method.

        Raises
        ------
        AttributeError
            If `gaze` is None or there are no gaze dataframes present in the `gaze` attribute, or
            if experiment is None.

        Returns
        -------
        Dataset
            Returns self, useful for method cascading.
        """
>       return self.apply('pos2vel', method=method, verbose=verbose, **kwargs)

kwargs     = {}
method     = 'fivepoint'
self       = <pymovements.dataset.dataset.Dataset object at 0x7f205c77dd30>
verbose    = True

src/pymovements/dataset/dataset.py:393: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <pymovements.dataset.dataset.Dataset object at 0x7f205c77dd30>, function = 'pos2vel', verbose = True, kwargs = {'method': 'fivepoint'}, disable_progressbar = False
gaze = <pymovements.gaze.gaze_dataframe.GazeDataFrame object at 0x7f205cbf8460>

    def apply(
            self,
            function: str,
            *,
            verbose: bool = True,
            **kwargs: Any,
    ) -> Dataset:
        """Apply preprocessing method to all GazeDataFrames in Dataset.

        Parameters
        ----------
        function: str
            Name of the preprocessing function to apply.
        verbose : bool
            If True, show progress bar of computation.
        kwargs:
            kwargs that will be forwarded when calling the preprocessing method.

        Examples
        --------
        Let's load in our dataset first,
        >>> import pymovements as pm
        >>>
        >>> dataset = pm.Dataset("ToyDataset", path='toy_dataset')
        >>> dataset.download()# doctest:+ELLIPSIS
        Downloading ... to toy_dataset...downloads...
        Checking integrity of ...
        Extracting ... to toy_dataset...raw
        <pymovements.dataset.dataset.Dataset object at ...>
        >>> dataset.load()# doctest:+ELLIPSIS
        <pymovements.dataset.dataset.Dataset object at ...>

        Use apply for your gaze transformations:
        >>> dataset.apply('pix2deg')# doctest:+ELLIPSIS
        <pymovements.dataset.dataset.Dataset object at ...>

        >>> dataset.apply('pos2vel', method='neighbors')# doctest:+ELLIPSIS
        <pymovements.dataset.dataset.Dataset object at ...>

        Use apply for your event detection:
        >>> dataset.apply('ivt')# doctest:+ELLIPSIS
        <pymovements.dataset.dataset.Dataset object at ...>

        >>> dataset.apply('microsaccades', minimum_duration=8)# doctest:+ELLIPSIS
        <pymovements.dataset.dataset.Dataset object at ...>
        """
        self._check_gaze_dataframe()

        disable_progressbar = not verbose
        for gaze in tqdm(self.gaze, disable=disable_progressbar):
>           gaze.apply(function, **kwargs)

disable_progressbar = False
function   = 'pos2vel'
gaze       = <pymovements.gaze.gaze_dataframe.GazeDataFrame object at 0x7f205cbf8460>
kwargs     = {'method': 'fivepoint'}
self       = <pymovements.dataset.dataset.Dataset object at 0x7f205c77dd30>
verbose    = True

src/pymovements/dataset/dataset.py:287: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <pymovements.gaze.gaze_dataframe.GazeDataFrame object at 0x7f205cbf8460>, function = 'pos2vel', kwargs = {'method': 'fivepoint'}

    def apply(
            self,
            function: str,
            **kwargs: Any,
    ) -> None:
        """Apply preprocessing method to GazeDataFrame.

        Parameters
        ----------
        function: str
            Name of the preprocessing function to apply.
        kwargs:
            kwargs that will be forwarded when calling the preprocessing method.
        """
        if transforms.TransformLibrary.__contains__(function):
>           self.transform(function, **kwargs)

function   = 'pos2vel'
kwargs     = {'method': 'fivepoint'}
self       = <pymovements.gaze.gaze_dataframe.GazeDataFrame object at 0x7f205cbf8460>

src/pymovements/gaze/gaze_dataframe.py:252: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <pymovements.gaze.gaze_dataframe.GazeDataFrame object at 0x7f205cbf8460>, transform_method = <function pos2vel at 0x7f1fbeda29d0>
kwargs = {'method': 'fivepoint', 'n_components': 2, 'sampling_rate': 1000}, method_kwargs = ['sampling_rate', 'method', 'n_components', 'degree', 'window_length', 'padding', ...]

    def transform(
            self,
            transform_method: str | Callable[..., pl.Expr],
            **kwargs: Any,
    ) -> None:
        """Apply transformation method."""
        if isinstance(transform_method, str):
            transform_method = transforms.TransformLibrary.get(transform_method)

        if transform_method.__name__ == 'downsample':
            downsample_factor = kwargs.pop('factor')
            self.frame = self.frame.select(
                transforms.downsample(
                    factor=downsample_factor, **kwargs,
                ),
            )

        else:
            method_kwargs = inspect.getfullargspec(transform_method).kwonlyargs
            if 'origin' in method_kwargs and 'origin' not in kwargs:
                self._check_experiment()
                assert self.experiment is not None
                kwargs['origin'] = self.experiment.screen.origin

            if 'screen_resolution' in method_kwargs and 'screen_resolution' not in kwargs:
                self._check_experiment()
                assert self.experiment is not None
                kwargs['screen_resolution'] = (
                    self.experiment.screen.width_px, self.experiment.screen.height_px,
                )

            if 'screen_size' in method_kwargs and 'screen_size' not in kwargs:
                self._check_experiment()
                assert self.experiment is not None
                kwargs['screen_size'] = (
                    self.experiment.screen.width_cm, self.experiment.screen.height_cm,
                )

            if 'distance' in method_kwargs and 'distance' not in kwargs:
                self._check_experiment()
                assert self.experiment is not None

                if 'distance' in self.frame.columns:
                    kwargs['distance'] = 'distance'

                    if self.experiment.screen.distance_cm:
                        warnings.warn(
                            "Both a distance column and experiment's "
                            'eye-to-screen distance are specified. '
                            'Using eye-to-screen distances from column '
                            "'distance' in the dataframe.",
                        )
                elif self.experiment.screen.distance_cm:
                    kwargs['distance'] = self.experiment.screen.distance_cm
                else:
                    raise AttributeError(
                        'Neither eye-to-screen distance is in the columns of the dataframe '
                        'nor experiment eye-to-screen distance is specified.',
                    )

            if 'sampling_rate' in method_kwargs and 'sampling_rate' not in kwargs:
                self._check_experiment()
                assert self.experiment is not None
                kwargs['sampling_rate'] = self.experiment.sampling_rate

            if 'n_components' in method_kwargs and 'n_components' not in kwargs:
                self._check_n_components()
                kwargs['n_components'] = self.n_components

            if transform_method.__name__ in {'pos2vel', 'pos2acc'}:
                if 'position' not in self.frame.columns and 'position_column' not in kwargs:
                    if 'pixel' in self.frame.columns:
                        raise pl.exceptions.ColumnNotFoundError(
                            "Neither 'position' is in the columns of the dataframe: "
                            f'{self.frame.columns} nor is the position column specified. '
                            "Since the dataframe has a 'pixel' column, consider running "
                            f'pix2deg() before {transform_method.__name__}(). If you want '
                            'to calculate pixel transformations, you can do so by using '
                            f"{transform_method.__name__}(position_column='pixel'). "
                            f'Available dataframe columns are {self.frame.columns}',
                        )
                    raise pl.exceptions.ColumnNotFoundError(
                        "Neither 'position' is in the columns of the dataframe: "
                        f'{self.frame.columns} nor is the position column specified. '
                        f'Available dataframe columns are {self.frame.columns}',
                    )
            if transform_method.__name__ in {'pix2deg'}:
                if 'pixel' not in self.frame.columns and 'pixel_column' not in kwargs:
                    raise pl.exceptions.ColumnNotFoundError(
                        "Neither 'position' is in the columns of the dataframe: "
                        f'{self.frame.columns} nor is the pixel column specified. '
                        'You can specify the pixel column via: '
                        f'{transform_method.__name__}(pixel_column="name_of_your_pixel_column"). '
                        f'Available dataframe columns are {self.frame.columns}',
                    )

            if self.trial_columns is None:
                self.frame = self.frame.with_columns(transform_method(**kwargs))
            else:
                self.frame = pl.concat(
>                   [
                        df.with_columns(transform_method(**kwargs))
                        for group, df in self.frame.groupby(self.trial_columns, maintain_order=True)
                    ],
                )

kwargs     = {'method': 'fivepoint', 'n_components': 2, 'sampling_rate': 1000}
method_kwargs = ['sampling_rate', 'method', 'n_components', 'degree', 'window_length', 'padding', ...]
self       = <pymovements.gaze.gaze_dataframe.GazeDataFrame object at 0x7f205cbf8460>
transform_method = <function pos2vel at 0x7f1fbeda29d0>

src/pymovements/gaze/gaze_dataframe.py:358: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

.0 = <polars.dataframe.groupby.GroupBy object at 0x7f204f4ded00>

        [
>           df.with_columns(transform_method(**kwargs))
            for group, df in self.frame.groupby(self.trial_columns, maintain_order=True)
        ],
    )

.0         = <polars.dataframe.groupby.GroupBy object at 0x7f204f4ded00>
df         = shape: (15_076, 11)
┌──────────┬────────────┬────────────┬───────────┬───┬──────────────┬──────┬─────┬────────────────...948"]  │
└──────────┴────────────┴────────────┴───────────┴───┴──────────────┴──────┴─────┴───────────────────────────┘
group      = (1, 2, 2, 'FXS')
kwargs     = {'method': 'fivepoint', 'n_components': 2, 'sampling_rate': 1000}
transform_method = <function pos2vel at 0x7f1fbeda29d0>

src/pymovements/gaze/gaze_dataframe.py:359: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = shape: (15_076, 11)
┌──────────┬────────────┬────────────┬───────────┬───┬──────────────┬──────┬─────┬────────────────...948"]  │
└──────────┴────────────┴────────────┴───────────┴───┴──────────────┴──────┴─────┴───────────────────────────┘
exprs = (<polars.expr.expr.Expr object at 0x7f205cbf8910>,), named_exprs = {}

    def with_columns(
        self,
        *exprs: IntoExpr | Iterable[IntoExpr],
        **named_exprs: IntoExpr,
    ) -> DataFrame:
        """
        Add columns to this DataFrame.

        Added columns will replace existing columns with the same name.

        Parameters
        ----------
        *exprs
            Column(s) to add, specified as positional arguments.
            Accepts expression input. Strings are parsed as column names, other
            non-expression inputs are parsed as literals.
        **named_exprs
            Additional columns to add, specified as keyword arguments.
            The columns will be renamed to the keyword used.

        Returns
        -------
        DataFrame
            A new DataFrame with the columns added.

        Notes
        -----
        Creating a new DataFrame using this method does not create a new copy of
        existing data.

        Examples
        --------
        Pass an expression to add it as a new column.

        >>> df = pl.DataFrame(
        ...     {
        ...         "a": [1, 2, 3, 4],
        ...         "b": [0.5, 4, 10, 13],
        ...         "c": [True, True, False, True],
        ...     }
        ... )
        >>> df.with_columns((pl.col("a") ** 2).alias("a^2"))
        shape: (4, 4)
        ┌─────┬──────┬───────┬──────┐
        │ a   ┆ b    ┆ c     ┆ a^2  │
        │ --- ┆ ---  ┆ ---   ┆ ---  │
        │ i64 ┆ f64  ┆ bool  ┆ f64  │
        ╞═════╪══════╪═══════╪══════╡
        │ 1   ┆ 0.5  ┆ true  ┆ 1.0  │
        │ 2   ┆ 4.0  ┆ true  ┆ 4.0  │
        │ 3   ┆ 10.0 ┆ false ┆ 9.0  │
        │ 4   ┆ 13.0 ┆ true  ┆ 16.0 │
        └─────┴──────┴───────┴──────┘

        Added columns will replace existing columns with the same name.

        >>> df.with_columns(pl.col("a").cast(pl.Float64))
        shape: (4, 3)
        ┌─────┬──────┬───────┐
        │ a   ┆ b    ┆ c     │
        │ --- ┆ ---  ┆ ---   │
        │ f64 ┆ f64  ┆ bool  │
        ╞═════╪══════╪═══════╡
        │ 1.0 ┆ 0.5  ┆ true  │
        │ 2.0 ┆ 4.0  ┆ true  │
        │ 3.0 ┆ 10.0 ┆ false │
        │ 4.0 ┆ 13.0 ┆ true  │
        └─────┴──────┴───────┘

        Multiple columns can be added by passing a list of expressions.

        >>> df.with_columns(
        ...     [
        ...         (pl.col("a") ** 2).alias("a^2"),
        ...         (pl.col("b") / 2).alias("b/2"),
        ...         (pl.col("c").is_not()).alias("not c"),
        ...     ]
        ... )
        shape: (4, 6)
        ┌─────┬──────┬───────┬──────┬──────┬───────┐
        │ a   ┆ b    ┆ c     ┆ a^2  ┆ b/2  ┆ not c │
        │ --- ┆ ---  ┆ ---   ┆ ---  ┆ ---  ┆ ---   │
        │ i64 ┆ f64  ┆ bool  ┆ f64  ┆ f64  ┆ bool  │
        ╞═════╪══════╪═══════╪══════╪══════╪═══════╡
        │ 1   ┆ 0.5  ┆ true  ┆ 1.0  ┆ 0.25 ┆ false │
        │ 2   ┆ 4.0  ┆ true  ┆ 4.0  ┆ 2.0  ┆ false │
        │ 3   ┆ 10.0 ┆ false ┆ 9.0  ┆ 5.0  ┆ true  │
        │ 4   ┆ 13.0 ┆ true  ┆ 16.0 ┆ 6.5  ┆ false │
        └─────┴──────┴───────┴──────┴──────┴───────┘

        Multiple columns also can be added using positional arguments instead of a list.

        >>> df.with_columns(
        ...     (pl.col("a") ** 2).alias("a^2"),
        ...     (pl.col("b") / 2).alias("b/2"),
        ...     (pl.col("c").is_not()).alias("not c"),
        ... )
        shape: (4, 6)
        ┌─────┬──────┬───────┬──────┬──────┬───────┐
        │ a   ┆ b    ┆ c     ┆ a^2  ┆ b/2  ┆ not c │
        │ --- ┆ ---  ┆ ---   ┆ ---  ┆ ---  ┆ ---   │
        │ i64 ┆ f64  ┆ bool  ┆ f64  ┆ f64  ┆ bool  │
        ╞═════╪══════╪═══════╪══════╪══════╪═══════╡
        │ 1   ┆ 0.5  ┆ true  ┆ 1.0  ┆ 0.25 ┆ false │
        │ 2   ┆ 4.0  ┆ true  ┆ 4.0  ┆ 2.0  ┆ false │
        │ 3   ┆ 10.0 ┆ false ┆ 9.0  ┆ 5.0  ┆ true  │
        │ 4   ┆ 13.0 ┆ true  ┆ 16.0 ┆ 6.5  ┆ false │
        └─────┴──────┴───────┴──────┴──────┴───────┘

        Use keyword arguments to easily name your expression inputs.

        >>> df.with_columns(
        ...     ab=pl.col("a") * pl.col("b"),
        ...     not_c=pl.col("c").is_not(),
        ... )
        shape: (4, 5)
        ┌─────┬──────┬───────┬──────┬───────┐
        │ a   ┆ b    ┆ c     ┆ ab   ┆ not_c │
        │ --- ┆ ---  ┆ ---   ┆ ---  ┆ ---   │
        │ i64 ┆ f64  ┆ bool  ┆ f64  ┆ bool  │
        ╞═════╪══════╪═══════╪══════╪═══════╡
        │ 1   ┆ 0.5  ┆ true  ┆ 0.5  ┆ false │
        │ 2   ┆ 4.0  ┆ true  ┆ 8.0  ┆ false │
        │ 3   ┆ 10.0 ┆ false ┆ 30.0 ┆ true  │
        │ 4   ┆ 13.0 ┆ true  ┆ 52.0 ┆ false │
        └─────┴──────┴───────┴──────┴───────┘

        Expressions with multiple outputs can be automatically instantiated as Structs
        by enabling the experimental setting ``Config.set_auto_structify(True)``:

        >>> with pl.Config(auto_structify=True):
        ...     df.drop("c").with_columns(
        ...         diffs=pl.col(["a", "b"]).diff().suffix("_diff"),
        ...     )
        ...
        shape: (4, 3)
        ┌─────┬──────┬─────────────┐
        │ a   ┆ b    ┆ diffs       │
        │ --- ┆ ---  ┆ ---         │
        │ i64 ┆ f64  ┆ struct[2]   │
        ╞═════╪══════╪═════════════╡
        │ 1   ┆ 0.5  ┆ {null,null} │
        │ 2   ┆ 4.0  ┆ {1,3.5}     │
        │ 3   ┆ 10.0 ┆ {1,6.0}     │
        │ 4   ┆ 13.0 ┆ {1,3.0}     │
        └─────┴──────┴─────────────┘

        """
        return (
>           self.lazy()
            .with_columns(*exprs, **named_exprs)
            .collect(no_optimization=True)
        )

exprs      = (<polars.expr.expr.Expr object at 0x7f205cbf8910>,)
named_exprs = {}
self       = shape: (15_076, 11)
┌──────────┬────────────┬────────────┬───────────┬───┬──────────────┬──────┬─────┬────────────────...948"]  │
└──────────┴────────────┴────────────┴───────────┴───┴──────────────┴──────┴─────┴───────────────────────────┘

/mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/polars/dataframe/frame.py:7631: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

args = (<LazyFrame [12 cols, {"round_id": Int64 … "velocity": List(Utf8)}] at 0x7F1F14467A90>,), kwargs = {'no_optimization': True}

    @wraps(function)
    def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
        _rename_keyword_argument(
            old_name, new_name, kwargs, function.__name__, version
        )
>       return function(*args, **kwargs)

args       = (<LazyFrame [12 cols, {"round_id": Int64 … "velocity": List(Utf8)}] at 0x7F1F14467A90>,)
function   = <function LazyFrame.collect at 0x7f205fa84c10>
kwargs     = {'no_optimization': True}
new_name   = 'comm_subplan_elim'
old_name   = 'common_subplan_elimination'
version    = '0.18.9'

/mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/polars/utils/deprecation.py:93: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <LazyFrame [12 cols, {"round_id": Int64 … "velocity": List(Utf8)}] at 0x7F1F14467A90>

    @deprecate_renamed_parameter(
        "common_subplan_elimination", "comm_subplan_elim", version="0.18.9"
    )
    def collect(
        self,
        *,
        type_coercion: bool = True,
        predicate_pushdown: bool = True,
        projection_pushdown: bool = True,
        simplify_expression: bool = True,
        no_optimization: bool = False,
        slice_pushdown: bool = True,
        comm_subplan_elim: bool = True,
        comm_subexpr_elim: bool = True,
        streaming: bool = False,
    ) -> DataFrame:
        """
        Collect into a DataFrame.

        Note: use :func:`fetch` if you want to run your query on the first `n` rows
        only. This can be a huge time saver in debugging queries.

        Parameters
        ----------
        type_coercion
            Do type coercion optimization.
        predicate_pushdown
            Do predicate pushdown optimization.
        projection_pushdown
            Do projection pushdown optimization.
        simplify_expression
            Run simplify expressions optimization.
        no_optimization
            Turn off (certain) optimizations.
        slice_pushdown
            Slice pushdown optimization.
        comm_subplan_elim
            Will try to cache branching subplans that occur on self-joins or unions.
        comm_subexpr_elim
            Common subexpressions will be cached and reused.
        streaming
            Run parts of the query in a streaming fashion (this is in an alpha state)

        Returns
        -------
        DataFrame

        Examples
        --------
        >>> lf = pl.LazyFrame(
        ...     {
        ...         "a": ["a", "b", "a", "b", "b", "c"],
        ...         "b": [1, 2, 3, 4, 5, 6],
        ...         "c": [6, 5, 4, 3, 2, 1],
        ...     }
        ... )
        >>> lf.groupby("a", maintain_order=True).agg(pl.all().sum()).collect()
        shape: (3, 3)
        ┌─────┬─────┬─────┐
        │ a   ┆ b   ┆ c   │
        │ --- ┆ --- ┆ --- │
        │ str ┆ i64 ┆ i64 │
        ╞═════╪═════╪═════╡
        │ a   ┆ 4   ┆ 10  │
        │ b   ┆ 11  ┆ 10  │
        │ c   ┆ 6   ┆ 1   │
        └─────┴─────┴─────┘

        """
        if no_optimization:
            predicate_pushdown = False
            projection_pushdown = False
            slice_pushdown = False
            comm_subplan_elim = False
            comm_subexpr_elim = False

        if streaming:
            comm_subplan_elim = False

        ldf = self._ldf.optimization_toggle(
            type_coercion,
            predicate_pushdown,
            projection_pushdown,
            simplify_expression,
            slice_pushdown,
            comm_subplan_elim,
            comm_subexpr_elim,
            streaming,
        )
>       return wrap_df(ldf.collect())
E       exceptions.ComputeError: arithmetic on string and numeric not allowed, try an explicit cast first

comm_subexpr_elim = False
comm_subplan_elim = False
ldf        = <builtins.PyLazyFrame object at 0x7f1f14821bb0>
no_optimization = True
predicate_pushdown = False
projection_pushdown = False
self       = <LazyFrame [12 cols, {"round_id": Int64 … "velocity": List(Utf8)}] at 0x7F1F14467A90>
simplify_expression = True
slice_pushdown = False
streaming  = False
type_coercion = True

/mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/polars/lazyframe/frame.py:1695: ComputeError
---------------------------------------------------------------------------------------------- Captured stdout call -----------------------------------------------------------------------------------------------
Downloading https://figshare.com/ndownloader/files/27039812 to /tmp/pytest-of-krakowczyk/pytest-0/test_public_dataset_processing1/GazeBase/downloads/GazeBase_v2_0.zip
Checking integrity of GazeBase_v2_0.zip
Extracting GazeBase_v2_0.zip to /tmp/pytest-of-krakowczyk/pytest-0/test_public_dataset_processing1/GazeBase/raw
---------------------------------------------------------------------------------------------- Captured stderr call -----------------------------------------------------------------------------------------------
GazeBase_v2_0.zip: 100%|██████████| 6.25G/6.25G [03:27<00:00, 32.4MB/s]   
100%|██████████| 12334/12334 [08:12<00:00, 25.06it/s]
  0%|          | 22/12334 [00:01<10:37, 19.32it/s]
______________________________________________________________________________________ test_public_dataset_processing[SBSAT] ______________________________________________________________________________________

dataset_name = 'SBSAT', tmp_path = PosixPath('/tmp/pytest-of-krakowczyk/pytest-0/test_public_dataset_processing5')

    @pytest.mark.parametrize(
        'dataset_name',
        list(pm.dataset.DatasetLibrary.definitions.keys()),
    )
    def test_public_dataset_processing(dataset_name, tmp_path):
        # Initialize dataset.
        dataset_path = tmp_path / dataset_name
        dataset = pm.Dataset(dataset_name, path=dataset_path)

        # Download and load in dataset.
>       dataset.download()

dataset    = <pymovements.dataset.dataset.Dataset object at 0x7f205bdcfc40>
dataset_name = 'SBSAT'
dataset_path = PosixPath('/tmp/pytest-of-krakowczyk/pytest-0/test_public_dataset_processing5/SBSAT')
tmp_path   = PosixPath('/tmp/pytest-of-krakowczyk/pytest-0/test_public_dataset_processing5')

tests/integration/public_dataset_processing_test.py:36: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <pymovements.dataset.dataset.Dataset object at 0x7f205bdcfc40>

    def download(
            self,
            *,
            extract: bool = True,
            remove_finished: bool = False,
            verbose: int = 1,
    ) -> Dataset:
        """Download dataset resources.

        This downloads all resources of the dataset. Per default this also extracts all archives
        into :py:meth:`Dataset.paths.raw`,
        To save space on your device you can remove the archive files after
        successful extraction with ``remove_finished=True``.

        If a corresponding file already exists in the local system, its checksum is calculated and
        checked against the expected checksum.
        Downloading will be evaded if the integrity of the existing file can be verified.
        If the existing file does not match the expected checksum it is overwritten with the
        downloaded new file.

        Parameters
        ----------
        extract : bool
            Extract dataset archive files.
        remove_finished : bool
            Remove archive files after extraction.
        verbose : int
            Verbosity levels: (1) Show download progress bar and print info messages on downloading
            and extracting archive files without printing messages for recursive archive extraction.
            (2) Print additional messages for each recursive archive extract.

        Raises
        ------
        AttributeError
            If number of mirrors or number of resources specified for dataset is zero.
        RuntimeError
            If downloading a resource failed for all given mirrors.

        Returns
        -------
        PublicDataset
            Returns self, useful for method cascading.
        """
>       dataset_download.download_dataset(
            definition=self.definition,
            paths=self.paths,
            extract=extract,
            remove_finished=remove_finished,
            verbose=bool(verbose),
        )

extract    = True
remove_finished = False
self       = <pymovements.dataset.dataset.Dataset object at 0x7f205bdcfc40>
verbose    = 1

src/pymovements/dataset/dataset.py:761: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

definition = SBSAT(name='SBSAT', mirrors=('https://files.de-1.osf.io/v1/resources/cdx69/providers/osfstorage/',), resources=({'reso...ns=['x_left', 'y_left'], position_columns=None, velocity_columns=None, acceleration_columns=None, distance_column=None)
paths = <pymovements.dataset.dataset_paths.DatasetPaths object at 0x7f205bdcfa90>, extract = True, remove_finished = False, verbose = True

    def download_dataset(
            definition: DatasetDefinition,
            paths: DatasetPaths,
            extract: bool = True,
            remove_finished: bool = False,
            verbose: bool = True,
    ) -> None:
        """Download dataset resources.

        This downloads all resources of the dataset. Per default this also extracts all archives
        into :py:meth:`Dataset.paths.raw`,
        To save space on your device you can remove the archive files after
        successful extraction with ``remove_finished=True``.

        If a corresponding file already exists in the local system, its checksum is calculated and
        checked against the expected checksum.
        Downloading will be evaded if the integrity of the existing file can be verified.
        If the existing file does not match the expected checksum it is overwritten with the
        downloaded new file.

        Parameters
        ----------
        definition:
            The dataset definition.
        paths:
            The dataset paths.
        extract : bool
            Extract dataset archive files.
        remove_finished : bool
            Remove archive files after extraction.
        verbose : bool
            If True, show progress of download and print status messages for integrity checking and
            file extraction.

        Raises
        ------
        AttributeError
            If number of mirrors or number of resources specified for dataset is zero.
        RuntimeError
            If downloading a resource failed for all given mirrors.
        """
        if len(definition.mirrors) == 0:
            raise AttributeError('number of mirrors must not be zero to download dataset')

        if len(definition.resources) == 0:
            raise AttributeError('number of resources must not be zero to download dataset')

        paths.raw.mkdir(parents=True, exist_ok=True)

        for resource in definition.resources:
            success = False

            for mirror_idx, mirror in enumerate(definition.mirrors):

                url = f'{mirror}{resource["resource"]}'

                try:
                    download_file(
                        url=url,
                        dirpath=paths.downloads,
                        filename=resource['filename'],
                        md5=resource['md5'],
                        verbose=verbose,
                    )
                    success = True

                # pylint: disable=overlapping-except
                except (URLError, OSError, RuntimeError) as error:
                    # Error downloading the resource, try next mirror
                    if mirror_idx < len(definition.mirrors) - 1:
                        print(f'Failed to download:\n{error}\nTrying next mirror.')
                    continue

                # downloading the resource was successful, we don't need to try another mirror
                break

            if not success:
>               raise RuntimeError(
                    f"downloading resource {resource['resource']} failed for all mirrors.",
                )
E               RuntimeError: downloading resource 64525979230ea6163c031267/?zip= failed for all mirrors.

definition = SBSAT(name='SBSAT', mirrors=('https://files.de-1.osf.io/v1/resources/cdx69/providers/osfstorage/',), resources=({'reso...ns=['x_left', 'y_left'], position_columns=None, velocity_columns=None, acceleration_columns=None, distance_column=None)
extract    = True
mirror     = 'https://files.de-1.osf.io/v1/resources/cdx69/providers/osfstorage/'
mirror_idx = 0
paths      = <pymovements.dataset.dataset_paths.DatasetPaths object at 0x7f205bdcfa90>
remove_finished = False
resource   = {'filename': 'csvs.zip', 'md5': '3cf074c93266b723437cf887f948c993', 'resource': '64525979230ea6163c031267/?zip='}
success    = False
url        = 'https://files.de-1.osf.io/v1/resources/cdx69/providers/osfstorage/64525979230ea6163c031267/?zip='
verbose    = True

src/pymovements/dataset/dataset_download.py:108: RuntimeError
---------------------------------------------------------------------------------------------- Captured stdout call -----------------------------------------------------------------------------------------------
Downloading https://files.de-1.osf.io/v1/resources/cdx69/providers/osfstorage/64525979230ea6163c031267/?zip= to /tmp/pytest-of-krakowczyk/pytest-0/test_public_dataset_processing5/SBSAT/downloads/csvs.zip
Checking integrity of csvs.zip
---------------------------------------------------------------------------------------------- Captured stderr call -----------------------------------------------------------------------------------------------
csvs.zip: 100%|██████████| 403M/403M [04:03<00:00, 1.74MB/s]
================================================================================================ warnings summary =================================================================================================
../../../../mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/matplotlib/__init__.py:169
../../../../mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/matplotlib/__init__.py:169
../../../../mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/matplotlib/__init__.py:169
../../../../mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/matplotlib/__init__.py:169
../../../../mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/matplotlib/__init__.py:169
  /mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/matplotlib/__init__.py:169: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
    if LooseVersion(module.__version__) < minver:

../../../../mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/setuptools/_distutils/version.py:346
../../../../mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/setuptools/_distutils/version.py:346
../../../../mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/setuptools/_distutils/version.py:346
../../../../mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/setuptools/_distutils/version.py:346
../../../../mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/setuptools/_distutils/version.py:346
  /mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
    other = LooseVersion(other)

-- Docs: https://docs.pytest.org/en/stable/warnings.html
============================================================================================= short test summary info =============================================================================================
FAILED tests/integration/public_dataset_processing_test.py::test_public_dataset_processing[GazeBase] - exceptions.ComputeError: arithmetic on string and numeric not allowed, try an explicit cast first
FAILED tests/integration/public_dataset_processing_test.py::test_public_dataset_processing[SBSAT] - RuntimeError: downloading resource 64525979230ea6163c031267/?zip= failed for all mirrors.
============================================================================== 2 failed, 6 passed, 10 warnings in 4148.91s (1:09:08) ==============================================================================

The error on GazeBase exactly reproduces #517. I will now merge #593 into this PR and see whether that gets rid of the error.

The failure on SB-Sat is strange, though. @prassepaul, do you know why that happened?

codecov[bot] commented 9 months ago

Codecov Report

All modified lines are covered by tests :white_check_mark:

Comparison is base (8836275) 100.00% compared to head (575ba37) 100.00%.

Additional details and impacted files

```diff
@@           Coverage Diff            @@
##              main     #591   +/-   ##
=========================================
  Coverage   100.00%   100.00%
=========================================
  Files           52        52
  Lines         2337      2337
  Branches       582       582
=========================================
  Hits          2337      2337
```


dkrako commented 9 months ago

Merging #593 into this PR resolves #517:

=============================================================================================== test session starts ===============================================================================================
platform linux -- Python 3.9.12, pytest-6.2.5, py-1.11.0, pluggy-1.0.0
rootdir: /home/krakowczyk/workspace/pymovements, configfile: pyproject.toml
plugins: anyio-3.7.1, dash-2.11.1, lazy-fixture-0.6.3, hydra-core-1.3.2, cov-3.0.0
collected 8 items                                                                                                                                                                                                 

tests/integration/public_dataset_processing_test.py .....F..                                                                                                                                                [100%]

==================================================================================================== FAILURES =====================================================================================================
______________________________________________________________________________________ test_public_dataset_processing[SBSAT] ______________________________________________________________________________________

dataset_name = 'SBSAT', tmp_path = PosixPath('/tmp/pytest-of-krakowczyk/pytest-1/test_public_dataset_processing5')

    @pytest.mark.parametrize(
        'dataset_name',
        list(pm.dataset.DatasetLibrary.definitions.keys()),
    )
    def test_public_dataset_processing(dataset_name, tmp_path):
        # Initialize dataset.
        dataset_path = tmp_path / dataset_name
        dataset = pm.Dataset(dataset_name, path=dataset_path)

        # Download and load in dataset.
>       dataset.download()

dataset    = <pymovements.dataset.dataset.Dataset object at 0x7f1f13b22460>
dataset_name = 'SBSAT'
dataset_path = PosixPath('/tmp/pytest-of-krakowczyk/pytest-1/test_public_dataset_processing5/SBSAT')
tmp_path   = PosixPath('/tmp/pytest-of-krakowczyk/pytest-1/test_public_dataset_processing5')

tests/integration/public_dataset_processing_test.py:36: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <pymovements.dataset.dataset.Dataset object at 0x7f1f13b22460>

    def download(
            self,
            *,
            extract: bool = True,
            remove_finished: bool = False,
            verbose: int = 1,
    ) -> Dataset:
        """Download dataset resources.

        This downloads all resources of the dataset. Per default this also extracts all archives
        into :py:meth:`Dataset.paths.raw`,
        To save space on your device you can remove the archive files after
        successful extraction with ``remove_finished=True``.

        If a corresponding file already exists in the local system, its checksum is calculated and
        checked against the expected checksum.
        Downloading will be evaded if the integrity of the existing file can be verified.
        If the existing file does not match the expected checksum it is overwritten with the
        downloaded new file.

        Parameters
        ----------
        extract : bool
            Extract dataset archive files.
        remove_finished : bool
            Remove archive files after extraction.
        verbose : int
            Verbosity levels: (1) Show download progress bar and print info messages on downloading
            and extracting archive files without printing messages for recursive archive extraction.
            (2) Print additional messages for each recursive archive extract.

        Raises
        ------
        AttributeError
            If number of mirrors or number of resources specified for dataset is zero.
        RuntimeError
            If downloading a resource failed for all given mirrors.

        Returns
        -------
        PublicDataset
            Returns self, useful for method cascading.
        """
>       dataset_download.download_dataset(
            definition=self.definition,
            paths=self.paths,
            extract=extract,
            remove_finished=remove_finished,
            verbose=bool(verbose),
        )

extract    = True
remove_finished = False
self       = <pymovements.dataset.dataset.Dataset object at 0x7f1f13b22460>
verbose    = 1

src/pymovements/dataset/dataset.py:761: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

definition = SBSAT(name='SBSAT', mirrors=('https://files.de-1.osf.io/v1/resources/cdx69/providers/osfstorage/',), resources=({'reso...ns=['x_left', 'y_left'], position_columns=None, velocity_columns=None, acceleration_columns=None, distance_column=None)
paths = <pymovements.dataset.dataset_paths.DatasetPaths object at 0x7f1f13b22220>, extract = True, remove_finished = False, verbose = True

    def download_dataset(
            definition: DatasetDefinition,
            paths: DatasetPaths,
            extract: bool = True,
            remove_finished: bool = False,
            verbose: bool = True,
    ) -> None:
        """Download dataset resources.

        This downloads all resources of the dataset. Per default this also extracts all archives
        into :py:meth:`Dataset.paths.raw`,
        To save space on your device you can remove the archive files after
        successful extraction with ``remove_finished=True``.

        If a corresponding file already exists in the local system, its checksum is calculated and
        checked against the expected checksum.
        Downloading will be evaded if the integrity of the existing file can be verified.
        If the existing file does not match the expected checksum it is overwritten with the
        downloaded new file.

        Parameters
        ----------
        definition:
            The dataset definition.
        paths:
            The dataset paths.
        extract : bool
            Extract dataset archive files.
        remove_finished : bool
            Remove archive files after extraction.
        verbose : bool
            If True, show progress of download and print status messages for integrity checking and
            file extraction.

        Raises
        ------
        AttributeError
            If number of mirrors or number of resources specified for dataset is zero.
        RuntimeError
            If downloading a resource failed for all given mirrors.
        """
        if len(definition.mirrors) == 0:
            raise AttributeError('number of mirrors must not be zero to download dataset')

        if len(definition.resources) == 0:
            raise AttributeError('number of resources must not be zero to download dataset')

        paths.raw.mkdir(parents=True, exist_ok=True)

        for resource in definition.resources:
            success = False

            for mirror_idx, mirror in enumerate(definition.mirrors):

                url = f'{mirror}{resource["resource"]}'

                try:
                    download_file(
                        url=url,
                        dirpath=paths.downloads,
                        filename=resource['filename'],
                        md5=resource['md5'],
                        verbose=verbose,
                    )
                    success = True

                # pylint: disable=overlapping-except
                except (URLError, OSError, RuntimeError) as error:
                    # Error downloading the resource, try next mirror
                    if mirror_idx < len(definition.mirrors) - 1:
                        print(f'Failed to download:\n{error}\nTrying next mirror.')
                    continue

                # downloading the resource was successful, we don't need to try another mirror
                break

            if not success:
>               raise RuntimeError(
                    f"downloading resource {resource['resource']} failed for all mirrors.",
                )
E               RuntimeError: downloading resource 64525979230ea6163c031267/?zip= failed for all mirrors.

definition = SBSAT(name='SBSAT', mirrors=('https://files.de-1.osf.io/v1/resources/cdx69/providers/osfstorage/',), resources=({'reso...ns=['x_left', 'y_left'], position_columns=None, velocity_columns=None, acceleration_columns=None, distance_column=None)
extract    = True
mirror     = 'https://files.de-1.osf.io/v1/resources/cdx69/providers/osfstorage/'
mirror_idx = 0
paths      = <pymovements.dataset.dataset_paths.DatasetPaths object at 0x7f1f13b22220>
remove_finished = False
resource   = {'filename': 'csvs.zip', 'md5': '3cf074c93266b723437cf887f948c993', 'resource': '64525979230ea6163c031267/?zip='}
success    = False
url        = 'https://files.de-1.osf.io/v1/resources/cdx69/providers/osfstorage/64525979230ea6163c031267/?zip='
verbose    = True

src/pymovements/dataset/dataset_download.py:108: RuntimeError
---------------------------------------------------------------------------------------------- Captured stdout call -----------------------------------------------------------------------------------------------
Downloading https://files.de-1.osf.io/v1/resources/cdx69/providers/osfstorage/64525979230ea6163c031267/?zip= to /tmp/pytest-of-krakowczyk/pytest-1/test_public_dataset_processing5/SBSAT/downloads/csvs.zip
Checking integrity of csvs.zip
---------------------------------------------------------------------------------------------- Captured stderr call -----------------------------------------------------------------------------------------------
csvs.zip: 100%|██████████| 403M/403M [03:58<00:00, 1.77MB/s]
================================================================================================ warnings summary =================================================================================================
../../../../mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/matplotlib/__init__.py:169
../../../../mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/matplotlib/__init__.py:169
../../../../mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/matplotlib/__init__.py:169
../../../../mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/matplotlib/__init__.py:169
../../../../mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/matplotlib/__init__.py:169
  /mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/matplotlib/__init__.py:169: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
    if LooseVersion(module.__version__) < minver:

../../../../mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/setuptools/_distutils/version.py:346
../../../../mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/setuptools/_distutils/version.py:346
../../../../mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/setuptools/_distutils/version.py:346
../../../../mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/setuptools/_distutils/version.py:346
../../../../mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/setuptools/_distutils/version.py:346
  /mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
    other = LooseVersion(other)

-- Docs: https://docs.pytest.org/en/stable/warnings.html
============================================================================================= short test summary info =============================================================================================
FAILED tests/integration/public_dataset_processing_test.py::test_public_dataset_processing[SBSAT] - RuntimeError: downloading resource 64525979230ea6163c031267/?zip= failed for all mirrors.
============================================================================== 1 failed, 7 passed, 10 warnings in 5418.02s (1:30:18) ==============================================================================

Also note the runtime of 5418.02s (1:30:18), on our DGX using all cores and 60+ GB of RAM.

dkrako commented 9 months ago

The problem with SBSAT should be solved in a separate issue. This PR is now ready for review.

We will not include the integration tests in our CI (yet). This single test run took 90 minutes (with one dataset failing right at the start of its download). I rather think integration testing should be limited to a single run before publishing each release.

As long as we haven't solved our very high memory usage, I can do these test runs manually on our DGX via tox -e integration (a sketch of such a tox environment is below).
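For reference, a minimal sketch of what that tox environment could look like; this is an assumption for illustration, and the repository's actual tox.ini may differ:

```ini
# hypothetical testenv; the real tox.ini in the repository may differ
[testenv:integration]
deps =
    pytest
commands =
    pytest tests/integration {posargs}
```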