dkrako closed 9 months ago
The GitHub workers are failing the integration tests. I see three potential reasons:
1. Dataset.load() loads a complete dataset into memory. If the workers have too little working memory, they will fail.
Point 1 would be solved once we implement batched preprocessing: instead of keeping all dataset files in memory, we would process N files in parallel, so only N files need to be in memory at any time.
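To make the batched preprocessing idea concrete, here is a minimal sketch using only the standard library. It is not the pymovements implementation: `batched`, `process_in_batches`, `load` and `process` are hypothetical names chosen for illustration, and the files within a batch are handled sequentially here rather than in parallel. The point is only that at most `batch_size` loaded files are referenced at any time.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar('T')
F = TypeVar('F')


def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most ``batch_size`` items."""
    if batch_size < 1:
        raise ValueError('batch_size must be at least 1')
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def process_in_batches(
    paths: Iterable[str],
    load: Callable[[str], F],
    process: Callable[[F], None],
    batch_size: int,
) -> int:
    """Load and process files batch by batch and return the number handled.

    ``load`` and ``process`` are hypothetical stand-ins for reading one gaze
    file and applying a preprocessing step to it.
    """
    n_processed = 0
    for batch in batched(paths, batch_size):
        # Only the files of the current batch are loaded at this point.
        frames = [load(path) for path in batch]
        for frame in frames:
            process(frame)
            n_processed += 1
        # ``frames`` is rebound in the next iteration, so the previous
        # batch is no longer referenced and can be garbage collected.
    return n_processed
```

With `batch_size=N`, peak memory would scale with the N largest files instead of with the whole dataset, at the cost of writing results to disk (or otherwise discarding them) after each batch.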
If it's Point 2 or 3 then we have a problem.
I'm running the integration tests locally on our DGX, let's see what we get as output from that.
We have one expected failure on GazeBase and one unexpected failure on SB-Sat.
Output
=============================================================================================== test session starts ===============================================================================================
platform linux -- Python 3.9.12, pytest-6.2.5, py-1.11.0, pluggy-1.0.0
rootdir: /home/krakowczyk/workspace/pymovements, configfile: pyproject.toml
plugins: anyio-3.7.1, dash-2.11.1, lazy-fixture-0.6.3, hydra-core-1.3.2, cov-3.0.0
collected 8 items
tests/integration/public_dataset_processing_test.py .F...F.. [100%]
==================================================================================================== FAILURES =====================================================================================================
____________________________________________________________________________________ test_public_dataset_processing[GazeBase] _____________________________________________________________________________________
dataset_name = 'GazeBase', tmp_path = PosixPath('/tmp/pytest-of-krakowczyk/pytest-0/test_public_dataset_processing1')
@pytest.mark.parametrize(
'dataset_name',
list(pm.dataset.DatasetLibrary.definitions.keys()),
)
def test_public_dataset_processing(dataset_name, tmp_path):
# Initialize dataset.
dataset_path = tmp_path / dataset_name
dataset = pm.Dataset(dataset_name, path=dataset_path)
# Download and load in dataset.
dataset.download()
dataset.load()
# Do some basic transformations.
if 'pixel' in dataset.gaze[0].columns:
dataset.pix2deg()
> dataset.pos2vel()
dataset = <pymovements.dataset.dataset.Dataset object at 0x7f205c77dd30>
dataset_name = 'GazeBase'
dataset_path = PosixPath('/tmp/pytest-of-krakowczyk/pytest-0/test_public_dataset_processing1/GazeBase')
tmp_path = PosixPath('/tmp/pytest-of-krakowczyk/pytest-0/test_public_dataset_processing1')
tests/integration/public_dataset_processing_test.py:42:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <pymovements.dataset.dataset.Dataset object at 0x7f205c77dd30>, method = 'fivepoint', verbose = True, kwargs = {}
def pos2vel(
self,
method: str = 'fivepoint',
*,
verbose: bool = True,
**kwargs: Any,
) -> Dataset:
"""Compute gaze velocites in dva/s from dva coordinates.
This method requires a properly initialized :py:attr:`~.Dataset.experiment` attribute.
After success, the gaze dataframe is extended by the resulting velocity columns.
Parameters
----------
method : str
Computation method. See :func:`~transforms.pos2vel()` for details, default: smooth.
verbose : bool
If True, show progress of computation.
**kwargs
Additional keyword arguments to be passed to the :func:`~transforms.pos2vel()` method.
Raises
------
AttributeError
If `gaze` is None or there are no gaze dataframes present in the `gaze` attribute, or
if experiment is None.
Returns
-------
Dataset
Returns self, useful for method cascading.
"""
> return self.apply('pos2vel', method=method, verbose=verbose, **kwargs)
kwargs = {}
method = 'fivepoint'
self = <pymovements.dataset.dataset.Dataset object at 0x7f205c77dd30>
verbose = True
src/pymovements/dataset/dataset.py:393:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <pymovements.dataset.dataset.Dataset object at 0x7f205c77dd30>, function = 'pos2vel', verbose = True, kwargs = {'method': 'fivepoint'}, disable_progressbar = False
gaze = <pymovements.gaze.gaze_dataframe.GazeDataFrame object at 0x7f205cbf8460>
def apply(
self,
function: str,
*,
verbose: bool = True,
**kwargs: Any,
) -> Dataset:
"""Apply preprocessing method to all GazeDataFrames in Dataset.
Parameters
----------
function: str
Name of the preprocessing function to apply.
verbose : bool
If True, show progress bar of computation.
kwargs:
kwargs that will be forwarded when calling the preprocessing method.
Examples
--------
Let's load in our dataset first,
>>> import pymovements as pm
>>>
>>> dataset = pm.Dataset("ToyDataset", path='toy_dataset')
>>> dataset.download()# doctest:+ELLIPSIS
Downloading ... to toy_dataset...downloads...
Checking integrity of ...
Extracting ... to toy_dataset...raw
<pymovements.dataset.dataset.Dataset object at ...>
>>> dataset.load()# doctest:+ELLIPSIS
<pymovements.dataset.dataset.Dataset object at ...>
Use apply for your gaze transformations:
>>> dataset.apply('pix2deg')# doctest:+ELLIPSIS
<pymovements.dataset.dataset.Dataset object at ...>
>>> dataset.apply('pos2vel', method='neighbors')# doctest:+ELLIPSIS
<pymovements.dataset.dataset.Dataset object at ...>
Use apply for your event detection:
>>> dataset.apply('ivt')# doctest:+ELLIPSIS
<pymovements.dataset.dataset.Dataset object at ...>
>>> dataset.apply('microsaccades', minimum_duration=8)# doctest:+ELLIPSIS
<pymovements.dataset.dataset.Dataset object at ...>
"""
self._check_gaze_dataframe()
disable_progressbar = not verbose
for gaze in tqdm(self.gaze, disable=disable_progressbar):
> gaze.apply(function, **kwargs)
disable_progressbar = False
function = 'pos2vel'
gaze = <pymovements.gaze.gaze_dataframe.GazeDataFrame object at 0x7f205cbf8460>
kwargs = {'method': 'fivepoint'}
self = <pymovements.dataset.dataset.Dataset object at 0x7f205c77dd30>
verbose = True
src/pymovements/dataset/dataset.py:287:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <pymovements.gaze.gaze_dataframe.GazeDataFrame object at 0x7f205cbf8460>, function = 'pos2vel', kwargs = {'method': 'fivepoint'}
def apply(
self,
function: str,
**kwargs: Any,
) -> None:
"""Apply preprocessing method to GazeDataFrame.
Parameters
----------
function: str
Name of the preprocessing function to apply.
kwargs:
kwargs that will be forwarded when calling the preprocessing method.
"""
if transforms.TransformLibrary.__contains__(function):
> self.transform(function, **kwargs)
function = 'pos2vel'
kwargs = {'method': 'fivepoint'}
self = <pymovements.gaze.gaze_dataframe.GazeDataFrame object at 0x7f205cbf8460>
src/pymovements/gaze/gaze_dataframe.py:252:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <pymovements.gaze.gaze_dataframe.GazeDataFrame object at 0x7f205cbf8460>, transform_method = <function pos2vel at 0x7f1fbeda29d0>
kwargs = {'method': 'fivepoint', 'n_components': 2, 'sampling_rate': 1000}, method_kwargs = ['sampling_rate', 'method', 'n_components', 'degree', 'window_length', 'padding', ...]
def transform(
self,
transform_method: str | Callable[..., pl.Expr],
**kwargs: Any,
) -> None:
"""Apply transformation method."""
if isinstance(transform_method, str):
transform_method = transforms.TransformLibrary.get(transform_method)
if transform_method.__name__ == 'downsample':
downsample_factor = kwargs.pop('factor')
self.frame = self.frame.select(
transforms.downsample(
factor=downsample_factor, **kwargs,
),
)
else:
method_kwargs = inspect.getfullargspec(transform_method).kwonlyargs
if 'origin' in method_kwargs and 'origin' not in kwargs:
self._check_experiment()
assert self.experiment is not None
kwargs['origin'] = self.experiment.screen.origin
if 'screen_resolution' in method_kwargs and 'screen_resolution' not in kwargs:
self._check_experiment()
assert self.experiment is not None
kwargs['screen_resolution'] = (
self.experiment.screen.width_px, self.experiment.screen.height_px,
)
if 'screen_size' in method_kwargs and 'screen_size' not in kwargs:
self._check_experiment()
assert self.experiment is not None
kwargs['screen_size'] = (
self.experiment.screen.width_cm, self.experiment.screen.height_cm,
)
if 'distance' in method_kwargs and 'distance' not in kwargs:
self._check_experiment()
assert self.experiment is not None
if 'distance' in self.frame.columns:
kwargs['distance'] = 'distance'
if self.experiment.screen.distance_cm:
warnings.warn(
"Both a distance column and experiment's "
'eye-to-screen distance are specified. '
'Using eye-to-screen distances from column '
"'distance' in the dataframe.",
)
elif self.experiment.screen.distance_cm:
kwargs['distance'] = self.experiment.screen.distance_cm
else:
raise AttributeError(
'Neither eye-to-screen distance is in the columns of the dataframe '
'nor experiment eye-to-screen distance is specified.',
)
if 'sampling_rate' in method_kwargs and 'sampling_rate' not in kwargs:
self._check_experiment()
assert self.experiment is not None
kwargs['sampling_rate'] = self.experiment.sampling_rate
if 'n_components' in method_kwargs and 'n_components' not in kwargs:
self._check_n_components()
kwargs['n_components'] = self.n_components
if transform_method.__name__ in {'pos2vel', 'pos2acc'}:
if 'position' not in self.frame.columns and 'position_column' not in kwargs:
if 'pixel' in self.frame.columns:
raise pl.exceptions.ColumnNotFoundError(
"Neither 'position' is in the columns of the dataframe: "
f'{self.frame.columns} nor is the position column specified. '
"Since the dataframe has a 'pixel' column, consider running "
f'pix2deg() before {transform_method.__name__}(). If you want '
'to calculate pixel transformations, you can do so by using '
f"{transform_method.__name__}(position_column='pixel'). "
f'Available dataframe columns are {self.frame.columns}',
)
raise pl.exceptions.ColumnNotFoundError(
"Neither 'position' is in the columns of the dataframe: "
f'{self.frame.columns} nor is the position column specified. '
f'Available dataframe columns are {self.frame.columns}',
)
if transform_method.__name__ in {'pix2deg'}:
if 'pixel' not in self.frame.columns and 'pixel_column' not in kwargs:
raise pl.exceptions.ColumnNotFoundError(
"Neither 'position' is in the columns of the dataframe: "
f'{self.frame.columns} nor is the pixel column specified. '
'You can specify the pixel column via: '
f'{transform_method.__name__}(pixel_column="name_of_your_pixel_column"). '
f'Available dataframe columns are {self.frame.columns}',
)
if self.trial_columns is None:
self.frame = self.frame.with_columns(transform_method(**kwargs))
else:
self.frame = pl.concat(
> [
df.with_columns(transform_method(**kwargs))
for group, df in self.frame.groupby(self.trial_columns, maintain_order=True)
],
)
kwargs = {'method': 'fivepoint', 'n_components': 2, 'sampling_rate': 1000}
method_kwargs = ['sampling_rate', 'method', 'n_components', 'degree', 'window_length', 'padding', ...]
self = <pymovements.gaze.gaze_dataframe.GazeDataFrame object at 0x7f205cbf8460>
transform_method = <function pos2vel at 0x7f1fbeda29d0>
src/pymovements/gaze/gaze_dataframe.py:358:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
.0 = <polars.dataframe.groupby.GroupBy object at 0x7f204f4ded00>
[
> df.with_columns(transform_method(**kwargs))
for group, df in self.frame.groupby(self.trial_columns, maintain_order=True)
],
)
.0 = <polars.dataframe.groupby.GroupBy object at 0x7f204f4ded00>
df = shape: (15_076, 11)
┌──────────┬────────────┬────────────┬───────────┬───┬──────────────┬──────┬─────┬────────────────...948"] │
└──────────┴────────────┴────────────┴───────────┴───┴──────────────┴──────┴─────┴───────────────────────────┘
group = (1, 2, 2, 'FXS')
kwargs = {'method': 'fivepoint', 'n_components': 2, 'sampling_rate': 1000}
transform_method = <function pos2vel at 0x7f1fbeda29d0>
src/pymovements/gaze/gaze_dataframe.py:359:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = shape: (15_076, 11)
┌──────────┬────────────┬────────────┬───────────┬───┬──────────────┬──────┬─────┬────────────────...948"] │
└──────────┴────────────┴────────────┴───────────┴───┴──────────────┴──────┴─────┴───────────────────────────┘
exprs = (<polars.expr.expr.Expr object at 0x7f205cbf8910>,), named_exprs = {}
def with_columns(
self,
*exprs: IntoExpr | Iterable[IntoExpr],
**named_exprs: IntoExpr,
) -> DataFrame:
"""
Add columns to this DataFrame.
Added columns will replace existing columns with the same name.
Parameters
----------
*exprs
Column(s) to add, specified as positional arguments.
Accepts expression input. Strings are parsed as column names, other
non-expression inputs are parsed as literals.
**named_exprs
Additional columns to add, specified as keyword arguments.
The columns will be renamed to the keyword used.
Returns
-------
DataFrame
A new DataFrame with the columns added.
Notes
-----
Creating a new DataFrame using this method does not create a new copy of
existing data.
Examples
--------
Pass an expression to add it as a new column.
>>> df = pl.DataFrame(
... {
... "a": [1, 2, 3, 4],
... "b": [0.5, 4, 10, 13],
... "c": [True, True, False, True],
... }
... )
>>> df.with_columns((pl.col("a") ** 2).alias("a^2"))
shape: (4, 4)
┌─────┬──────┬───────┬──────┐
│ a ┆ b ┆ c ┆ a^2 │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ bool ┆ f64 │
╞═════╪══════╪═══════╪══════╡
│ 1 ┆ 0.5 ┆ true ┆ 1.0 │
│ 2 ┆ 4.0 ┆ true ┆ 4.0 │
│ 3 ┆ 10.0 ┆ false ┆ 9.0 │
│ 4 ┆ 13.0 ┆ true ┆ 16.0 │
└─────┴──────┴───────┴──────┘
Added columns will replace existing columns with the same name.
>>> df.with_columns(pl.col("a").cast(pl.Float64))
shape: (4, 3)
┌─────┬──────┬───────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ bool │
╞═════╪══════╪═══════╡
│ 1.0 ┆ 0.5 ┆ true │
│ 2.0 ┆ 4.0 ┆ true │
│ 3.0 ┆ 10.0 ┆ false │
│ 4.0 ┆ 13.0 ┆ true │
└─────┴──────┴───────┘
Multiple columns can be added by passing a list of expressions.
>>> df.with_columns(
... [
... (pl.col("a") ** 2).alias("a^2"),
... (pl.col("b") / 2).alias("b/2"),
... (pl.col("c").is_not()).alias("not c"),
... ]
... )
shape: (4, 6)
┌─────┬──────┬───────┬──────┬──────┬───────┐
│ a ┆ b ┆ c ┆ a^2 ┆ b/2 ┆ not c │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ bool ┆ f64 ┆ f64 ┆ bool │
╞═════╪══════╪═══════╪══════╪══════╪═══════╡
│ 1 ┆ 0.5 ┆ true ┆ 1.0 ┆ 0.25 ┆ false │
│ 2 ┆ 4.0 ┆ true ┆ 4.0 ┆ 2.0 ┆ false │
│ 3 ┆ 10.0 ┆ false ┆ 9.0 ┆ 5.0 ┆ true │
│ 4 ┆ 13.0 ┆ true ┆ 16.0 ┆ 6.5 ┆ false │
└─────┴──────┴───────┴──────┴──────┴───────┘
Multiple columns also can be added using positional arguments instead of a list.
>>> df.with_columns(
... (pl.col("a") ** 2).alias("a^2"),
... (pl.col("b") / 2).alias("b/2"),
... (pl.col("c").is_not()).alias("not c"),
... )
shape: (4, 6)
┌─────┬──────┬───────┬──────┬──────┬───────┐
│ a ┆ b ┆ c ┆ a^2 ┆ b/2 ┆ not c │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ bool ┆ f64 ┆ f64 ┆ bool │
╞═════╪══════╪═══════╪══════╪══════╪═══════╡
│ 1 ┆ 0.5 ┆ true ┆ 1.0 ┆ 0.25 ┆ false │
│ 2 ┆ 4.0 ┆ true ┆ 4.0 ┆ 2.0 ┆ false │
│ 3 ┆ 10.0 ┆ false ┆ 9.0 ┆ 5.0 ┆ true │
│ 4 ┆ 13.0 ┆ true ┆ 16.0 ┆ 6.5 ┆ false │
└─────┴──────┴───────┴──────┴──────┴───────┘
Use keyword arguments to easily name your expression inputs.
>>> df.with_columns(
... ab=pl.col("a") * pl.col("b"),
... not_c=pl.col("c").is_not(),
... )
shape: (4, 5)
┌─────┬──────┬───────┬──────┬───────┐
│ a ┆ b ┆ c ┆ ab ┆ not_c │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ bool ┆ f64 ┆ bool │
╞═════╪══════╪═══════╪══════╪═══════╡
│ 1 ┆ 0.5 ┆ true ┆ 0.5 ┆ false │
│ 2 ┆ 4.0 ┆ true ┆ 8.0 ┆ false │
│ 3 ┆ 10.0 ┆ false ┆ 30.0 ┆ true │
│ 4 ┆ 13.0 ┆ true ┆ 52.0 ┆ false │
└─────┴──────┴───────┴──────┴───────┘
Expressions with multiple outputs can be automatically instantiated as Structs
by enabling the experimental setting ``Config.set_auto_structify(True)``:
>>> with pl.Config(auto_structify=True):
... df.drop("c").with_columns(
... diffs=pl.col(["a", "b"]).diff().suffix("_diff"),
... )
...
shape: (4, 3)
┌─────┬──────┬─────────────┐
│ a ┆ b ┆ diffs │
│ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ struct[2] │
╞═════╪══════╪═════════════╡
│ 1 ┆ 0.5 ┆ {null,null} │
│ 2 ┆ 4.0 ┆ {1,3.5} │
│ 3 ┆ 10.0 ┆ {1,6.0} │
│ 4 ┆ 13.0 ┆ {1,3.0} │
└─────┴──────┴─────────────┘
"""
return (
> self.lazy()
.with_columns(*exprs, **named_exprs)
.collect(no_optimization=True)
)
exprs = (<polars.expr.expr.Expr object at 0x7f205cbf8910>,)
named_exprs = {}
self = shape: (15_076, 11)
┌──────────┬────────────┬────────────┬───────────┬───┬──────────────┬──────┬─────┬────────────────...948"] │
└──────────┴────────────┴────────────┴───────────┴───┴──────────────┴──────┴─────┴───────────────────────────┘
/mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/polars/dataframe/frame.py:7631:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
args = (<LazyFrame [12 cols, {"round_id": Int64 … "velocity": List(Utf8)}] at 0x7F1F14467A90>,), kwargs = {'no_optimization': True}
@wraps(function)
def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
_rename_keyword_argument(
old_name, new_name, kwargs, function.__name__, version
)
> return function(*args, **kwargs)
args = (<LazyFrame [12 cols, {"round_id": Int64 … "velocity": List(Utf8)}] at 0x7F1F14467A90>,)
function = <function LazyFrame.collect at 0x7f205fa84c10>
kwargs = {'no_optimization': True}
new_name = 'comm_subplan_elim'
old_name = 'common_subplan_elimination'
version = '0.18.9'
/mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/polars/utils/deprecation.py:93:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <LazyFrame [12 cols, {"round_id": Int64 … "velocity": List(Utf8)}] at 0x7F1F14467A90>
@deprecate_renamed_parameter(
"common_subplan_elimination", "comm_subplan_elim", version="0.18.9"
)
def collect(
self,
*,
type_coercion: bool = True,
predicate_pushdown: bool = True,
projection_pushdown: bool = True,
simplify_expression: bool = True,
no_optimization: bool = False,
slice_pushdown: bool = True,
comm_subplan_elim: bool = True,
comm_subexpr_elim: bool = True,
streaming: bool = False,
) -> DataFrame:
"""
Collect into a DataFrame.
Note: use :func:`fetch` if you want to run your query on the first `n` rows
only. This can be a huge time saver in debugging queries.
Parameters
----------
type_coercion
Do type coercion optimization.
predicate_pushdown
Do predicate pushdown optimization.
projection_pushdown
Do projection pushdown optimization.
simplify_expression
Run simplify expressions optimization.
no_optimization
Turn off (certain) optimizations.
slice_pushdown
Slice pushdown optimization.
comm_subplan_elim
Will try to cache branching subplans that occur on self-joins or unions.
comm_subexpr_elim
Common subexpressions will be cached and reused.
streaming
Run parts of the query in a streaming fashion (this is in an alpha state)
Returns
-------
DataFrame
Examples
--------
>>> lf = pl.LazyFrame(
... {
... "a": ["a", "b", "a", "b", "b", "c"],
... "b": [1, 2, 3, 4, 5, 6],
... "c": [6, 5, 4, 3, 2, 1],
... }
... )
>>> lf.groupby("a", maintain_order=True).agg(pl.all().sum()).collect()
shape: (3, 3)
┌─────┬─────┬─────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ a ┆ 4 ┆ 10 │
│ b ┆ 11 ┆ 10 │
│ c ┆ 6 ┆ 1 │
└─────┴─────┴─────┘
"""
if no_optimization:
predicate_pushdown = False
projection_pushdown = False
slice_pushdown = False
comm_subplan_elim = False
comm_subexpr_elim = False
if streaming:
comm_subplan_elim = False
ldf = self._ldf.optimization_toggle(
type_coercion,
predicate_pushdown,
projection_pushdown,
simplify_expression,
slice_pushdown,
comm_subplan_elim,
comm_subexpr_elim,
streaming,
)
> return wrap_df(ldf.collect())
E exceptions.ComputeError: arithmetic on string and numeric not allowed, try an explicit cast first
comm_subexpr_elim = False
comm_subplan_elim = False
ldf = <builtins.PyLazyFrame object at 0x7f1f14821bb0>
no_optimization = True
predicate_pushdown = False
projection_pushdown = False
self = <LazyFrame [12 cols, {"round_id": Int64 … "velocity": List(Utf8)}] at 0x7F1F14467A90>
simplify_expression = True
slice_pushdown = False
streaming = False
type_coercion = True
/mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/polars/lazyframe/frame.py:1695: ComputeError
---------------------------------------------------------------------------------------------- Captured stdout call -----------------------------------------------------------------------------------------------
Downloading https://figshare.com/ndownloader/files/27039812 to /tmp/pytest-of-krakowczyk/pytest-0/test_public_dataset_processing1/GazeBase/downloads/GazeBase_v2_0.zip
Checking integrity of GazeBase_v2_0.zip
Extracting GazeBase_v2_0.zip to /tmp/pytest-of-krakowczyk/pytest-0/test_public_dataset_processing1/GazeBase/raw
---------------------------------------------------------------------------------------------- Captured stderr call -----------------------------------------------------------------------------------------------
GazeBase_v2_0.zip: 100%|██████████| 6.25G/6.25G [03:27<00:00, 32.4MB/s]
100%|██████████| 12334/12334 [08:12<00:00, 25.06it/s]
0%| | 22/12334 [00:01<10:37, 19.32it/s]
______________________________________________________________________________________ test_public_dataset_processing[SBSAT] ______________________________________________________________________________________
dataset_name = 'SBSAT', tmp_path = PosixPath('/tmp/pytest-of-krakowczyk/pytest-0/test_public_dataset_processing5')
@pytest.mark.parametrize(
'dataset_name',
list(pm.dataset.DatasetLibrary.definitions.keys()),
)
def test_public_dataset_processing(dataset_name, tmp_path):
# Initialize dataset.
dataset_path = tmp_path / dataset_name
dataset = pm.Dataset(dataset_name, path=dataset_path)
# Download and load in dataset.
> dataset.download()
dataset = <pymovements.dataset.dataset.Dataset object at 0x7f205bdcfc40>
dataset_name = 'SBSAT'
dataset_path = PosixPath('/tmp/pytest-of-krakowczyk/pytest-0/test_public_dataset_processing5/SBSAT')
tmp_path = PosixPath('/tmp/pytest-of-krakowczyk/pytest-0/test_public_dataset_processing5')
tests/integration/public_dataset_processing_test.py:36:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <pymovements.dataset.dataset.Dataset object at 0x7f205bdcfc40>
def download(
self,
*,
extract: bool = True,
remove_finished: bool = False,
verbose: int = 1,
) -> Dataset:
"""Download dataset resources.
This downloads all resources of the dataset. Per default this also extracts all archives
into :py:meth:`Dataset.paths.raw`,
To save space on your device you can remove the archive files after
successful extraction with ``remove_finished=True``.
If a corresponding file already exists in the local system, its checksum is calculated and
checked against the expected checksum.
Downloading will be evaded if the integrity of the existing file can be verified.
If the existing file does not match the expected checksum it is overwritten with the
downloaded new file.
Parameters
----------
extract : bool
Extract dataset archive files.
remove_finished : bool
Remove archive files after extraction.
verbose : int
Verbosity levels: (1) Show download progress bar and print info messages on downloading
and extracting archive files without printing messages for recursive archive extraction.
(2) Print additional messages for each recursive archive extract.
Raises
------
AttributeError
If number of mirrors or number of resources specified for dataset is zero.
RuntimeError
If downloading a resource failed for all given mirrors.
Returns
-------
PublicDataset
Returns self, useful for method cascading.
"""
> dataset_download.download_dataset(
definition=self.definition,
paths=self.paths,
extract=extract,
remove_finished=remove_finished,
verbose=bool(verbose),
)
extract = True
remove_finished = False
self = <pymovements.dataset.dataset.Dataset object at 0x7f205bdcfc40>
verbose = 1
src/pymovements/dataset/dataset.py:761:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
definition = SBSAT(name='SBSAT', mirrors=('https://files.de-1.osf.io/v1/resources/cdx69/providers/osfstorage/',), resources=({'reso...ns=['x_left', 'y_left'], position_columns=None, velocity_columns=None, acceleration_columns=None, distance_column=None)
paths = <pymovements.dataset.dataset_paths.DatasetPaths object at 0x7f205bdcfa90>, extract = True, remove_finished = False, verbose = True
def download_dataset(
definition: DatasetDefinition,
paths: DatasetPaths,
extract: bool = True,
remove_finished: bool = False,
verbose: bool = True,
) -> None:
"""Download dataset resources.
This downloads all resources of the dataset. Per default this also extracts all archives
into :py:meth:`Dataset.paths.raw`,
To save space on your device you can remove the archive files after
successful extraction with ``remove_finished=True``.
If a corresponding file already exists in the local system, its checksum is calculated and
checked against the expected checksum.
Downloading will be evaded if the integrity of the existing file can be verified.
If the existing file does not match the expected checksum it is overwritten with the
downloaded new file.
Parameters
----------
definition:
The dataset definition.
paths:
The dataset paths.
extract : bool
Extract dataset archive files.
remove_finished : bool
Remove archive files after extraction.
verbose : bool
If True, show progress of download and print status messages for integrity checking and
file extraction.
Raises
------
AttributeError
If number of mirrors or number of resources specified for dataset is zero.
RuntimeError
If downloading a resource failed for all given mirrors.
"""
if len(definition.mirrors) == 0:
raise AttributeError('number of mirrors must not be zero to download dataset')
if len(definition.resources) == 0:
raise AttributeError('number of resources must not be zero to download dataset')
paths.raw.mkdir(parents=True, exist_ok=True)
for resource in definition.resources:
success = False
for mirror_idx, mirror in enumerate(definition.mirrors):
url = f'{mirror}{resource["resource"]}'
try:
download_file(
url=url,
dirpath=paths.downloads,
filename=resource['filename'],
md5=resource['md5'],
verbose=verbose,
)
success = True
# pylint: disable=overlapping-except
except (URLError, OSError, RuntimeError) as error:
# Error downloading the resource, try next mirror
if mirror_idx < len(definition.mirrors) - 1:
print(f'Failed to download:\n{error}\nTrying next mirror.')
continue
# downloading the resource was successful, we don't need to try another mirror
break
if not success:
> raise RuntimeError(
f"downloading resource {resource['resource']} failed for all mirrors.",
)
E RuntimeError: downloading resource 64525979230ea6163c031267/?zip= failed for all mirrors.
definition = SBSAT(name='SBSAT', mirrors=('https://files.de-1.osf.io/v1/resources/cdx69/providers/osfstorage/',), resources=({'reso...ns=['x_left', 'y_left'], position_columns=None, velocity_columns=None, acceleration_columns=None, distance_column=None)
extract = True
mirror = 'https://files.de-1.osf.io/v1/resources/cdx69/providers/osfstorage/'
mirror_idx = 0
paths = <pymovements.dataset.dataset_paths.DatasetPaths object at 0x7f205bdcfa90>
remove_finished = False
resource = {'filename': 'csvs.zip', 'md5': '3cf074c93266b723437cf887f948c993', 'resource': '64525979230ea6163c031267/?zip='}
success = False
url = 'https://files.de-1.osf.io/v1/resources/cdx69/providers/osfstorage/64525979230ea6163c031267/?zip='
verbose = True
src/pymovements/dataset/dataset_download.py:108: RuntimeError
---------------------------------------------------------------------------------------------- Captured stdout call -----------------------------------------------------------------------------------------------
Downloading https://files.de-1.osf.io/v1/resources/cdx69/providers/osfstorage/64525979230ea6163c031267/?zip= to /tmp/pytest-of-krakowczyk/pytest-0/test_public_dataset_processing5/SBSAT/downloads/csvs.zip
Checking integrity of csvs.zip
---------------------------------------------------------------------------------------------- Captured stderr call -----------------------------------------------------------------------------------------------
csvs.zip: 100%|██████████| 403M/403M [04:03<00:00, 1.74MB/s]
================================================================================================ warnings summary =================================================================================================
../../../../mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/matplotlib/__init__.py:169
../../../../mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/matplotlib/__init__.py:169
../../../../mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/matplotlib/__init__.py:169
../../../../mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/matplotlib/__init__.py:169
../../../../mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/matplotlib/__init__.py:169
/mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/matplotlib/__init__.py:169: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
if LooseVersion(module.__version__) < minver:
../../../../mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/setuptools/_distutils/version.py:346
../../../../mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/setuptools/_distutils/version.py:346
../../../../mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/setuptools/_distutils/version.py:346
../../../../mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/setuptools/_distutils/version.py:346
../../../../mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/setuptools/_distutils/version.py:346
/mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
other = LooseVersion(other)
-- Docs: https://docs.pytest.org/en/stable/warnings.html
============================================================================================= short test summary info =============================================================================================
FAILED tests/integration/public_dataset_processing_test.py::test_public_dataset_processing[GazeBase] - exceptions.ComputeError: arithmetic on string and numeric not allowed, try an explicit cast first
FAILED tests/integration/public_dataset_processing_test.py::test_public_dataset_processing[SBSAT] - RuntimeError: downloading resource 64525979230ea6163c031267/?zip= failed for all mirrors.
============================================================================== 2 failed, 6 passed, 10 warnings in 4148.91s (1:09:08) ==============================================================================
The error on GazeBase exactly reproduces #517. I will now merge #593 into this PR and see if we get rid of the error.
The failure on SB-Sat is strange though. @prassepaul do you know why that happened?
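One way to narrow this down (a guess, not a confirmed diagnosis): the captured output for SBSAT shows csvs.zip downloading to 100% before "Checking integrity of csvs.zip" is printed, so the archive on the server may no longer match the MD5 stored in the dataset definition. A minimal sketch for checking a downloaded file by hand; md5_of_file and matches_expected_md5 are hypothetical helpers, not pymovements functions:

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the MD5 hex digest of a file, reading it in chunks.

    Reading in chunks keeps memory usage flat even for large archives.
    """
    digest = hashlib.md5()
    with Path(path).open('rb') as file:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: file.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected_md5(path, expected_md5):
    """Compare the file's digest with the expected one (case-insensitive)."""
    return md5_of_file(path) == expected_md5.lower()
```

Calling matches_expected_md5 on the downloaded csvs.zip with the md5 value from the SBSAT resource entry would show whether the checksum is the part that fails.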
All modified lines are covered by tests :white_check_mark:
Comparison is base (8836275) 100.00% compared to head (575ba37) 100.00%.
Merging #593 into this PR resolves #517:
=============================================================================================== test session starts ===============================================================================================
platform linux -- Python 3.9.12, pytest-6.2.5, py-1.11.0, pluggy-1.0.0
rootdir: /home/krakowczyk/workspace/pymovements, configfile: pyproject.toml
plugins: anyio-3.7.1, dash-2.11.1, lazy-fixture-0.6.3, hydra-core-1.3.2, cov-3.0.0
collected 8 items
tests/integration/public_dataset_processing_test.py .....F.. [100%]
==================================================================================================== FAILURES =====================================================================================================
______________________________________________________________________________________ test_public_dataset_processing[SBSAT] ______________________________________________________________________________________
dataset_name = 'SBSAT', tmp_path = PosixPath('/tmp/pytest-of-krakowczyk/pytest-1/test_public_dataset_processing5')
@pytest.mark.parametrize(
'dataset_name',
list(pm.dataset.DatasetLibrary.definitions.keys()),
)
def test_public_dataset_processing(dataset_name, tmp_path):
# Initialize dataset.
dataset_path = tmp_path / dataset_name
dataset = pm.Dataset(dataset_name, path=dataset_path)
# Download and load in dataset.
> dataset.download()
dataset = <pymovements.dataset.dataset.Dataset object at 0x7f1f13b22460>
dataset_name = 'SBSAT'
dataset_path = PosixPath('/tmp/pytest-of-krakowczyk/pytest-1/test_public_dataset_processing5/SBSAT')
tmp_path = PosixPath('/tmp/pytest-of-krakowczyk/pytest-1/test_public_dataset_processing5')
tests/integration/public_dataset_processing_test.py:36:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <pymovements.dataset.dataset.Dataset object at 0x7f1f13b22460>
def download(
self,
*,
extract: bool = True,
remove_finished: bool = False,
verbose: int = 1,
) -> Dataset:
"""Download dataset resources.
This downloads all resources of the dataset. Per default this also extracts all archives
into :py:meth:`Dataset.paths.raw`,
To save space on your device you can remove the archive files after
successful extraction with ``remove_finished=True``.
If a corresponding file already exists in the local system, its checksum is calculated and
checked against the expected checksum.
Downloading will be evaded if the integrity of the existing file can be verified.
If the existing file does not match the expected checksum it is overwritten with the
downloaded new file.
Parameters
----------
extract : bool
Extract dataset archive files.
remove_finished : bool
Remove archive files after extraction.
verbose : int
Verbosity levels: (1) Show download progress bar and print info messages on downloading
and extracting archive files without printing messages for recursive archive extraction.
(2) Print additional messages for each recursive archive extract.
Raises
------
AttributeError
If number of mirrors or number of resources specified for dataset is zero.
RuntimeError
If downloading a resource failed for all given mirrors.
Returns
-------
PublicDataset
Returns self, useful for method cascading.
"""
> dataset_download.download_dataset(
definition=self.definition,
paths=self.paths,
extract=extract,
remove_finished=remove_finished,
verbose=bool(verbose),
)
extract = True
remove_finished = False
self = <pymovements.dataset.dataset.Dataset object at 0x7f1f13b22460>
verbose = 1
src/pymovements/dataset/dataset.py:761:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
definition = SBSAT(name='SBSAT', mirrors=('https://files.de-1.osf.io/v1/resources/cdx69/providers/osfstorage/',), resources=({'reso...ns=['x_left', 'y_left'], position_columns=None, velocity_columns=None, acceleration_columns=None, distance_column=None)
paths = <pymovements.dataset.dataset_paths.DatasetPaths object at 0x7f1f13b22220>, extract = True, remove_finished = False, verbose = True
def download_dataset(
definition: DatasetDefinition,
paths: DatasetPaths,
extract: bool = True,
remove_finished: bool = False,
verbose: bool = True,
) -> None:
"""Download dataset resources.
This downloads all resources of the dataset. Per default this also extracts all archives
into :py:meth:`Dataset.paths.raw`,
To save space on your device you can remove the archive files after
successful extraction with ``remove_finished=True``.
If a corresponding file already exists in the local system, its checksum is calculated and
checked against the expected checksum.
Downloading will be evaded if the integrity of the existing file can be verified.
If the existing file does not match the expected checksum it is overwritten with the
downloaded new file.
Parameters
----------
definition:
The dataset definition.
paths:
The dataset paths.
extract : bool
Extract dataset archive files.
remove_finished : bool
Remove archive files after extraction.
verbose : bool
If True, show progress of download and print status messages for integrity checking and
file extraction.
Raises
------
AttributeError
If number of mirrors or number of resources specified for dataset is zero.
RuntimeError
If downloading a resource failed for all given mirrors.
"""
    if len(definition.mirrors) == 0:
        raise AttributeError('number of mirrors must not be zero to download dataset')
    if len(definition.resources) == 0:
        raise AttributeError('number of resources must not be zero to download dataset')
    paths.raw.mkdir(parents=True, exist_ok=True)
    for resource in definition.resources:
        success = False
        for mirror_idx, mirror in enumerate(definition.mirrors):
            url = f'{mirror}{resource["resource"]}'
            try:
                download_file(
                    url=url,
                    dirpath=paths.downloads,
                    filename=resource['filename'],
                    md5=resource['md5'],
                    verbose=verbose,
                )
                success = True
            # pylint: disable=overlapping-except
            except (URLError, OSError, RuntimeError) as error:
                # Error downloading the resource, try next mirror
                if mirror_idx < len(definition.mirrors) - 1:
                    print(f'Failed to download:\n{error}\nTrying next mirror.')
                continue
            # downloading the resource was successful, we don't need to try another mirror
            break
        if not success:
>           raise RuntimeError(
                f"downloading resource {resource['resource']} failed for all mirrors.",
            )
E RuntimeError: downloading resource 64525979230ea6163c031267/?zip= failed for all mirrors.
definition = SBSAT(name='SBSAT', mirrors=('https://files.de-1.osf.io/v1/resources/cdx69/providers/osfstorage/',), resources=({'reso...ns=['x_left', 'y_left'], position_columns=None, velocity_columns=None, acceleration_columns=None, distance_column=None)
extract = True
mirror = 'https://files.de-1.osf.io/v1/resources/cdx69/providers/osfstorage/'
mirror_idx = 0
paths = <pymovements.dataset.dataset_paths.DatasetPaths object at 0x7f1f13b22220>
remove_finished = False
resource = {'filename': 'csvs.zip', 'md5': '3cf074c93266b723437cf887f948c993', 'resource': '64525979230ea6163c031267/?zip='}
success = False
url = 'https://files.de-1.osf.io/v1/resources/cdx69/providers/osfstorage/64525979230ea6163c031267/?zip='
verbose = True
src/pymovements/dataset/dataset_download.py:108: RuntimeError
---------------------------------------------------------------------------------------------- Captured stdout call -----------------------------------------------------------------------------------------------
Downloading https://files.de-1.osf.io/v1/resources/cdx69/providers/osfstorage/64525979230ea6163c031267/?zip= to /tmp/pytest-of-krakowczyk/pytest-1/test_public_dataset_processing5/SBSAT/downloads/csvs.zip
Checking integrity of csvs.zip
---------------------------------------------------------------------------------------------- Captured stderr call -----------------------------------------------------------------------------------------------
csvs.zip: 100%|██████████| 403M/403M [03:58<00:00, 1.77MB/s]
================================================================================================ warnings summary =================================================================================================
../../../../mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/matplotlib/__init__.py:169
../../../../mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/matplotlib/__init__.py:169
../../../../mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/matplotlib/__init__.py:169
../../../../mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/matplotlib/__init__.py:169
../../../../mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/matplotlib/__init__.py:169
/mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/matplotlib/__init__.py:169: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
if LooseVersion(module.__version__) < minver:
../../../../mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/setuptools/_distutils/version.py:346
../../../../mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/setuptools/_distutils/version.py:346
../../../../mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/setuptools/_distutils/version.py:346
../../../../mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/setuptools/_distutils/version.py:346
../../../../mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/setuptools/_distutils/version.py:346
/mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
other = LooseVersion(other)
-- Docs: https://docs.pytest.org/en/stable/warnings.html
============================================================================================= short test summary info =============================================================================================
FAILED tests/integration/public_dataset_processing_test.py::test_public_dataset_processing[SBSAT] - RuntimeError: downloading resource 64525979230ea6163c031267/?zip= failed for all mirrors.
============================================================================== 1 failed, 7 passed, 10 warnings in 5418.02s (1:30:18) ==============================================================================
Also note the runtime of 5418.02s (1:30:18), on our DGX using all cores and 60+ GB RAM.
The problem with SBSAT should be solved in another issue. This PR is now ready for review.
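One detail in the traceback above that may make SBSAT harder to debug: the definition lists a single mirror, and download_dataset only prints the caught error when another mirror is left to try, so the underlying exception never appears in the output. A minimal sketch of a fallback loop that keeps the last error and chains it to the final RuntimeError could look like this. The names download_with_fallback and fetch are hypothetical; fetch stands in for download_file and is not part of pymovements:

```python
def download_with_fallback(resource, mirrors, fetch):
    """Try each mirror in order and return the URL that worked.

    ``fetch`` is a hypothetical callable that takes a URL and raises on
    failure. The last caught error is chained to the final RuntimeError,
    so it shows up in the traceback even with a single mirror.
    """
    if len(mirrors) == 0:
        raise AttributeError('number of mirrors must not be zero to download dataset')

    last_error = None
    for mirror in mirrors:
        url = f'{mirror}{resource}'
        try:
            fetch(url)
        # URLError is a subclass of OSError, so it is covered here as well.
        except (OSError, RuntimeError) as error:
            last_error = error
            print(f'Failed to download from {mirror}:\n{error}')
            continue
        # Downloading was successful, no need to try another mirror.
        return url

    raise RuntimeError(
        f'downloading resource {resource} failed for all mirrors.',
    ) from last_error
```

With the error chained like this, the failing run would report both the "failed for all mirrors" message and the original cause.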
We will not include integration tests in our CI (yet). This single test run took 90 minutes (with one dataset failing at the download step). I think integration testing should rather be limited to a single run before publishing each release.
As long as we haven't solved our very high memory usage, I can do these test runs manually on our DGX via tox -e integration.
Description
A first version to try out downloading and processing public datasets
This should fail until #517 is fixed
We have to find some solution such that integration tests are run only very rarely (what about only when publishing new releases?).
For now I would just add --ignore=^tests/integration under [tool.pytest.ini_options] in pyproject.toml, because we definitely do not want to preprocess all datasets for each commit, as this would take forever (even if they are cached).
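As an alternative to a static --ignore entry, the "only run on demand" idea could be sketched with a conftest.py at the repository root. The --run-integration flag and the is_integration_test helper are hypothetical names for illustration and not something this PR adds; the hooks themselves are pytest's standard pytest_addoption and pytest_collection_modifyitems:

```python
# conftest.py (sketch; the --run-integration option name is an assumption)
from pathlib import Path

INTEGRATION_DIR = Path('tests') / 'integration'


def is_integration_test(test_path, rootdir):
    """Return True if test_path lies inside <rootdir>/tests/integration."""
    try:
        relative = Path(test_path).relative_to(rootdir)
    except ValueError:
        # The test file is not located below rootdir at all.
        return False
    return relative.parts[:2] == INTEGRATION_DIR.parts


def pytest_addoption(parser):
    """Register the opt-in flag for the slow integration tests."""
    parser.addoption(
        '--run-integration',
        action='store_true',
        default=False,
        help='also run the slow tests under tests/integration',
    )


def pytest_collection_modifyitems(config, items):
    """Mark integration tests as skipped unless the flag was given."""
    if config.getoption('--run-integration'):
        return
    # Imported here so the helper above can be used without pytest installed.
    import pytest

    skip_integration = pytest.mark.skip(reason='needs --run-integration to run')
    for item in items:
        if is_integration_test(str(item.fspath), str(config.rootdir)):
            item.add_marker(skip_integration)
```

With this in place, a plain pytest run would report everything under tests/integration as skipped, while pytest --run-integration would include it, for example before a release.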