Open jdenhof opened 4 weeks ago
Hey @jdenhof,
I did some research, and it definitely looks like len() should return the number of batches produced by the iteration. Does that align with your expectation?
Yes, that is what I was expecting when calling len() on the experiment_datapipe.
Sounds good, I'll push a fix for this. I'll let you know when it's released.
I am using the ExperimentDataPipe in conjunction with experiment_dataloader and noticed that len() on an ExperimentDataPipe returns the total number of samples and does not take batch size into account. I feed the result of len() to random_split, as the documentation says this is best practice. I believe random_split expects the total number of iterations, not the total number of samples. I could divide this number by the batch size, but I feel this may cause issues if there are leftover samples.
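To make the leftover-sample concern concrete, here is a minimal sketch of the batch-count arithmetic. It is not the library's implementation: `num_batches` is a hypothetical helper, and the `drop_last` flag only mirrors the usual DataLoader meaning.

```python
def num_batches(n_samples: int, batch_size: int, drop_last: bool = False) -> int:
    """Number of batches produced when iterating n_samples in chunks of batch_size.

    Hypothetical helper for illustration only.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    if drop_last:
        # Floor division: a trailing partial batch is not counted.
        return n_samples // batch_size
    # Ceiling division in pure integer arithmetic: a partial batch counts as one.
    return (n_samples + batch_size - 1) // batch_size


# 10 samples with batch_size=4 give two full batches plus one batch of 2.
print(num_batches(10, 4))                  # partial batch counted
print(num_batches(10, 4, drop_last=True))  # partial batch dropped
```

Plain floor division undercounts by one whenever the sample count is not a multiple of the batch size, which is the case I am worried about.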
To Reproduce
Is this expected behavior? If so, how would you go about dropping the last batch if it doesn't align with batch_size? You cannot pass this to the DataLoader as a kwarg because batch_size=None is mutually exclusive with drop_last. Would it also cause problems to pass this as total_samples to random_split?
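One possible workaround, sketched under assumptions: wrap the iterable of batches and skip any batch that is not full. `drop_incomplete_batches` is a hypothetical helper, not part of cellxgene-census or PyTorch, and it assumes that `len(batch)` gives the number of rows in a batch. If a batch is a tuple of tensors, the size check would need to look at the first tensor's leading dimension instead.

```python
from typing import Iterable, Iterator, Sequence, TypeVar

T = TypeVar("T")


def drop_incomplete_batches(
    batches: Iterable[Sequence[T]], batch_size: int
) -> Iterator[Sequence[T]]:
    """Yield only the batches that contain exactly batch_size rows.

    Hypothetical helper; assumes len(batch) is the row count of the batch.
    """
    for batch in batches:
        if len(batch) == batch_size:
            yield batch


# Toy stand-in for batches coming out of a data loader.
toy_batches = [[1, 2], [3, 4], [5]]
full_only = list(drop_incomplete_batches(toy_batches, batch_size=2))
print(full_only)
```

This filters at iteration time, so a len() computed upstream would still need to be adjusted to match.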
absl-py 2.1.0
aiobotocore 2.13.0
aiohttp 3.9.5
aioitertools 0.11.0
aiosignal 1.3.1
alembic 1.13.1
aniso8601 9.0.1
anndata 0.10.7
array_api_compat 1.6
async-timeout 4.0.3
attrs 23.2.0
blinker 1.8.2
botocore 1.34.106
cachetools 5.3.3
calmsize 0.1.3
cellxgene-census 1.13.1
certifi 2024.2.2
charset-normalizer 3.3.2
click 8.1.7
cloudpickle 3.0.0
contourpy 1.2.1
cycler 0.12.1
Deprecated 1.2.14
docker 7.1.0
docstring_parser 0.16
entrypoints 0.4
exceptiongroup 1.2.1
filelock 3.13.1
Flask 3.0.3
fonttools 4.51.0
frozenlist 1.4.1
fsspec 2024.5.0
get-annotations 0.1.2
gitdb 4.0.11
GitPython 3.1.43
graphene 3.3
graphql-core 3.2.3
graphql-relay 3.2.0
greenlet 3.0.3
grpcio 1.64.0
gunicorn 22.0.0
h5py 3.11.0
idna 3.7
igraph 0.11.5
importlib-metadata 7.0.0
importlib_resources 6.4.0
itsdangerous 2.2.0
Jinja2 3.1.3
jmespath 1.0.1
joblib 1.4.2
jsonargparse 4.29.0
kiwisolver 1.4.5
legacy-api-wrap 1.4
leidenalg 0.10.2
lightning 2.2.5
lightning-utilities 0.11.2
llvmlite 0.42.0
Mako 1.3.5
Markdown 3.6
MarkupSafe 2.1.5
matplotlib 3.9.0
mlflow 2.13.0
mpmath 1.2.1
multidict 6.0.5
natsort 8.4.0
networkx 3.2.1
numba 0.59.1
numpy 1.26.4
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-nccl-cu12 2.20.5
nvidia-nvjitlink-cu12 12.1.105
nvidia-nvtx-cu12 12.1.105
opentelemetry-api 1.24.0
opentelemetry-sdk 1.24.0
opentelemetry-semantic-conventions 0.45b0
packaging 24.0
pandas 2.2.2
patsy 0.5.6
Pillow 9.3.0
pip 24.0
protobuf 4.25.3
psutil 5.9.8
pyarrow 15.0.2
pyarrow-hotfix 0.6
pydot 2.0.0
pynndescent 0.5.12
pyparsing 3.1.2
python-dateutil 2.9.0.post0
pytorch-lightning 2.2.5
pytorch-memlab 0.3.0
pytorch-triton 3.0.0+45fff310c8
pytz 2024.1
PyYAML 6.0.1
querystring-parser 1.2.4
requests 2.32.2
s3fs 2024.5.0
scanpy 1.10.1
scib 1.1.5
scikit-learn 1.5.0
scikit-misc 0.3.1
scipy 1.13.0
seaborn 0.13.2
session_info 1.0.0
setuptools 53.0.0
six 1.16.0
smmap 5.0.1
somacore 1.0.10
SQLAlchemy 2.0.30
sqlparse 0.5.0
statsmodels 0.14.2
stdlib-list 0.10.0
sympy 1.12
tensorboard 2.16.2
tensorboard-data-server 0.7.2
texttable 1.7.0
threadpoolctl 3.5.0
tiledb 0.27.1
tiledbsoma 1.9.5
torch 2.4.0.dev20240521+cu121
torch-tb-profiler 0.4.3
torchaudio 2.2.0.dev20240521+cu121
torchdata 0.7.1
torchmetrics 1.4.0.post0
torchvision 0.19.0.dev20240521+cu121
tqdm 4.66.4
typeshed_client 2.5.1
typing_extensions 4.8.0
tzdata 2024.1
umap-learn 0.5.6
urllib3 1.26.18
Werkzeug 3.0.3
wrapt 1.16.0
yarl 1.9.4
zipp 3.18.2