dandi / helpdesk

Repository to track help tickets from users.

Inaccurate Dandiset size #161

Closed: vncntprvst closed this issue 1 month ago

vncntprvst commented 1 month ago

Posting this following a discussion in the debug channel of the DANDI Slack: https://dandiarchive.slack.com/archives/C011LRNU2F7/p1718041049153239

Bug description

The main page for the Mesoscale Activity Map Dataset (ID 000363) indicates a size of 53.6 GB. However, when downloading, individual NWB files that contain videos are way larger than that.

Expected behaviour

File size should be accurate

Actual behaviour

It is not. For example, see this screenshot for one of the files: [image: Screenshot 2024-08-18 150648]

How to reproduce

Example script:

```python
"""
Download the [Mesoscale Activity Map Dataset](https://doi.org/10.48324/dandi.000363/0.230822.0128).
Example usage:
    python download_map_ephys.py
"""
import os

from dandi import dandiapi

dandiset_id = "000363"
download_loc = "/tmp/map_ephys"

client = dandiapi.DandiAPIClient()
map_ephys_ds = client.get_dandiset(dandiset_id)

print(f"Got dandiset {map_ephys_ds}")

# Iterate over all assets, skipping those with "probe" in their path
assets = list(map_ephys_ds.get_assets())
print(f"Downloading {len(assets)} files from {dandiset_id} to {download_loc}")
for file_num, file in enumerate(assets):
    if "probe" in file.path:
        continue
    print()
    print(f"Downloading file {file_num+1}/{len(assets)}: {file.path}, size {file.size} bytes")

    filepath = f"{download_loc}/{file.path}"

    if os.path.exists(filepath):
        # Compare the size on disk with the size reported by the archive
        file_size = os.path.getsize(filepath)
        if file_size != file.size:
            print(f"Size mismatch for {filepath}: {file_size} bytes on disk, {file.size} bytes expected")
            # Remove the file and re-download
            os.remove(filepath)
        else:
            print(f"File {filepath} already exists")
            continue

    # Create the parent directory if it does not exist yet
    os.makedirs(os.path.dirname(filepath), exist_ok=True)

    try:
        print(f"Downloading {file.path}")
        file.download(filepath)
        print(f"Downloaded file to {filepath}")
    except Exception as e:
        print(f"Error downloading {file.path}: {e}")
print()
print("Done downloading files.")
```

Your personal set up

Full environment

```
# Name Version Build Channel
_libgcc_mutex 0.1 main
_openmp_mutex 5.1 1_gnu
aiobotocore 2.13.0 pypi_0 pypi
aiohttp 3.9.5 py310h5eee18b_0
aioitertools 0.11.0 pypi_0 pypi
aiosignal 1.2.0 pyhd3eb1b0_0
annotated-types 0.7.0 pypi_0 pypi
anyio 4.2.0 py310h06a4308_0
argon2-cffi 21.3.0 pyhd3eb1b0_0
argon2-cffi-bindings 21.2.0 py310h7f8727e_0
argschema 2.0.2 pypi_0 pypi
arrow 1.3.0 pypi_0 pypi
asciitree 0.3.3 pypi_0 pypi
asttokens 2.0.5 pyhd3eb1b0_0
async-lru 2.0.4 py310h06a4308_0
async-timeout 4.0.3 py310h06a4308_0
attrs 23.1.0 py310h06a4308_0
autograd 1.6.2 pypi_0 pypi
babel 2.11.0 py310h06a4308_0
backports-tarfile 1.2.0 pypi_0 pypi
bcrypt 4.1.3 pypi_0 pypi
beautifulsoup4 4.12.2 py310h06a4308_0
bidsschematools 0.7.2 pypi_0 pypi
blas 1.0 mkl
bleach 4.1.0 pyhd3eb1b0_0
blessed 1.20.0 pypi_0 pypi
blosc2 2.6.2 pypi_0 pypi
botocore 1.34.106 pypi_0 pypi
bottleneck 1.3.7 py310ha9d4c09_0
bqplot 0.12.43 pypi_0 pypi
brainglobe-atlasapi 2.0.6 pypi_0 pypi
brainglobe-space 1.0.2 pypi_0 pypi
brainglobe-utils 0.5.0 pypi_0 pypi
brainrender 2.1.9 pypi_0 pypi
brotli 1.0.9 h5eee18b_8
brotli-bin 1.0.9 h5eee18b_8
brotli-python 1.0.9 py310h6a678d5_8
bs4 0.0.2 pypi_0 pypi
bzip2 1.0.8 h5eee18b_6
c-ares 1.19.1 h5eee18b_0
ca-certificates 2024.3.11 h06a4308_0
ccfwidget 0.5.3 pypi_0 pypi
cebra 0.4.0 pypi_0 pypi
cellpose 3.0.8 pypi_0 pypi
certifi 2024.6.2 py310h06a4308_0
cffi 1.16.0 py310h5eee18b_1
charset-normalizer 2.0.4 pyhd3eb1b0_0
ci-info 0.3.0 pypi_0 pypi
click 8.1.7 pypi_0 pypi
click-didyoumean 0.3.1 pypi_0 pypi
cloudpickle 3.0.0 pypi_0 pypi
colorcet 3.1.0 pypi_0 pypi
comm 0.2.1 py310h06a4308_0
configobj 5.0.8 pypi_0 pypi
contourpy 1.2.0 py310hdb19cb5_0
coverage 7.5.3 pypi_0 pypi
cramjam 2.8.3 pypi_0 pypi
cryptography 42.0.8 pypi_0 pypi
curl 8.7.1 hdbd6064_0
cycler 0.11.0 pyhd3eb1b0_0
cyrus-sasl 2.1.28 h52b45da_1
cython 3.0.10 pypi_0 pypi
dandi 0.62.1 pypi_0 pypi
dandischema 0.10.1 pypi_0 pypi
dask 2024.5.2 pypi_0 pypi
dbus 1.13.18 hb2f20db_0
debugpy 1.6.7 py310h6a678d5_0
decorator 5.1.1 pyhd3eb1b0_0
defusedxml 0.7.1 pyhd3eb1b0_0
dnspython 2.6.1 pypi_0 pypi
elephant 0.12.0 pypi_0 pypi
email-validator 2.1.1 pypi_0 pypi
et-xmlfile 1.1.0 pypi_0 pypi
etelemetry 0.3.1 pypi_0 pypi
exceptiongroup 1.2.0 py310h06a4308_0
executing 0.8.3 pyhd3eb1b0_0
expat 2.6.2 h6a678d5_0
fasteners 0.19 pypi_0 pypi
fastremap 1.14.1 pypi_0 pypi
fontconfig 2.14.1 h4c34cd2_2
fonttools 4.51.0 py310h5eee18b_0
fqdn 1.5.1 pypi_0 pypi
freetype 2.12.1 h4a9f257_0
frozenlist 1.4.0 py310h5eee18b_0
fscacher 0.4.1 pypi_0 pypi
fsspec 2024.6.0 pypi_0 pypi
future 1.0.0 pypi_0 pypi
gast 0.4.0 pypi_0 pypi
gdbm 1.18 hd4cb3f1_4
gettext 0.21.0 h39681ba_1
git 2.40.1 pl5340h36fbf9e_1
glib 2.78.4 h6a678d5_0
glib-tools 2.78.4 h6a678d5_0
google 3.0.0 pypi_0 pypi
gst-plugins-base 1.14.1 h6a678d5_1
gstreamer 1.14.1 h5eee18b_1
h11 0.14.0 pypi_0 pypi
h5py 3.11.0 pypi_0 pypi
hdmf 3.14.1 pypi_0 pypi
httpcore 1.0.5 pypi_0 pypi
httpx 0.27.0 pypi_0 pypi
humanize 4.9.0 pypi_0 pypi
icu 73.1 h6a678d5_0
idna 3.7 py310h06a4308_0
imagecodecs 2024.6.1 pypi_0 pypi
imageio 2.34.1 pypi_0 pypi
imio 0.3.1 pypi_0 pypi
importlib-metadata 4.13.0 pypi_0 pypi
iniconfig 2.0.0 pypi_0 pypi
intel-openmp 2023.1.0 hdb19cb5_46306
interleave 0.2.1 pypi_0 pypi
ipydatagrid 1.3.2 pypi_0 pypi
ipydatawidgets 4.3.2 pypi_0 pypi
ipyfilechooser 0.6.0 pypi_0 pypi
ipykernel 6.28.0 py310h06a4308_0
ipympl 0.9.4 pypi_0 pypi
ipython 8.20.0 py310h06a4308_0
ipython-genutils 0.2.0 pypi_0 pypi
ipytree 0.2.2 pypi_0 pypi
ipyvolume 0.6.3 pypi_0 pypi
ipyvue 1.11.1 pypi_0 pypi
ipyvuetify 1.9.4 pypi_0 pypi
ipywebrtc 0.6.0 pypi_0 pypi
ipywidgets 8.0.0 pypi_0 pypi
isodate 0.6.1 pypi_0 pypi
isoduration 20.11.0 pypi_0 pypi
itk-core 5.4.0 pypi_0 pypi
itk-filtering 5.4.0 pypi_0 pypi
itk-meshtopolydata 0.11.0 pypi_0 pypi
itk-numerics 5.4.0 pypi_0 pypi
itkwidgets 0.32.4 pypi_0 pypi
jaraco-classes 3.4.0 pypi_0 pypi
jaraco-context 5.3.0 pypi_0 pypi
jaraco-functools 4.0.1 pypi_0 pypi
jax 0.4.28 pypi_0 pypi
jaxlib 0.4.28 pypi_0 pypi
jedi 0.18.1 py310h06a4308_1
jeepney 0.8.0 pypi_0 pypi
jinja2 3.1.4 py310h06a4308_0
jmespath 1.0.1 pypi_0 pypi
joblib 1.4.2 pypi_0 pypi
jpeg 9e h5eee18b_1
json5 0.9.6 pyhd3eb1b0_0
jsonpointer 2.4 pypi_0 pypi
jsonschema 4.19.2 py310h06a4308_0
jsonschema-specifications 2023.7.1 py310h06a4308_0
jupyter 1.0.0 pypi_0 pypi
jupyter-console 6.6.3 pypi_0 pypi
jupyter-lsp 2.2.0 py310h06a4308_0
jupyter_client 8.6.0 py310h06a4308_0
jupyter_core 5.5.0 py310h06a4308_0
jupyter_events 0.8.0 py310h06a4308_0
jupyter_server 2.10.0 py310h06a4308_0
jupyter_server_terminals 0.4.4 py310h06a4308_1
jupyterlab 4.2.1 pypi_0 pypi
jupyterlab-server 2.27.2 pypi_0 pypi
jupyterlab_pygments 0.1.2 py_0
jupyterlab_widgets 3.0.10 py310h06a4308_0
k3d 2.16.1 pypi_0 pypi
keyring 25.2.1 pypi_0 pypi
keyrings-alt 5.0.1 pypi_0 pypi
kiwisolver 1.4.4 py310h6a678d5_0
krb5 1.20.1 h143b758_1
lazy-loader 0.4 pypi_0 pypi
lcms2 2.12 h3be6417_0
ld_impl_linux-64 2.38 h1181459_1
lerc 3.0 h295c915_0
libbrotlicommon 1.0.9 h5eee18b_8
libbrotlidec 1.0.9 h5eee18b_8
libbrotlienc 1.0.9 h5eee18b_8
libclang 14.0.6 default_hc6dbbc7_1
libclang13 14.0.6 default_he11475f_1
libcups 2.4.2 h2d74bed_1
libcurl 8.7.1 h251f7ec_0
libdeflate 1.17 h5eee18b_1
libedit 3.1.20230828 h5eee18b_0
libev 4.33 h7f8727e_1
libffi 3.4.4 h6a678d5_1
libgcc-ng 11.2.0 h1234567_1
libglib 2.78.4 hdc74915_0
libgomp 11.2.0 h1234567_1
libiconv 1.16 h5eee18b_3
libllvm14 14.0.6 hdb19cb5_3
libnghttp2 1.57.0 h2d74bed_0
libpng 1.6.39 h5eee18b_0
libpq 12.17 hdbd6064_0
libsodium 1.0.18 h7b6447c_0
libssh2 1.11.0 h251f7ec_0
libstdcxx-ng 11.2.0 h1234567_1
libtiff 4.5.1 h6a678d5_0
libuuid 1.41.5 h5eee18b_0
libwebp-base 1.3.2 h5eee18b_0
libxcb 1.15 h7f8727e_0
libxkbcommon 1.0.1 h5eee18b_1
libxml2 2.10.4 hfdd30dd_2
lindi 0.3.14 pypi_0 pypi
literate-dataclasses 0.0.6 pypi_0 pypi
llvmlite 0.42.0 pypi_0 pypi
locket 1.0.0 pypi_0 pypi
loguru 0.7.2 pypi_0 pypi
lz4-c 1.9.4 h6a678d5_1
markdown-it-py 3.0.0 pypi_0 pypi
markupsafe 2.1.3 py310h5eee18b_0
marshmallow 3.0.0rc6 pypi_0 pypi
matplotlib 3.6.2 pypi_0 pypi
matplotlib-inline 0.1.6 py310h06a4308_0
mdurl 0.1.2 pypi_0 pypi
meshio 5.3.5 pypi_0 pypi
mistune 2.0.4 py310h06a4308_0
mkl 2023.1.0 h213fc3f_46344
mkl-service 2.4.0 py310h5eee18b_1
mkl_fft 1.3.8 py310h5eee18b_0
mkl_random 1.2.4 py310hdb19cb5_0
ml-dtypes 0.4.0 pypi_0 pypi
more-itertools 10.2.0 pypi_0 pypi
morphapi 0.2.5 pypi_0 pypi
morphio 3.3.9 pypi_0 pypi
mpl-interactions 0.22.0 pypi_0 pypi
msgpack 1.0.8 pypi_0 pypi
multidict 6.0.4 py310h5eee18b_0
munkres 1.1.4 pypi_0 pypi
mysql 5.7.24 h721c034_2
myterial 1.2.1 pypi_0 pypi
natsort 8.4.0 pypi_0 pypi
nbclient 0.8.0 py310h06a4308_0
nbconvert 7.10.0 py310h06a4308_0
nbformat 5.9.2 py310h06a4308_0
ncurses 6.4 h6a678d5_0
nd2 0.10.1 pypi_0 pypi
ndindex 1.8 pypi_0 pypi
ndx-grayscalevolume 0.0.2 pypi_0 pypi
ndx-icephys-meta 0.1.0 pypi_0 pypi
ndx-spectrum 0.2.2 pypi_0 pypi
neo 0.13.0 pypi_0 pypi
nest-asyncio 1.6.0 py310h06a4308_0
networkx 3.3 pypi_0 pypi
neurom 3.2.11 pypi_0 pypi
nibabel 5.2.1 pypi_0 pypi
notebook 7.2.1 pypi_0 pypi
notebook-shim 0.2.3 py310h06a4308_0
nptyping 2.5.0 pypi_0 pypi
numba 0.59.1 pypi_0 pypi
numcodecs 0.12.1 pypi_0 pypi
numexpr 2.8.7 py310h85018f9_0
numpy 1.23.5 pypi_0 pypi
nvidia-cublas-cu11 11.10.3.66 pypi_0 pypi
nvidia-cuda-nvrtc-cu11 11.7.99 pypi_0 pypi
nvidia-cuda-runtime-cu11 11.7.99 pypi_0 pypi
nvidia-cudnn-cu11 8.5.0.96 pypi_0 pypi
nwbinspector 0.4.37 pypi_0 pypi
nwbwidgets 0.11.3 pypi_0 pypi
ome-types 0.5.1.post1 pypi_0 pypi
opencv-python 4.10.0.82 pypi_0 pypi
opencv-python-headless 4.10.0.82 pypi_0 pypi
openjpeg 2.4.0 h3ad879b_0
openpyxl 3.1.3 pypi_0 pypi
openssl 3.0.13 h7f8727e_2
ophys-nway-matching 0.6.0 pypi_0 pypi
opt-einsum 3.3.0 pypi_0 pypi
overrides 7.4.0 py310h06a4308_0
packaging 23.2 py310h06a4308_0
pandas 1.5.2 pypi_0 pypi
pandocfilters 1.5.0 pyhd3eb1b0_0
paramiko 3.4.0 pypi_0 pypi
parso 0.8.3 pyhd3eb1b0_0
partd 1.4.2 pypi_0 pypi
patsy 0.5.6 pypi_0 pypi
pcre2 10.42 hebb0a14_1
perl 5.34.0 h5eee18b_2
pexpect 4.8.0 pyhd3eb1b0_3
pillow 9.3.0 pypi_0 pypi
pip 24.0 py310h06a4308_0
platformdirs 3.10.0 py310h06a4308_0
plotly 5.13.1 pypi_0 pypi
pluggy 1.5.0 pypi_0 pypi
ply 3.11 py310h06a4308_0
pooch 1.8.2 pypi_0 pypi
prometheus_client 0.14.1 py310h06a4308_0
prompt-toolkit 3.0.43 py310h06a4308_0
prompt_toolkit 3.0.43 hd3eb1b0_0
psutil 5.9.0 py310h5eee18b_0
ptyprocess 0.7.0 pyhd3eb1b0_2
pure_eval 0.2.2 pyhd3eb1b0_0
py 1.11.0 pypi_0 pypi
py-cpuinfo 9.0.0 pypi_0 pypi
py2vega 0.6.1 pypi_0 pypi
pyarrow 16.1.0 pypi_0 pypi
pycparser 2.21 pyhd3eb1b0_0
pycryptodomex 3.20.0 pypi_0 pypi
pydantic 2.7.3 pypi_0 pypi
pydantic-compat 0.1.2 pypi_0 pypi
pydantic-core 2.18.4 pypi_0 pypi
pygments 2.15.1 py310h06a4308_1
pyinspect 0.1.0 pypi_0 pypi
pymcubes 0.1.4 pypi_0 pypi
pynacl 1.5.0 pypi_0 pypi
pynajax 0.1.3 pypi_0 pypi
pynapple 0.6.6 pypi_0 pypi
pynrrd 1.0.0 pypi_0 pypi
pynwb 2.2.0 pypi_0 pypi
pyout 0.7.3 pypi_0 pypi
pyparsing 3.0.9 py310h06a4308_0
pypdf2 3.0.1 pypi_0 pypi
pyqt 5.15.10 py310h6a678d5_0
pyqt5-sip 12.13.0 py310h5eee18b_0
pyqtgraph 0.13.7 pypi_0 pypi
pysocks 1.7.1 py310h06a4308_0
pytest 8.2.2 pypi_0 pypi
pytest-cov 5.0.0 pypi_0 pypi
python 3.10.14 h955ad1f_1
python-dateutil 2.9.0post0 py310h06a4308_2
python-fastjsonschema 2.16.2 py310h06a4308_0
python-json-logger 2.0.7 py310h06a4308_0
python-tzdata 2023.3 pyhd3eb1b0_0
pythreejs 2.4.2 pypi_0 pypi
pytz 2024.1 py310h06a4308_0
pywavelets 1.6.0 pypi_0 pypi
pyyaml 6.0.1 py310h5eee18b_0
pyzmq 25.1.2 py310h6a678d5_0
qt-main 5.15.2 h53bd1ea_10
qtconsole 5.5.2 pypi_0 pypi
qtpy 2.4.1 pypi_0 pypi
quantities 0.14.1 pypi_0 pypi
rastermap 0.1.3 pypi_0 pypi
readline 8.2 h5eee18b_0
referencing 0.30.2 py310h06a4308_0
remfile 0.1.10 pypi_0 pypi
requests 2.32.2 py310h06a4308_0
resource-backed-dask-array 0.1.0 pypi_0 pypi
retry 0.9.2 pypi_0 pypi
rfc3339-validator 0.1.4 py310h06a4308_0
rfc3986-validator 0.1.1 py310h06a4308_0
rfc3987 1.3.8 pypi_0 pypi
rich 13.7.1 pypi_0 pypi
roifile 2024.5.24 pypi_0 pypi
rpds-py 0.10.6 py310hb02cf49_0
ruamel-yaml 0.18.6 pypi_0 pypi
ruamel-yaml-clib 0.2.8 pypi_0 pypi
s3fs 2024.6.0 pypi_0 pypi
sbxreader 0.2.2 pypi_0 pypi
scanimage-tiff-reader 1.4.1.4 pypi_0 pypi
scikit-image 0.19.3 pypi_0 pypi
scikit-learn 1.5.0 pypi_0 pypi
scipy 1.9.3 pypi_0 pypi
secretstorage 3.3.3 pypi_0 pypi
semantic-version 2.10.0 pypi_0 pypi
send2trash 1.8.2 py310h06a4308_0
setuptools 69.5.1 py310h06a4308_0
simpleitk 2.3.1 pypi_0 pypi
sip 6.7.12 py310h6a678d5_0
six 1.16.0 pyhd3eb1b0_1
slurmio 0.1.1 pypi_0 pypi
sniffio 1.3.0 py310h06a4308_0
soupsieve 2.5 py310h06a4308_0
sqlite 3.45.3 h5eee18b_0
ssm 0.0.1 pypi_0 pypi
stack_data 0.2.0 pyhd3eb1b0_0
statsmodels 0.14.0 pypi_0 pypi
suite2p 0.12.1 pypi_0 pypi
tables 3.9.2 pypi_0 pypi
tabulate 0.9.0 pypi_0 pypi
tbb 2021.8.0 hdb19cb5_0
tenacity 8.3.0 pypi_0 pypi
tensortools 0.4 pypi_0 pypi
terminado 0.17.1 py310h06a4308_0
threadpoolctl 3.5.0 pypi_0 pypi
tifffile 2024.5.22 pypi_0 pypi
tinycss2 1.2.1 py310h06a4308_0
tk 8.6.14 h39e8969_0
tomli 2.0.1 py310h06a4308_0
toolz 0.12.1 pypi_0 pypi
torch 1.13.1 pypi_0 pypi
tornado 6.3.3 py310h5eee18b_0
tqdm 4.66.4 pypi_0 pypi
traitlets 5.7.1 py310h06a4308_0
traittypes 0.2.1 pypi_0 pypi
treelib 1.7.0 pypi_0 pypi
trimesh 4.4.1 pypi_0 pypi
types-python-dateutil 2.9.0.20240316 pypi_0 pypi
typing-extensions 4.11.0 py310h06a4308_0
typing_extensions 4.11.0 py310h06a4308_0
tzdata 2024a h04d1e81_0
unicodedata2 15.1.0 py310h5eee18b_0
uri-template 1.3.0 pypi_0 pypi
urllib3 2.2.1 py310h06a4308_0
vedo 2024.5.1 pypi_0 pypi
vtk 9.3.0 pypi_0 pypi
wcwidth 0.2.5 pyhd3eb1b0_0
webcolors 24.6.0 pypi_0 pypi
webencodings 0.5.1 py310h06a4308_1
websocket-client 1.8.0 py310h06a4308_0
wheel 0.43.0 py310h06a4308_0
widgetsnbextension 4.0.10 py310h06a4308_0
wrapt 1.16.0 pypi_0 pypi
xarray 2024.3.0 pypi_0 pypi
xmltodict 0.13.0 pypi_0 pypi
xsdata 24.3.1 pypi_0 pypi
xz 5.4.6 h5eee18b_1
yaml 0.2.5 h7b6447c_0
yarl 1.9.3 py310h5eee18b_0
zarr 2.18.2 pypi_0 pypi
zarr-checksum 0.4.0 pypi_0 pypi
zeromq 4.3.5 h6a678d5_0
zipp 3.19.2 pypi_0 pypi
zlib 1.2.13 h5eee18b_1
zstandard 0.22.0 pypi_0 pypi
zstd 1.5.5 hc292b87_2
```

kabilar commented 1 month ago

Thanks for the report, @vncntprvst. I think this is because version_id is not specified in the line:

map_ephys_ds = client.get_dandiset(dandiset_id)

Based on the docs for get_dandiset, the script would then connect to the October 12 published version:

If version_id is not specified, the RemoteDandiset’s version is set to the most recent published version if there is one, otherwise to the draft version.
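
To avoid that default, the version can be passed explicitly. The following is only a sketch: it assumes the identifier embedded in the DOI from the reproduction script (0.230822.0128) is the version the publication links to, and the helper names `parse_dandi_doi` and `open_pinned_dandiset` are hypothetical, not part of the dandi package. The dandi import is deferred into the function so the snippet loads without the package or network access.

```python
DOI_URL = "https://doi.org/10.48324/dandi.000363/0.230822.0128"


def parse_dandi_doi(doi_url):
    """Split a DANDI DOI URL into (dandiset_id, version_id).

    Hypothetical helper; assumes the DOI ends in "dandi.<id>/<version>".
    """
    tail = doi_url.rstrip("/").split("dandi.", 1)[1]
    dandiset_id, version_id = tail.split("/", 1)
    return dandiset_id, version_id


def open_pinned_dandiset(doi_url=DOI_URL):
    """Return a RemoteDandiset fixed to the version named in the DOI."""
    # Deferred import: requires the dandi package and network access.
    from dandi.dandiapi import DandiAPIClient

    dandiset_id, version_id = parse_dandi_doi(doi_url)
    client = DandiAPIClient()
    # An explicit version_id overrides the "most recent published version" default.
    return client.get_dandiset(dandiset_id, version_id)
```

Calling `open_pinned_dandiset()` in place of `client.get_dandiset(dandiset_id)` should keep the download tied to the cited version even after newer versions are published.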

For reference, the August 21 published version (linked above) totals 53.6 GB, whereas the October 12 published version totals 65.7 TB.
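
The per-version totals can also be checked from Python. This is a rough sketch, assuming the `Version` objects yielded by `RemoteDandiset.get_versions()` expose `identifier` and `size` (in bytes); `human_size` and `version_sizes` are hypothetical helpers, and the decimal units are an assumption that may not match the rounding on the web page.

```python
def human_size(num_bytes):
    """Format a byte count with decimal (SI) units, e.g. '53.6 GB'."""
    units = ["B", "kB", "MB", "GB", "TB", "PB"]
    size = float(num_bytes)
    for unit in units:
        # Stop at the first unit that keeps the number below 1000,
        # or at the largest unit available.
        if size < 1000 or unit == units[-1]:
            return f"{size:.1f} {unit}"
        size /= 1000


def version_sizes(dandiset_id):
    """Return [(version identifier, size in bytes)] for each version of a Dandiset."""
    # Deferred import: requires the dandi package and network access.
    from dandi.dandiapi import DandiAPIClient

    with DandiAPIClient() as client:
        dandiset = client.get_dandiset(dandiset_id)
        return [(v.identifier, v.size) for v in dandiset.get_versions()]
```

Printing `human_size(size)` for each pair returned by `version_sizes("000363")` would show whether the sizes differ between versions in the way described above.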

vncntprvst commented 1 month ago

ooooh, that makes sense. In the publication, the link is to the August 21 version. I didn't notice the updated version 😅. Thanks for the explanation!

kabilar commented 1 month ago

Thanks @vncntprvst. We will be working on making it clearer on the Dandiset Landing Pages when there is a newer version.