It looks like it's this issue: https://github.com/huggingface/datasets/issues/5717
The suggested solution in that issue unfortunately doesn't fix it for me. I'll detail some more of my digging over in the datasets repo.
Though honestly, it might not be worth waiting for the fix: the only benefits of saving the concatenated dataset seem to be a possible minor data-loading speedup and reducing these lines:
from datasets import DatasetDict, Image, Sequence, concatenate_datasets, load_dataset

# Public portion of the challenge data from the Hub.
dataset = load_dataset("StanfordAIMI/interpret-cxr-public")

# Locally generated MIMIC-CXR splits; cast the image paths to actual images.
dataset_mimic = load_dataset(
    "json",
    data_files={
        "train": "train_mimic.json",
        "validation": "val_mimic.json",
    },
).cast_column("images", Sequence(Image()))

# Concatenate the two sources per split.
dataset_final = DatasetDict(
    {
        "train": concatenate_datasets([dataset["train"], dataset_mimic["train"]]),
        "validation": concatenate_datasets(
            [dataset["validation"], dataset_mimic["validation"]]
        ),
    }
)
to just this line:
dataset = load_dataset("/path/to/concatenated/dataset")
A great and obvious solution :)
Hello,
So it looks like it starts failing when it's processing the samples from MIMIC. Are you sure the "files" folder is structured as follows:
files/
    p10/
        p10000032/
            s50414267/
                02aa804e-bde0afdd-112c0b34-7bc16630-4e384014.jpg
                174413ec-4ec4c1f7-34ea26b7-c5f994f8-79ef1962.jpg
            s53189527/
                2a2277a9-b0ded155-c0de8eb9-c124d10e-82c5caab.jpg
                e084de3b-be89b11e-20fe3f9f-9c8d8dfe-4cfd202c.jpg
            s53911762/
                68b5c4b1-227d0485-9cc38c3f-7b84ab51-4b472714.jpg
                fffabebf-74fd3a1f-673b6b41-96ec0ac9-2ab69818.jpg
            s56699142/
                ea030e7a-2e3b1346-bc518786-7a8fd698-f673b44c.jpg
As in https://physionet.org/content/mimic-cxr-jpg/2.0.0/
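If it helps, here is a quick sanity check (just a sketch, not part of the challenge scripts) to confirm the layout and image count under "files/":

from pathlib import Path

# Count the JPEGs under files/ to confirm the MIMIC-CXR-JPG download is complete.
n_images = sum(1 for _ in Path("files").rglob("*.jpg"))
print(f"{n_images} JPEG files found under files/")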
I believe the error comes from the actual processing of the "image" field, which goes from a string (the path) to an image. I ran the script a few hours ago; I'm at this stage:
Saving the dataset (105/147 shards): 73%|███ | 401147/550395 [1:10:57<1:33:02, 26.73 examples/s]
So it looks like it's working on my side.
Another solution, I guess, would be to use push_to_hub on a private repo instead, but I would expect the same result.
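For what it's worth, a minimal sketch of that alternative, assuming you are logged in via huggingface-cli login and that "your-username/interpret-cxr-concat" is a placeholder repo name:

# Push the concatenated DatasetDict from the earlier snippet to a private Hub repo...
dataset_final.push_to_hub("your-username/interpret-cxr-concat", private=True)

# ...and load it back later like any other Hub dataset.
from datasets import load_dataset
dataset = load_dataset("your-username/interpret-cxr-concat")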
Hi JB,
Thanks for the reply.
I double-checked the structure of the files directory, and it seems normal:
virga-login task_1$ ls
dataset_dict.json files make-interpret-mimic-cxr.py mimic-cxr-2.0.0-chexpert.csv.gz mimic-cxr-2.0.0-metadata.csv mimic-cxr-2.0.0-negbio.csv.gz mimic-cxr-2.0.0-split.csv mimic_cxr_sectioned.csv train train_mimic.json val_mimic.json
virga-login task_1$ ls files/p10/p10000032/s50414267/
02aa804e-bde0afdd-112c0b34-7bc16630-4e384014.jpg 174413ec-4ec4c1f7-34ea26b7-c5f994f8-79ef1962.jpg
virga-login task_1$ ls files/p10/p10000032/s53189527/
2a2277a9-b0ded155-c0de8eb9-c124d10e-82c5caab.jpg e084de3b-be89b11e-20fe3f9f-9c8d8dfe-4cfd202c.jpg
You're right that the error only occurs when you get to the MIMIC images, specifically at this index (I could reproduce the error at this exact step, same as @anicolson):
Saving the dataset (89/147 shards): 61%|██████████████▌ | 333243/550395 [02:38<01:43, 2099.41 examples/s]
The issue is everything I've detailed over at https://github.com/huggingface/datasets/issues/5717. The tl;dr: datasets tries to read a batch of 1000 images into a pyarrow byte array, which overflows. That in itself is fine, as pyarrow then chunks the byte array. But the datasets function then tries to derive a boolean mask from the chunked array, and a chunked boolean array is not a valid input to the subsequent pyarrow call, hence the TypeError. Unfortunately, the mechanism for controlling the batch size in datasets is not respected.
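Just to illustrate the overflow mechanism (this is my reading of the issue, not code from the challenge, and it needs roughly 3 GiB of free RAM to run):

import pyarrow as pa

# Each "image" here is a 512 MiB blob; six of them exceed the ~2 GiB offset
# limit of the 32-bit binary type, so pa.array() returns a ChunkedArray
# instead of a plain BinaryArray.
blob = b"\x00" * (512 * 1024 * 1024)
arr = pa.array([blob] * 6, type=pa.binary())
print(type(arr))  # pyarrow.lib.ChunkedArray rather than pyarrow.lib.BinaryArray

# datasets then derives a null mask from this (now chunked) array and passes it
# to a pyarrow call that only accepts a plain boolean array, hence the TypeError.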
I did implement a local fix in datasets to respect the batch size. I got the concatenated dataset saved, but loading the concatenated dataset takes over an hour... I think the simpler approach of concatenating and loading on the fly might be the more performant solution, as sketched below.
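For reference, a minimal sketch of what I mean by concatenating on the fly: build the DatasetDict inside the training script (same calls as the snippet earlier in the thread) instead of saving a copy to disk. The function name build_interpret_cxr is just a placeholder:

from datasets import DatasetDict, Image, Sequence, concatenate_datasets, load_dataset

def build_interpret_cxr():
    # Public split from the Hub plus the locally generated MIMIC-CXR JSON files.
    public = load_dataset("StanfordAIMI/interpret-cxr-public")
    mimic = load_dataset(
        "json",
        data_files={"train": "train_mimic.json", "validation": "val_mimic.json"},
    ).cast_column("images", Sequence(Image()))
    return DatasetDict(
        {
            "train": concatenate_datasets([public["train"], mimic["train"]]),
            "validation": concatenate_datasets(
                [public["validation"], mimic["validation"]]
            ),
        }
    )

# Called at the start of training instead of load_dataset("/path/to/concatenated/dataset").
dataset = build_interpret_cxr()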
On my side, there is no problem; it does take time though:
Saving the dataset (147/147 shards): 100%|██████████| 550395/550395 [7:28:42<00:00, 20.44 examples/s]
working with:
datasets 2.17.1
Hi,
Thanks for organising this challenge :)
I am having an issue with the dataset when I run the following:
I get the following error:
The contents of the dataset directory at the moment:
`make-interpret-mimic-cxr.py`, e.g.:
Any idea what caused this?