It looks like it's this issue: https://github.com/huggingface/datasets/issues/5717
The suggested solution in that issue unfortunately doesn't fix it for me. I'll detail some more of my digging over in the datasets repo.
Though honestly, it might not be worth waiting for the fix: the only benefits of saving the concatenated dataset seem to be a possible minor data-loading speedup and reducing these lines:
from datasets import DatasetDict, Image, Sequence, concatenate_datasets, load_dataset

# Public portion of the challenge data from the Hub.
dataset = load_dataset("StanfordAIMI/interpret-cxr-public")

# Locally generated MIMIC-CXR splits; cast the image paths to actual images.
dataset_mimic = load_dataset(
    "json",
    data_files={
        "train": "train_mimic.json",
        "validation": "val_mimic.json",
    },
).cast_column("images", Sequence(Image()))

# Concatenate the two sources per split.
dataset_final = DatasetDict(
    {
        "train": concatenate_datasets([dataset["train"], dataset_mimic["train"]]),
        "validation": concatenate_datasets(
            [dataset["validation"], dataset_mimic["validation"]]
        ),
    }
)
to just this line:
dataset = load_dataset("/path/to/concatenated/dataset")
A great and obvious solution :)
Hello,
So it looks like it starts failing when it's processing the samples from MIMIC. Are you sure the "files" folder is structured as follows:
files/
    p10/
        p10000032/
            s50414267/
                02aa804e-bde0afdd-112c0b34-7bc16630-4e384014.jpg
                174413ec-4ec4c1f7-34ea26b7-c5f994f8-79ef1962.jpg
            s53189527/
                2a2277a9-b0ded155-c0de8eb9-c124d10e-82c5caab.jpg
                e084de3b-be89b11e-20fe3f9f-9c8d8dfe-4cfd202c.jpg
            s53911762/
                68b5c4b1-227d0485-9cc38c3f-7b84ab51-4b472714.jpg
                fffabebf-74fd3a1f-673b6b41-96ec0ac9-2ab69818.jpg
            s56699142/
                ea030e7a-2e3b1346-bc518786-7a8fd698-f673b44c.jpg
As in https://physionet.org/content/mimic-cxr-jpg/2.0.0/
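If it helps, here is a quick sanity check (just a sketch, not part of the challenge scripts) to confirm the layout and image count under "files/":

from pathlib import Path

# Count the JPEGs under files/ to confirm the MIMIC-CXR-JPG download is complete.
n_images = sum(1 for _ in Path("files").rglob("*.jpg"))
print(f"{n_images} JPEG files found under files/")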
I believe the error comes from the actual processing of the "image" field, which goes from a string (the path) to an image. I ran the script a few hours ago; I'm at this stage:
Saving the dataset (105/147 shards): 73%|███ | 401147/550395 [1:10:57<1:33:02, 26.73 examples/s]
So it looks like it's working on my side.
Another solution, I guess, would be to use push_to_hub on a private repo instead, but I would expect the same result.
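For what it's worth, a minimal sketch of that alternative, assuming you are logged in via huggingface-cli login and that "your-username/interpret-cxr-concat" is a placeholder repo name:

# Push the concatenated DatasetDict from the earlier snippet to a private Hub repo...
dataset_final.push_to_hub("your-username/interpret-cxr-concat", private=True)

# ...and load it back later like any other Hub dataset.
from datasets import load_dataset
dataset = load_dataset("your-username/interpret-cxr-concat")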
Hi JB,
Thanks for the reply.
I double-checked the structure of the files directory, and it seems normal:
virga-login task_1$ ls
dataset_dict.json files make-interpret-mimic-cxr.py mimic-cxr-2.0.0-chexpert.csv.gz mimic-cxr-2.0.0-metadata.csv mimic-cxr-2.0.0-negbio.csv.gz mimic-cxr-2.0.0-split.csv mimic_cxr_sectioned.csv train train_mimic.json val_mimic.json
virga-login task_1$ ls files/p10/p10000032/s50414267/
02aa804e-bde0afdd-112c0b34-7bc16630-4e384014.jpg 174413ec-4ec4c1f7-34ea26b7-c5f994f8-79ef1962.jpg
virga-login task_1$ ls files/p10/p10000032/s53189527/
2a2277a9-b0ded155-c0de8eb9-c124d10e-82c5caab.jpg e084de3b-be89b11e-20fe3f9f-9c8d8dfe-4cfd202c.jpg
You're right that the error only occurs when you get to the MIMIC images, specifically at this index (I could reproduce the error at this exact step, same as @anicolson):
Saving the dataset (89/147 shards): 61%|██████████████▌ | 333243/550395 [02:38<01:43, 2099.41 examples/s]
The issue is everything I've detailed over at https://github.com/huggingface/datasets/issues/5717. The tl;dr: datasets tries to read a batch of 1000 images into a pyarrow byte array, which overflows. That in itself is fine, as pyarrow then chunks the byte array. But the datasets function then tries to derive a boolean mask from the chunked array, and a chunked boolean array is not a valid input to the subsequent pyarrow call, hence the TypeError. Unfortunately, the mechanism for controlling the batch size in datasets is not respected.
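Just to illustrate the overflow mechanism (this is my reading of the issue, not code from the challenge, and it needs roughly 3 GiB of free RAM to run):

import pyarrow as pa

# Each "image" here is a 512 MiB blob; six of them exceed the ~2 GiB offset
# limit of the 32-bit binary type, so pa.array() returns a ChunkedArray
# instead of a plain BinaryArray.
blob = b"\x00" * (512 * 1024 * 1024)
arr = pa.array([blob] * 6, type=pa.binary())
print(type(arr))  # pyarrow.lib.ChunkedArray rather than pyarrow.lib.BinaryArray

# datasets then derives a null mask from this (now chunked) array and passes it
# to a pyarrow call that only accepts a plain boolean array, hence the TypeError.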
I did implement a local fix in datasets to respect the batch size. I got the concatenated dataset saved, but loading the concatenated dataset takes over an hour... I think the simpler approach of concatenating and loading on the fly might be the more performant solution, as sketched below.
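For reference, a minimal sketch of what I mean by concatenating on the fly: build the DatasetDict inside the training script (same calls as the snippet earlier in the thread) instead of saving a copy to disk. The function name build_interpret_cxr is just a placeholder:

from datasets import DatasetDict, Image, Sequence, concatenate_datasets, load_dataset

def build_interpret_cxr():
    # Public split from the Hub plus the locally generated MIMIC-CXR JSON files.
    public = load_dataset("StanfordAIMI/interpret-cxr-public")
    mimic = load_dataset(
        "json",
        data_files={"train": "train_mimic.json", "validation": "val_mimic.json"},
    ).cast_column("images", Sequence(Image()))
    return DatasetDict(
        {
            "train": concatenate_datasets([public["train"], mimic["train"]]),
            "validation": concatenate_datasets(
                [public["validation"], mimic["validation"]]
            ),
        }
    )

# Called at the start of training instead of load_dataset("/path/to/concatenated/dataset").
dataset = build_interpret_cxr()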
On my side, there is no problem; it does take time though:
Saving the dataset (147/147 shards): 100%|██████████| 550395/550395 [7:28:42<00:00, 20.44 examples/s]
working with:
datasets 2.17.1
Hi,
Thanks for organising this challenge :)
I am having an issue with the dataset when I run the following:
I get the following error:
The contents of the dataset directory at the moment:
`make-interpret-mimic-cxr.py`, e.g.:
Any idea what caused this?