PolymathicAI / AstroCLIP

Multimodal contrastive pretraining for astronomical data
MIT License

dataset download always times out #18

Closed jere1882 closed 1 month ago

jere1882 commented 1 month ago

The instructions in the README say to use the Hugging Face interface to download the dataset:

dset = load_dataset('astroclip/data/dataset.py')

This always fails with a timeout error shortly after starting the download.

    raise asyncio.TimeoutError from None
asyncio.exceptions.TimeoutError
(...)
    raise FSTimeoutError from return_result
fsspec.exceptions.FSTimeoutError

There are many reports across the web about OpenSLR having timeout issues. Can you please provide an alternative way to download the data?

lsarra commented 1 month ago

Hi, thank you for pointing this out! I believe the problem occurs when downloading from servers other than HuggingFace's (we are not using OpenSLR but the internal Flatiron server). I tested it with a longer timeout and it no longer seems to crash (see PR #19).

Otherwise, you can manually download it from https://users.flatironinstitute.org/~flanusse/astroclip_desi.1.1.5.h5, e.g.

wget  https://users.flatironinstitute.org/~flanusse/astroclip_desi.1.1.5.h5
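
If wget is not available, or the connection also drops there, the same manual download can be sketched in Python using only the standard library. This is a minimal sketch, not what PR #19 does: `download_with_retries` and its `timeout`, `retries`, and `backoff` parameters are hypothetical names chosen for illustration, and the defaults are guesses rather than tested values.

```python
import http.client
import os
import shutil
import time
import urllib.request

# URL quoted in this thread.
DATA_URL = "https://users.flatironinstitute.org/~flanusse/astroclip_desi.1.1.5.h5"


def download_with_retries(url, dest, timeout=600, retries=5, backoff=10):
    """Download `url` to `dest`, retrying on network errors (hypothetical helper)."""
    # Write to a temporary name so a partial file is never mistaken for a complete one.
    part = dest + ".part"
    for attempt in range(1, retries + 1):
        try:
            # `timeout` applies to each blocking socket operation, not to the whole transfer.
            with urllib.request.urlopen(url, timeout=timeout) as resp, open(part, "wb") as out:
                shutil.copyfileobj(resp, out, length=1 << 20)  # copy in 1 MiB chunks
            os.replace(part, dest)  # rename only after the copy finished without error
            return dest
        except (OSError, http.client.HTTPException) as err:
            # URLError and socket timeouts are OSError subclasses;
            # a connection dropped mid-read may surface as an HTTPException.
            if attempt == retries:
                raise
            print(f"attempt {attempt} failed ({err}); retrying in {backoff}s")
            time.sleep(backoff)
```

Calling `PATH = download_with_retries(DATA_URL, "astroclip_desi.1.1.5.h5")` would then give the local path to use in place of PATH. Each retry restarts the download from the beginning, which keeps the sketch short but is wasteful for a large file.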

Then, if you stored the file at PATH, you can change the line in astroclip/data/dataset.py at https://github.com/PolymathicAI/AstroCLIP/blob/e26c3704f32b35acc13bc462d95739ece41a23ea/astroclip/data/dataset.py#L78

to

data_dir = dl_manager.extract(PATH)

EiffL commented 1 month ago

#19 solves this issue, thanks for reporting it! And thanks @lsarra for implementing a fix.