SIMEXP / Repo2Data

Automatic data fetcher from the web
MIT License

repo2data retries the download even if the data is already downloaded #18

Closed by ltetrel 2 years ago

ltetrel commented 2 years ago

@tsalo

In order to test it, I need to use my university HPC. The processing nodes on the HPC don't have internet access, and it seems like repo2data will try to access the data repository even when there's a copy of the data available locally. I can include if/else statements throughout the book to work around this, I guess, but it would make the code less readable.
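For reference, the if/else workaround described above might look like the sketch below. The paths and the requirement-file location are hypothetical, not taken from the book's actual code:

```python
import os

# Hypothetical paths; the real layout depends on the book's data_requirement.json
DATA_DIR = "./data/nimare-paper"
REQUIREMENT_FILE = "./binder/data_requirement.json"

def get_data():
    # Skip the fetch entirely when a local copy already exists,
    # so the book can build on HPC nodes without internet access.
    if os.path.isdir(DATA_DIR) and os.listdir(DATA_DIR):
        return DATA_DIR
    # Only import and call repo2data when a download is actually needed.
    from repo2data.repo2data import Repo2Data
    return Repo2Data(REQUIREMENT_FILE).install()
```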

ltetrel commented 2 years ago

@tsalo As you can see here: https://github.com/SIMEXP/Repo2Data/blob/4d09e23b3966b505da29296e04060148c3516f7f/repo2data/repo2data.py#L127-L141 I check whether a data_requirement file already exists, and if its content matches the target user requirement file, the download is bypassed; so no internet access is required once the data has been downloaded.

Did you change the data_requirement.json file? Can you send me the layout of your downloaded directory, the content of the downloaded data_requirement.json, and the target data_requirement.json?
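The cache check linked above works roughly like the following sketch (a simplified illustration, not repo2data's actual implementation): the requirement file saved next to the downloaded data is compared with the target requirement file, and the download is skipped when they match.

```python
import json
import os

def already_downloaded(target_req_path, dst_dir):
    """Return True if a data_requirement.json cached in dst_dir matches the
    target requirement file, so the download can be skipped.
    (A sketch of the linked check, not repo2data's exact code.)"""
    cached = os.path.join(dst_dir, "data_requirement.json")
    if not os.path.isfile(cached):
        return False  # nothing downloaded yet
    with open(target_req_path) as f_target, open(cached) as f_cached:
        # Compare parsed JSON so formatting differences do not force a re-download.
        return json.load(f_target) == json.load(f_cached)
```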

ltetrel commented 2 years ago

I am also trying on my end with your requirement.

ltetrel commented 2 years ago

There is an error when downloading with osfclient; it might be a timeout issue. Were you able to download with repo2data entirely, without errors? Unfortunately, if there is an issue with the osf fetcher or with your data, there is nothing more I can do...


```
  File "/srv/conda/envs/notebook/bin/osf", line 8, in <module>
    sys.exit(main())
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/osfclient/__main__.py", line 104, in main
    exit_code = args.func(args)
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/osfclient/cli.py", line 91, in wrapper
    return_value = f(cli_args)
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/osfclient/cli.py", line 167, in clone
    file_.write_to(f)
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/osfclient/models/file.py", line 57, in write_to
    int(response.headers['Content-Length']))
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/requests/structures.py", line 54, in __getitem__
    return self._store[key.lower()][1]
KeyError: 'content-length'
Traceback (most recent call last):
  File "/srv/conda/envs/notebook/bin/repo2data", line 60, in <module>
    main()
  File "/srv/conda/envs/notebook/bin/repo2data", line 57, in main
    repo2data.install()
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/repo2data/repo2data.py", line 75, in install
    ret += [Repo2DataChild(self._data_requirement_file, self._use_server).install()]
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/repo2data/repo2data.py", line 249, in install
    self._scan_dl_type()
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/repo2data/repo2data.py", line 238, in _scan_dl_type
    self._osf_download()
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/repo2data/repo2data.py", line 211, in _osf_download
    , self._dst_path])
  File "/srv/conda/envs/notebook/lib/python3.7/subprocess.py", line 363, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['osf', '--project', 't8h9c', 'clone', './data/nimare-paper']' returned non-zero exit status 1.
```
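The KeyError above comes from indexing a response header that the server did not send: `response.headers['Content-Length']` raises when the header is absent (e.g. with chunked transfer encoding). A defensive pattern (an illustration, not osfclient's actual fix) is to fall back to `.get()`:

```python
def content_length(response):
    """Return the declared body size in bytes, or None when the server
    omits the Content-Length header (sketch, not osfclient's code)."""
    value = response.headers.get("Content-Length")  # .get() avoids the KeyError
    return int(value) if value is not None else None
```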
tsalo commented 2 years ago

I got the same error when trying to download. I previously only started the download and cancelled it once I thought it would run through successfully, but I was clearly wrong. It failed after ~30 minutes, I think, so it didn't run for the full hour.

I'm wondering if it would make more sense to either (1) switch to the googledrive fetcher instead of OSF, or (2) zip everything into a single file? I could check for the data folder and the zipped file at the beginning of each book script. It won't be pretty, but it might at least work...

ltetrel commented 2 years ago

Yes, that is what I would suggest: try with gdrive. Zipping can indeed help with the download here; however, repo2data will unzip only if the top-level folder itself is archived (i.e., in your case, the Google Drive folder). https://github.com/SIMEXP/Repo2Data/blob/4d09e23b3966b505da29296e04060148c3516f7f/repo2data/repo2data.py#L105-L107

At first I wanted to scan the whole directory content and unzip each file one by one, but I was afraid it would unzip too much (.nii.gz files, for example) and take too much time.
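The top-level-only behavior described above can be sketched like this (an illustration under the stated assumption, not repo2data's exact code): only when the downloaded item is itself a single zip archive is it extracted, so compressed files inside the data tree (e.g. .nii.gz) stay compressed.

```python
import os
import zipfile

def maybe_unzip_top_level(dst_path):
    """Extract dst_path only when the download itself is a zip archive;
    leave everything else (including .nii.gz files inside a data tree)
    untouched. Returns the directory or file to use afterwards."""
    if os.path.isfile(dst_path) and zipfile.is_zipfile(dst_path):
        out_dir = os.path.splitext(dst_path)[0]
        with zipfile.ZipFile(dst_path) as zf:
            zf.extractall(out_dir)
        return out_dir
    return dst_path
```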

ltetrel commented 2 years ago

Also, I saw that on OSF there was a notice saying that they are experiencing a lot of spam. You may want to check with the admins, just in case: (screenshot of the OSF notice)

tsalo commented 2 years ago

Thanks! I've zipped the data files into a single file and uploaded it to Google Drive, and then I replaced the OSF repo URL in the data requirement file with the Google Drive file URL. I just submitted to RoboNeuro again. 🤞

EDIT: I started a build locally (had to stop because my laptop can't run all of the analyses) and the data files looked good. The compressed files within the data folder (e.g., .nii.gz files) were still compressed.
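The resulting requirement file might look roughly like this (keys as described in the repo2data README, from memory; the Google Drive file ID and project name are placeholders, not the actual values used here):

```json
{
    "src": "https://drive.google.com/uc?id=<FILE_ID>",
    "dst": "./data",
    "projectName": "nimare-paper"
}
```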

ltetrel commented 2 years ago

I will close this issue since repo2data behaved correctly (it retries the download because the previous one failed).