CONP-PCNO / conp-portal

:bar_chart: The CONP data portal
https://portal.conp.ca/
MIT License

Downloading BigBrain still causes problems #408

Closed prioux closed 3 years ago

prioux commented 3 years ago

The datalad registration of the files in the BigBrain dataset contains erroneous remotes and paths. This causes datalad to attempt several connections that fail before it can finally get the file contents.

This can be seen by running the 'git annex whereis' command on some files. For example:

git annex whereis full16_1000um_optbal.mnc 
whereis full16_1000um_optbal.mnc (3 copies) 
        00000000-0000-0000-0000-000000000001 -- web
        9a137376-4816-4301-afc5-63f5d0ecd36f -- eaobrien@datalad-dev.conp.ca:/data/temp-datasets/emmetaobrien/bigbrain-datalad
        df4f69b6-5f50-4558-a466-4c0b8419de52 -- emmet@emmet-VirtualBox:~/conp-dataset/projects/BigBrain

  web: ftp://bigbrain.loris.ca/BigBrainRelease.2015/3D_Volumes/Histological_Space/full16_1000um_optbal.mnc
  web: ftp://bigbrain.loris.ca/BigBrainRelease.2015/3D_Volumes/Histological_Space/mnc/full16_1000um_optbal.mnc

This shows there are two source entries referring to Emmet's own local setups (which are irrelevant to the outside world and should not be published), but also that two paths for the content on the web remote are specified: first with the old WRONG path and then with the new path under mnc/.
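
The same check can be scripted. Below is a minimal Python sketch, assuming the plain-text layout of 'git annex whereis' shown above. The names `parse_whereis` and `SAMPLE`, and the "URL must contain /mnc/" rule, are illustrative assumptions specific to this file; they are not part of git-annex or of any CONP tooling.

```python
import re

# Remote lines look like "    <36-char uuid> -- <description>"
REMOTE_RE = re.compile(r"^\s+([0-9a-f-]{36}) -- (.+?)\s*$")
# URL lines look like "  <remote name>: <url>"
URL_RE = re.compile(r"^\s+([^\s:]+): (\S+)\s*$")


def parse_whereis(text):
    """Parse plain `git annex whereis` output.

    Returns (remotes, urls) where remotes is a list of (uuid, description)
    and urls is a list of (remote_name, url).
    """
    remotes, urls = [], []
    for line in text.splitlines():
        m = REMOTE_RE.match(line)
        if m:
            remotes.append((m.group(1), m.group(2)))
            continue
        m = URL_RE.match(line)
        if m:
            urls.append((m.group(1), m.group(2)))
    return remotes, urls


# Output copied from the report above.
SAMPLE = """whereis full16_1000um_optbal.mnc (3 copies)
        00000000-0000-0000-0000-000000000001 -- web
        9a137376-4816-4301-afc5-63f5d0ecd36f -- eaobrien@datalad-dev.conp.ca:/data/temp-datasets/emmetaobrien/bigbrain-datalad
        df4f69b6-5f50-4558-a466-4c0b8419de52 -- emmet@emmet-VirtualBox:~/conp-dataset/projects/BigBrain

  web: ftp://bigbrain.loris.ca/BigBrainRelease.2015/3D_Volumes/Histological_Space/full16_1000um_optbal.mnc
  web: ftp://bigbrain.loris.ca/BigBrainRelease.2015/3D_Volumes/Histological_Space/mnc/full16_1000um_optbal.mnc
"""

remotes, urls = parse_whereis(SAMPLE)
# Assumption for this file only: the valid URL is the one under mnc/.
stale_urls = [url for _, url in urls if "/mnc/" not in url]
print(len(remotes), "remotes;", "stale URLs:", stale_urls)
```

For this sample, the sketch finds three remotes and flags the single URL that lacks the mnc/ directory.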

cmadjar commented 3 years ago

@emmetaobrien @prioux I just removed the old URLs from the dataset using a script I created. The old web URLs should be removed now.

(datalad_python3) BigBrain $ git annex whereis 3D_Volumes/Histological_Space/mnc/full16_1000um_optbal.mnc 
whereis 3D_Volumes/Histological_Space/mnc/full16_1000um_optbal.mnc (1 copy) 
    00000000-0000-0000-0000-000000000001 -- web

  web: ftp://bigbrain.loris.ca/BigBrainRelease.2015/3D_Volumes/Histological_Space/mnc/full16_1000um_optbal.mnc
ok

I also removed Emmet's remotes using:

git annex dead 9a137376-4816-4301-afc5-63f5d0ecd36f
git annex dead df4f69b6-5f50-4558-a466-4c0b8419de52

So in theory, the only URL you should see now is the correct web URL pointing to the FTP site.
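
The script mentioned above is not shown in this thread. As a rough illustration only, the cleanup could be expressed as a dry run that prints the git-annex commands instead of executing them. The sketch relies on two existing git-annex commands, 'git annex rmurl <file> <url>' and 'git annex dead <uuid>'; the helper name `cleanup_commands` is hypothetical.

```python
import shlex


def cleanup_commands(path, stale_urls, dead_uuids):
    """Build the git-annex commands that drop stale URLs and retire remotes.

    Nothing is executed here; the caller decides whether to run the commands.
    """
    cmds = [["git", "annex", "rmurl", path, url] for url in stale_urls]
    cmds += [["git", "annex", "dead", uuid] for uuid in dead_uuids]
    return cmds


# Values taken from the comments above.
commands = cleanup_commands(
    "3D_Volumes/Histological_Space/mnc/full16_1000um_optbal.mnc",
    ["ftp://bigbrain.loris.ca/BigBrainRelease.2015/3D_Volumes/"
     "Histological_Space/full16_1000um_optbal.mnc"],
    ["9a137376-4816-4301-afc5-63f5d0ecd36f",
     "df4f69b6-5f50-4558-a466-4c0b8419de52"],
)
for cmd in commands:
    # Quote each argument so the printed line can be pasted into a shell.
    print(" ".join(shlex.quote(part) for part in cmd))
```

To apply the changes rather than print them, each list could be passed to `subprocess.run(cmd, check=True)` from inside the dataset directory, after reviewing the dry-run output.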

cmadjar commented 3 years ago

@emmetaobrien I suspect that for many, if not all, of our CONP datasets, our own remotes show up in git annex whereis...

I will create a separate ticket for clean up of the datasets and we can split them amongst us.
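
For that cleanup, one possible way to spot personal remotes is to look at the remote description: as far as I know, git-annex records a "user@host:path" description by default when a clone is initialised without an explicit one, which matches the entries seen in this thread. The sketch below is a heuristic under that assumption; `is_personal_remote` is a hypothetical helper, and a remote that was given a custom description would not be detected.

```python
import re

# Heuristic: "user@host:path", as in the descriptions quoted in this thread.
PERSONAL_REMOTE_RE = re.compile(r"^[^\s@\[\]]+@[^\s:]+:\S")


def is_personal_remote(description):
    """Return True if a whereis remote description looks like user@host:path."""
    return bool(PERSONAL_REMOTE_RE.match(description))


# Descriptions taken from the whereis output earlier in this issue.
descriptions = [
    "web",
    "eaobrien@datalad-dev.conp.ca:/data/temp-datasets/emmetaobrien/bigbrain-datalad",
    "emmet@emmet-VirtualBox:~/conp-dataset/projects/BigBrain",
]
for description in descriptions:
    label = "personal" if is_personal_remote(description) else "keep"
    print(label, description)
```

Remotes flagged this way are only candidates for 'git annex dead'; each one should still be checked by hand before it is retired.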

emmetaobrien commented 3 years ago

https://github.com/CONP-PCNO/conp-dataset/pull/524 removes unnecessary local setups for Brainspan, celltypes, the 3 Khanlab datasets and the 3 refseq datasets.

mcgill-emc-rna-seq-experiment still needs updating, but that is in @zxenia's repository to which I do not have access; also, visual-working-memory has a non-standard set of locations:

(base) eaobrien@datalad-dev:/data/temp-datasets/emmetaobrien/conp-dataset/projects/visual-working-memory$ git annex whereis sub-01/anat/sub-01_T1w.nii.gz
whereis sub-01/anat/sub-01_T1w.nii.gz (3 copies)
        70b2cac8-9c0e-4a11-91d7-6a42162b00cd -- root@af43181574ca:/datalad/ds001634
        b152df9c-eb7e-4b80-9311-b021f018fa8a -- [s3-PUBLIC]
        bead5592-156d-4587-a634-69e6a75986e0 -- s3-PRIVATE

  s3-PUBLIC: http://openneuro.org.s3.amazonaws.com/ds001634/sub-01/anat/sub-01_T1w.nii.gz?versionId=3tBK9WlrojB9h_6CMILUvp3BX7Sa_aSr
ok

cmadjar commented 3 years ago

This has been fixed so closing the issue.