OpenNeuroOrg / openneuro

A free and open platform for analyzing and sharing neuroimaging data
https://openneuro.org/
MIT License

Filenames with non-ascii characters confuse the database #1698

Open kousu opened 4 years ago

kousu commented 4 years ago

Describe the bug

A filename containing "unusual" (i.e. non-ASCII or control) characters is mis-parsed by the uploadFiles() mutation.

To Reproduce

  1. Get some sample data: openneuro download -s 1.0.0 ds002982 ds # but watch out for #1693; verify the download was clean with grep -ri git ds | grep -v Binary
  2. Put a newline into some of the filenames:
    (cd ds;
     mv participants.json partici$'\n'pants.json
     ls -l
     cd sub-glen/anat
     cp sub-glen_T2star.json sub-glen_T2$'\n'star.json 
     cp sub-glen_T2star.nii.gz sub-glen_T2$'\n'star.nii.gz
     ls -l)
  3. Upload: openneuro upload ds # accept the 'create a new dataset' prompt
  4. Look at https://openneuro.org/datasets/$DATASET to see that there are now two sub-glen folders.

Expected behavior

Files should keep exactly the names I upload them under.

Screenshots

$ openneuro download -s 1.0.0 ds002982 ds
Downloading ".bidsignore" - size 33 bytes
Downloading "CHANGES" - size 65 bytes
Downloading "README" - size 6 bytes
Downloading "dataset_description.json" - size 250 bytes
Downloading "participants.json" - size 389 bytes
Downloading "participants.tsv" - size 934 bytes
Downloading "sub-chiba750/anat/sub-chiba750_T2star.json" - size 788 bytes
Downloading "sub-glen/anat/sub-glen_T2star.json" - size 1495 bytes
Downloading "sub-chiba750/anat/sub-chiba750_T2star.nii.gz" - size 5543775 bytes
Downloading "sub-glen/anat/sub-glen_T2star.nii.gz" - size 1446307 bytes
$ grep -ri git ds | grep -v Binary
$ # got lucky, #1693 didn't occur this time
$ (cd ds;
>      mv participants.json partici$'\n'pants.json
>      ls -l
>      cd sub-glen/anat
>      cp sub-glen_T2star.json sub-glen_T2$'\n'star.json
>      cp sub-glen_T2star.nii.gz sub-glen_T2$'\n'star.nii.gz
>      ls -l)
total 28
-rw-r--r-- 1 kousu kousu   65 Jun 29 19:17  CHANGES
-rw-r--r-- 1 kousu kousu  250 Jun 29 19:17  dataset_description.json
-rw-r--r-- 1 kousu kousu  389 Jun 29 19:17 'partici'$'\n''pants.json'
-rw-r--r-- 1 kousu kousu  934 Jun 29 19:17  participants.tsv
-rw-r--r-- 1 kousu kousu    6 Jun 29 19:17  README
drwxr-xr-x 3 kousu kousu 4096 Jun 29 19:17  sub-chiba750
drwxr-xr-x 3 kousu kousu 4096 Jun 29 19:17  sub-glen
total 2848
-rw-r--r-- 1 kousu kousu    1495 Jun 29 19:17 'sub-glen_T2'$'\n''star.json'
-rw-r--r-- 1 kousu kousu    1495 Jun 29 19:17  sub-glen_T2star.json
-rw-r--r-- 1 kousu kousu 1446307 Jun 29 19:17 'sub-glen_T2'$'\n''star.nii.gz'
-rw-r--r-- 1 kousu kousu 1446307 Jun 29 19:17  sub-glen_T2star.nii.gz
$ 
$ openneuro upload ds
? This will create a new dataset, continue? Yes
bids-validator@1.5.3

        Summary:               Available Tasks:        Available Modalities: 
        5 Files, 5.29MB                                T2star                
        1 - Subject                                                          
        1 - Session                                                          

    If you have any questions, please post on https://neurostars.org/tags/bids.

"ds002987" created with label "ds"
Transferring "dataset_description.json" - 100% complete 
Transferring ".bidsignore" - 100% complete 
Transferring "CHANGES" - 100% complete 
Transferring "README" - 100% complete 
Transferring "partici
pants.json" - 100% complete 
Transferring "participants.tsv" - 100% complete 
Transferring "sub-chiba750/anat/sub-chiba750_T2star.json" - 100% complete 
Transferring "sub-chiba750/anat/sub-chiba750_T2star.nii.gz" - 100% complete 
Transferring "sub-glen/anat/sub-glen_T2
star.json" - 100% complete 
Transferring "sub-glen/anat/sub-glen_T2
star.nii.gz" - 100% complete 
Transferring "sub-glen/anat/sub-glen_T2star.json" - 100% complete 
Transferring "sub-glen/anat/sub-glen_T2star.nii.gz" - 100% complete 
=======================================================================
Upload Complete
To publish your dataset go to https://openneuro.org/datasets/ds002987
=======================================================================

[Image: Screenshot_2020-06-29 _test_julien - Dataset - OpenNeuro(1)]

Additional context

These files don't seem to be deletable.

The weird '"'s, which end up splitting paths across directory hierarchies, come from git ls-tree. It handles unprintable characters by wrapping the whole path in double quotes and backslash-escaping the unusual characters:

$ cd ds
$ git init
Initialized empty Git repository in /home/kousu/src/neuropoly/ds/sub-glen/ds/.git/
$ git add .
$ git commit -m "Initial commit"
[master (root-commit) 608fdd4] Initial commit
 12 files changed, 189 insertions(+)
 create mode 100644 .bidsignore
 create mode 100644 CHANGES
 create mode 100644 README
 create mode 100644 dataset_description.json
 create mode 100644 "partici\npants.json"
 create mode 100644 participants.tsv
 create mode 100644 sub-chiba750/anat/sub-chiba750_T2star.json
 create mode 100644 sub-chiba750/anat/sub-chiba750_T2star.nii.gz
 create mode 100644 "sub-glen/anat/sub-glen_T2\nstar.json"
 create mode 100644 "sub-glen/anat/sub-glen_T2\nstar.nii.gz"
 create mode 100644 sub-glen/anat/sub-glen_T2star.json
 create mode 100644 sub-glen/anat/sub-glen_T2star.nii.gz
$ git ls-tree -r HEAD
100644 blob 85e97a3a7c6b4a6d2bbb929efabd701bcc818b45    .bidsignore
100644 blob 18c58734bb03f7c2c40fcc47d37324b87f730cfd    CHANGES
100644 blob 2fd9bc41d45d58681463fa4b177bcd9fe30010b1    README
100644 blob eaf60a99b2525107297bf6daaf9b157015022302    dataset_description.json
100644 blob 7b32a531fb039bba362aee810fa870f874d21a10    "partici\npants.json"
100644 blob dfa3ea80bbc2206d6d91fa5ef954191c0e47948e    participants.tsv
100644 blob d60ab1036ae760a48b2b308559247b79e4d74835    sub-chiba750/anat/sub-chiba750_T2star.json
100644 blob 1fb9b137bd57db950a902c4ed434afe0f6e7f8b5    sub-chiba750/anat/sub-chiba750_T2star.nii.gz
100644 blob 083386e5e3e1327baf3eff646beb2716399bc4a1    "sub-glen/anat/sub-glen_T2\nstar.json"
100644 blob c6b4f696428a297f1ccea0772c52513b873aafd0    "sub-glen/anat/sub-glen_T2\nstar.nii.gz"
100644 blob 083386e5e3e1327baf3eff646beb2716399bc4a1    sub-glen/anat/sub-glen_T2star.json
100644 blob c6b4f696428a297f1ccea0772c52513b873aafd0    sub-glen/anat/sub-glen_T2star.nii.gz
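
One way a consumer of this listing could avoid the quoting altogether is git's -z flag, which terminates each record with a NUL byte and prints paths verbatim. The following is only a sketch of that idea, not OpenNeuro's actual code: the names TreeEntry and parse_ls_tree_z are hypothetical, and it assumes the documented record layout "<mode> SP <type> SP <object> TAB <path>".

```python
from typing import List, NamedTuple


class TreeEntry(NamedTuple):
    """One record of `git ls-tree -r -z` output (hypothetical helper type)."""
    mode: str
    type: str
    sha: str
    path: str


def parse_ls_tree_z(raw: bytes) -> List[TreeEntry]:
    """Parse NUL-terminated `git ls-tree -r -z` output.

    With -z, git does not quote paths, so newlines, tabs and quotes in a
    filename arrive verbatim. Records are split on NUL only, and each record
    is split on the FIRST tab only, because the path itself may contain tabs.
    """
    entries = []
    for record in raw.split(b"\0"):
        if not record:
            continue  # the trailing NUL leaves one empty chunk
        meta, path = record.split(b"\t", 1)
        mode, obj_type, sha = meta.decode("ascii").split(" ")
        # surrogateescape keeps paths that are not valid UTF-8 round-trippable
        entries.append(
            TreeEntry(mode, obj_type, sha, path.decode("utf-8", "surrogateescape"))
        )
    return entries


# Two records shaped like the listing above, as -z would emit them.
sample = (
    b"100644 blob 7b32a531fb039bba362aee810fa870f874d21a10\tpartici\npants.json\0"
    b"100644 blob 083386e5e3e1327baf3eff646beb2716399bc4a1\t"
    b"sub-glen/anat/sub-glen_T2\nstar.json\0"
)
entries = parse_ls_tree_z(sample)
```

In real use the raw bytes could come from something like subprocess.run(["git", "ls-tree", "-r", "-z", "HEAD"], capture_output=True, check=True).stdout. Because no quote character is ever added, the first path component of the second entry stays sub-glen rather than "sub-glen.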

I tried adding a file with an embedded tab in its name, and the pubsub notification that came through updated the list with the expected filename. However, upon reloading the dataset page, it too had '"'s around its path.

Other unusual characters, like \b or \a or tabs, cause a similar issue.
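
If the quoted form has to be consumed as-is, the quoting can be reversed. Below is a minimal sketch, assuming well-formed input in git's C-style quoting (surrounding double quotes, single-letter escapes such as \n, \t, \a, \b, and three-digit octal escapes for other bytes). The function name unquote_git_path is hypothetical and this is not taken from the OpenNeuro code base.

```python
# Single-character escapes used by git's C-style path quoting, mapped to byte values.
_SIMPLE_ESCAPES = {
    "a": 0x07, "b": 0x08, "f": 0x0C, "n": 0x0A, "r": 0x0D,
    "t": 0x09, "v": 0x0B, '"': 0x22, "\\": 0x5C,
}


def unquote_git_path(field: str) -> str:
    """Undo git's C-style quoting of a path; unquoted paths are returned unchanged.

    Assumes the input is well formed, i.e. exactly what git would emit.
    """
    if not (len(field) >= 2 and field.startswith('"') and field.endswith('"')):
        return field
    body = field[1:-1]
    out = bytearray()
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            out += ch.encode("utf-8")
            i += 1
            continue
        nxt = body[i + 1]
        if nxt in _SIMPLE_ESCAPES:
            out.append(_SIMPLE_ESCAPES[nxt])
            i += 2
        else:
            # Three octal digits encode one raw byte, e.g. \330\271 is U+0639.
            out.append(int(body[i + 1:i + 4], 8))
            i += 4
    return out.decode("utf-8", "surrogateescape")


restored = unquote_git_path('"sub-glen/anat/sub-glen_T2\\nstar.json"')
```

With this, the directory part of the restored path is again sub-glen/anat, so the file would be listed under the existing sub-glen folder instead of a second one whose name starts with a quote character.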

kousu commented 4 years ago

I tried it with Arabic characters and got a different problem: it accepted the upload, but the web UI says "Incomplete Upload", and every time I try again it re-uploads the same two files. One of them is the file with the Arabic name, but the other is just one of the normal files from the source dataset.

$ openneuro download -s 1.0.0 ds002982 ds
Downloading ".bidsignore" - size 33 bytes
Downloading "CHANGES" - size 65 bytes
Downloading "README" - size 6 bytes
Downloading "dataset_description.json" - size 250 bytes
Downloading "participants.json" - size 389 bytes
Downloading "participants.tsv" - size 934 bytes
Downloading "sub-chiba750/anat/sub-chiba750_T2star.json" - size 788 bytes
Downloading "sub-glen/anat/sub-glen_T2star.json" - size 1495 bytes
Downloading "sub-chiba750/anat/sub-chiba750_T2star.nii.gz" - size 5543775 bytes
Downloading "sub-glen/anat/sub-glen_T2star.nii.gz" - size 1446307 bytes
$ grep -ri git ds | grep -v Binary
ds/CHANGES:{"error": "git object command exited with non-zero return code (1)"}
ds/sub-chiba750/anat/sub-chiba750_T2star.json:{"error": "git object command exited with non-zero return code (1)"}
ds/participants.json:{"error": "git object command exited with non-zero return code (1)"}
$ openneuro download -s 1.0.0 ds002982 ds # redownload because of #1693 
Skipping present file ".bidsignore"
Downloading "CHANGES" - size 65 bytes
Skipping present file "README"
Skipping present file "dataset_description.json"
Downloading "participants.json" - size 389 bytes
Skipping present file "participants.tsv"
Downloading "sub-chiba750/anat/sub-chiba750_T2star.json" - size 788 bytes
Skipping present file "sub-glen/anat/sub-glen_T2star.json"
Skipping present file "sub-chiba750/anat/sub-chiba750_T2star.nii.gz"
Skipping present file "sub-glen/anat/sub-glen_T2star.nii.gz"
$ grep -ri git ds | grep -v Binary
$ mkdir ds/derivatives
$ echo cavabienaller > ds/derivatives/filter_علم الأعصاب.txt
$ openneuro upload ds
? This will create a new dataset, continue? Yes
bids-validator@1.5.3

        Summary:               Available Tasks:        Available Modalities: 
        5 Files, 5.29MB                                T2star                
        1 - Subject                                                          
        1 - Session                                                          

    If you have any questions, please post on https://neurostars.org/tags/bids.

"ds002988" created with label "ds"
=======================================================================
Upload Complete
To publish your dataset go to https://openneuro.org/datasets/ds002988
=======================================================================
Transferring "dataset_description.json" - 100% complete 
Transferring ".bidsignore" - 100% complete 
Transferring "CHANGES" - 100% complete 
Transferring "README" - 100% complete 
Transferring "participants.json" - 100% complete 
Transferring "participants.tsv" - 100% complete 
Transferring "derivatives/filter_علم" - 100% complete 
Transferring "sub-chiba750/anat/sub-chiba750_T2star.json" - 100% complete 
Transferring "sub-chiba750/anat/sub-chiba750_T2star.nii.gz" - 14% complete (4.5 MB remaining)
$ openneuro upload -d ds002988 ds
Adding files to "ds002988"
bids-validator@1.5.3

        Summary:               Available Tasks:        Available Modalities: 
        5 Files, 5.29MB                                T2star                
        1 - Subject                                                          
        1 - Session                                                          

    If you have any questions, please post on https://neurostars.org/tags/bids.

Skipping existing file - ".bidsignore"
Skipping existing file - "CHANGES"
Skipping existing file - "README"
Skipping existing file - "dataset_description.json"
Skipping existing file - "participants.json"
Skipping existing file - "participants.tsv"
Skipping existing file - "sub-chiba750/anat/sub-chiba750_T2star.json"
=======================================================================
Upload Complete
To publish the update go to https://openneuro.org/datasets/ds002988 and create a new snapshot
=======================================================================
Transferring "derivatives/filter_علم" - 100% complete 
Transferring "sub-chiba750/anat/sub-chiba750_T2star.nii.gz" - 14% complete (4.5 MB remaining)
$ openneuro upload -d ds002988 ds
Adding files to "ds002988"
bids-validator@1.5.3

        Summary:               Available Tasks:        Available Modalities: 
        5 Files, 5.29MB                                T2star                
        1 - Subject                                                          
        1 - Session                                                          

    If you have any questions, please post on https://neurostars.org/tags/bids.

Skipping existing file - ".bidsignore"
Skipping existing file - "CHANGES"
Skipping existing file - "README"
Skipping existing file - "dataset_description.json"
Skipping existing file - "participants.json"
Skipping existing file - "participants.tsv"
Skipping existing file - "sub-chiba750/anat/sub-chiba750_T2star.json"
=======================================================================
Upload Complete
To publish the update go to https://openneuro.org/datasets/ds002988 and create a new snapshot
=======================================================================
Transferring "derivatives/filter_علم" - 100% complete 
Transferring "sub-chiba750/anat/sub-chiba750_T2star.nii.gz" - 13% complete (4.6 MB remaining)

[Image: Screenshot from 2020-06-29 19-37-39]
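
A possible explanation for the shortened name in the transfer log ("derivatives/filter_علم"): the echo command in the transcript leaves the path unquoted, so the shell would split it at the space and redirect only to the first word. The sketch below approximates that word splitting with Python's shlex module, which is not a full shell parser. The helper redirect_target is hypothetical and assumes a single ">" token surrounded by spaces.

```python
import shlex

# The command as typed in the transcript, and a variant with the path quoted.
unquoted = "echo cavabienaller > ds/derivatives/filter_علم الأعصاب.txt"
quoted = 'echo cavabienaller > "ds/derivatives/filter_علم الأعصاب.txt"'


def redirect_target(command: str) -> str:
    """Return the word that follows '>' after shell-style word splitting."""
    tokens = shlex.split(command)
    return tokens[tokens.index(">") + 1]


target_unquoted = redirect_target(unquoted)  # path up to the space only
target_quoted = redirect_target(quoted)      # the full intended filename
```

Under this reading, the file created on disk was already named filter_علم, and the second word became an extra argument to echo. The incomplete upload would then be a separate problem from the space in the intended name.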