FCP-INDI / fcp_indi_bucket_maintenance

Scripts, how tos, FAQs, and issues related to maintaining the FCP-INDI public S3 bucket.
MIT License

ADHD200 - mismatch between numbers of participants and participants.tsv #5

Open satra opened 4 years ago

satra commented 4 years ago
$ ls RawDataBIDS/*/sub-*/ses-?/anat/*T1w.nii.gz  | wc -l
     960

and if i get the participants.tsv and simply concatenate them

$ find . -name "participants.tsv" | xargs cat | wc -l
     574

suggesting that about 400 participants are not indexed in the participants.tsv.
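note that `xargs cat | wc -l` also counts the header line of every participants.tsv, so the number of indexed participants is somewhat lower than 574. a minimal sketch that counts data rows only, assuming each file starts with a single header row (as BIDS participants.tsv files normally do); the function name `count_participant_rows` is hypothetical:

```shell
# Count data rows across all participants.tsv files below a directory.
# Assumption: the first line of each file is a header; blank lines are skipped.
count_participant_rows() {
    find "${1:-.}" -name 'participants.tsv' \
        -exec awk 'FNR > 1 && NF > 0' {} + | wc -l | tr -d ' '
}

# usage (from the directory that holds the per-site folders):
#   count_participant_rows .
```

awk resets `FNR` for every input file, so one header line per file is dropped without needing to know how many sites there are.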

i'm using datalad to get these files

cc/ @yarikoptic @dbkeator

@ccraddock - let us know who could fix these things assuming they are an issue.

yarikoptic commented 4 years ago

which dataset are we talking about? ADHD200 seems to not carry a participants.tsv and is all split per site:

$> datalad ls s3://fcp-indi/data/Projects/ADHD200/RawDataBIDS/
Connecting to bucket: fcp-indi
[INFO   ] S3 session: Connecting to the bucket fcp-indi with authentication 
Bucket info:
  Versioning: S3ResponseError: 403 Forbidden
     Website: S3ResponseError: 403 Forbidden
         ACL: S3ResponseError: 403 Forbidden
data/Projects/ADHD200/RawDataBIDS/Brown/
data/Projects/ADHD200/RawDataBIDS/KKI/
data/Projects/ADHD200/RawDataBIDS/KKI_1/
data/Projects/ADHD200/RawDataBIDS/KKI_2/
data/Projects/ADHD200/RawDataBIDS/NYU/
data/Projects/ADHD200/RawDataBIDS/NeuroIMAGE/
data/Projects/ADHD200/RawDataBIDS/OHSU/
data/Projects/ADHD200/RawDataBIDS/Peking_1/
data/Projects/ADHD200/RawDataBIDS/Peking_2/
data/Projects/ADHD200/RawDataBIDS/Peking_3/
data/Projects/ADHD200/RawDataBIDS/Pittsburgh/
data/Projects/ADHD200/RawDataBIDS/Pittsburgh_Test/
data/Projects/ADHD200/RawDataBIDS/WashU/
data/Projects/ADHD200/RawDataBIDS/du_1/
data/Projects/ADHD200/RawDataBIDS/mta_1/
data/Projects/ADHD200/RawDataBIDS/nyu_1/
satra commented 4 years ago

there is a participants.tsv per site.

yarikoptic commented 4 years ago

d'oh - didn't spot that there was find . for those files.

indeed no consistency between participants and sub- subdirectories for any site besides Peking_2,3:

```shell
$> for s in *; do echo $s; grep '^[0-9]' $s/part*.tsv | wc -l; /bin/ls -ld $s/sub-* | wc -l ;done
Brown
52
26
KKI
22
83
NeuroIMAGE
50
73
NYU
82
263
OHSU
68
113
Peking_1
102
136
Peking_2
67
67
Peking_3
42
42
Pittsburgh
18
98
WashU
61
60
```
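the loop above only compares counts. to see which subjects are actually missing, something like the following could list the sub-* directories of a site that have no row in its participants.tsv. this is a sketch under assumptions not confirmed in this thread: the first tab-separated column holds the participant id (optionally prefixed with `sub-`) and it matches the directory label exactly, with no zero-padding differences. `missing_from_participants` is a hypothetical name:

```shell
# List subject labels of one site that have a sub-* directory but no row
# in that site's participants.tsv.
# Assumptions: column 1 is the participant ID, optionally prefixed with
# "sub-", and it is spelled exactly like the directory label.
missing_from_participants() {
    site="$1"
    for d in "$site"/sub-*/; do
        [ -d "$d" ] || continue            # glob matched nothing
        id=$(basename "$d")
        id=${id#sub-}
        # Appending "" forces a string comparison, so leading zeros matter.
        awk -F '\t' -v id="$id" '
            NR > 1 { v = $1; sub(/^sub-/, "", v); if ((v "") == (id "")) found = 1 }
            END { exit !found }
        ' "$site/participants.tsv" || echo "$id"
    done
}

# usage:
#   for s in */; do echo "$s"; missing_from_participants "${s%/}"; done
```

if a site has no participants.tsv at all, awk fails and every subject of that site is reported as missing.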

FWIW, recrawled those subdatasets -- no changes in the bucket

satra commented 4 years ago

in ADHD200 this is the only mprage file: ['RawData/Peking_3/1404738/session_1/anat_1/mprage.nii.gz'] that doesn't have a correspondence in RawDataBIDS.

the good news is that all participant ids match between BIDS and RawData, so the participants.tsv is simply missing a lot of info. we are going to pull the info from the RawData phenotype files.
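for reference, a rough sketch of how such a RawData vs RawDataBIDS check could be scripted. the directory patterns are taken from the paths quoted in this thread; the mapping from a RawData id `<id>` to the BIDS directory `sub-<id>` is an assumption, and `rawdata_without_bids` is a hypothetical name:

```shell
# Print "<site>/<id>" for every RawData subject that has an mprage file
# but no T1w file under RawDataBIDS.
# Assumption: RawData ID <id> corresponds to the BIDS directory sub-<id>.
rawdata_without_bids() {
    root="${1:-.}"
    for f in "$root"/RawData/*/*/session_*/anat_*/mprage.nii.gz; do
        # -L also accepts annexed files whose content was not fetched
        [ -e "$f" ] || [ -L "$f" ] || continue
        rel=${f#"$root"/RawData/}
        site=${rel%%/*}
        rest=${rel#*/}
        id=${rest%%/*}
        # Expand the BIDS pattern; $1 stays the literal pattern if nothing matches.
        set -- "$root"/RawDataBIDS/"$site"/sub-"$id"/ses-*/anat/*T1w.nii.gz
        [ -e "$1" ] || [ -L "$1" ] || echo "$site/$id"
    done | sort -u
}

# usage (from the ADHD200 directory that holds RawData and RawDataBIDS):
#   rawdata_without_bids .
```

`sort -u` collapses subjects that have more than one session or anat folder into a single line.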

satra commented 4 years ago

these phenotypic csv's for rawdata are missing from the s3 bucket:

Peking_2_phenotypic.csv
Peking_3_phenotypic.csv