abcd-j / data-catalog

https://data.abcd-j.de

Generating a filelist #22

Closed jsheunis closed 3 months ago

jsheunis commented 6 months ago

I realised I hadn't made this functionality part of the original code in the repo, and it's necessary for the JTrack EMA dataset.

For historical context, see https://github.com/psychoinformatics-de/sfb1451-projects-catalog/issues/47.

Here are the steps I ran locally, which worked fine:

dir2filetable

datalad clone https://gin.g-node.org/JTrack/EMA_Pilot
cd EMA_Pilot
# I have the source code for datalad-tabby available locally:
python ../../datalad-tabby/tools/dir2filetable.py . --output .

This generates a file, files@tby-ds1.tsv, in the current directory.

remove and sort lines

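# `sed -i ''` is the BSD/macOS form; GNU sed takes `sed -i` without the empty string.
# The dots are escaped so that e.g. '.git' does not also match names like 'digit'.
sed -i '' '/\.DS_Store/d' ./files@tby-ds1.tsv
sed -i '' '/\.git/d' ./files@tby-ds1.tsv
sed -i '' '/\.datalad/d' ./files@tby-ds1.tsv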
sort files@tby-ds1.tsv > sorted_files@tby-ds1.tsv

Afterwards, make sure the header line is still the first line of the file, since sorting can move it.
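
The filter-and-sort step can also be sketched in Python, which keeps the header line first without a manual check. This is a minimal sketch, not part of datalad-tabby: the function names `filter_and_sort` and `to_tsv` are hypothetical, and the column name `path[POSIX]` is taken from the script further down. Unlike the sed patterns, it matches whole path components, so a file such as `.gitattributes` would be kept.

```python
import csv
import io
from pathlib import PurePosixPath

# Assumption: path components to drop, mirroring the sed patterns above.
EXCLUDED_PARTS = {".DS_Store", ".git", ".datalad"}


def filter_and_sort(tsv_text, path_field="path[POSIX]"):
    """Return (fieldnames, rows): unwanted paths dropped, rows sorted by path."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = [
        row for row in reader
        # Drop a row if any component of its path is in the excluded set.
        if not EXCLUDED_PARTS & set(PurePosixPath(row[path_field]).parts)
    ]
    rows.sort(key=lambda row: row[path_field])
    return reader.fieldnames, rows


def to_tsv(fieldnames, rows):
    """Serialise the rows as TSV text with the header written first."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```

Because only the data rows are sorted and the header is written separately, the header cannot end up in the middle of the file.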

add urls

I created a short python script for this:

from argparse import ArgumentParser
import csv
from pathlib import Path

url_root = 'https://gin.g-node.org/JTrack/EMA_Pilot/raw'
url_version = '35a81f3643192ac512a4cd57d0a68f8ac41359e7'
fieldnames = ['path[POSIX]', 'size[bytes]', 'checksum[md5]', 'url']

if __name__ == "__main__":
    # Argument parsing and validation
    parser = ArgumentParser()
    parser.add_argument(
        "file_path", type=str, help="Path to file with filelist",
    )    
    parser.add_argument(
        "out_path", type=str, help="Path to output file",
    )    
    args = parser.parse_args()    

    with open(Path(args.file_path), encoding='utf8', newline='') as file:
        reader = csv.DictReader(file, delimiter='\t')
        out_rows = []
        for row in reader:
            row['url'] = url_root + '/' + url_version + '/' + row['path[POSIX]']
            out_rows.append(row)    

    with open(Path(args.out_path), 'w', encoding='utf8', newline='') as output_file:
        fc = csv.DictWriter(
            output_file,
            fieldnames=fieldnames,
            delimiter='\t'
        )
        fc.writeheader()
        fc.writerows(out_rows)

Save it (add_file_urls.py) and then run it:

python add_file_urls.py sorted_files@tby-ds1.tsv sorted_files_urls@tby-ds1.tsv
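
The script builds each URL by plain string concatenation, which could produce invalid URLs for paths containing spaces or other special characters. A possible variant, as a sketch only: the helper name `file_url` is hypothetical, and percent-encoding the path is an addition that has not been tested against this dataset.

```python
from urllib.parse import quote

URL_ROOT = "https://gin.g-node.org/JTrack/EMA_Pilot/raw"
URL_VERSION = "35a81f3643192ac512a4cd57d0a68f8ac41359e7"


def file_url(posix_path, root=URL_ROOT, version=URL_VERSION):
    """Build the raw-file URL for one path from the file list."""
    # quote() leaves "/" untouched by default, so directory separators survive
    # while spaces and other reserved characters are percent-encoded.
    return f"{root}/{version}/{quote(posix_path)}"


print(file_url("data/sub 01/file.csv"))
```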

jsheunis commented 6 months ago

Obviously, this can be automated much more efficiently for future cases.