Question about using S3 downloader as suggested by your site

milicmil commented 2 years ago

Hi,

I am trying to download a subsection of the ABCD BIDS data from the collection for studies by our research centre in Toronto. We have ABCD access, but I am confused by the structure of the data frame expected by the -i argument of download.py. The website does not explain how to use it.

I know I need to get the links from datastructure_manifest.txt, but when I subset the text file to the data I need, save it as a CSV with headers, and pass it to download.py via -i, I keep getting a KeyError for "manifest_name" even though that column is in the data frame.

" -i S3_FILE, --input-s3 S3_FILE Path to the .csv file downloaded from the NDA containing s3 links for all subjects and their derivatives."

https://github.com/ABCD-STUDY/nda-abcd-s3-downloader

ericearl commented 2 years ago

@milicmil Can you share the exact error and the exact command line call?

milicmil commented 2 years ago

Hi,

For more context, this is the command I ran (in the folder where download.py is located):

python download.py -iderivatives_func_motion_task-MID2.csv -dmilos_subset.txt -o/external/rprshnas01/external_data/abcd/ABCD_BIDS/functional_task

milos_subset.txt contained only "derivatives.func.motion_task-MID" from data_subsets.txt, found in the GitHub repo for the downloader. derivatives_func_motion_task-MID2.csv contains the following columns: submission_id | dataset_id | submission_id.1 | manifest_name | manifest_file_name | associated_file

These columns come from datastructure_manifest.txt, which was downloaded from the NDA for study 3165 as described in the readme.md in nda-abcd-s3-downloader.

derivatives_func_motion_task-MID2.csv contained only the rows that had "derivatives.func.motion_task-MID" in the "manifest_file_name" column of datastructure_manifest.txt.

Error output

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "download.py", line 386, in <module>
    _cli()
  File "download.py", line 119, in _cli
    subject_list = get_subject_list(manifest_df, args.subject_list_file)
  File "download.py", line 153, in get_subject_list
    for manifest_name in manifest_df['manifest_name'].values:
  File "/opt/scc/conda/software/Python/3.8.5-Anaconda3-2021.03/lib/python3.8/site-packages/pandas/core/frame.py", line 3024, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/opt/scc/conda/software/Python/3.8.5-Anaconda3-2021.03/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3082, in get_loc
    raise KeyError(key) from err
KeyError: 'manifest_name'

ericearl commented 2 years ago

@milicmil That repo uses a python argument parser which expects a space after each argument. So more like this:

python download.py -i derivatives_func_motion_task-MID2.csv -d milos_subset.txt -o /external/rprshnas01/external_data/abcd/ABCD_BIDS/functional_task

milicmil commented 2 years ago

Thank you so much for the feedback. I just tried it and got the error shown below. I can check with folks at my center about what could be wrong and whether I am loading the wrong Python version.

I have 2 follow up questions just to make sure I am using the correct arguments:

Would you be able to write a super simple example of a command that will download something? Is my logic in terms of composing derivatives_func_motion_task-MID2.csv correct? Do I need to keep all of the columns from datastructure_manifest.txt when I am subsetting the columns for the data I am looking for?

Thank you very much for your time and help on this,

Milos Milic

       Log folder:     /nethome/kcni/mmilic
        S3 Spreadsheet: derivatives_func_motion_task-MID2.csv
        Subjects:       All subjects
Traceback (most recent call last):
  File "/opt/scc/conda/software/Python/3.8.5-Anaconda3-2021.03/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3080, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 4554, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 4562, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'manifest_name'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "download.py", line 386, in <module>
    _cli()
  File "download.py", line 119, in _cli
    subject_list = get_subject_list(manifest_df, args.subject_list_file)
  File "download.py", line 153, in get_subject_list
    for manifest_name in manifest_df['manifest_name'].values:
  File "/opt/scc/conda/software/Python/3.8.5-Anaconda3-2021.03/lib/python3.8/site-packages/pandas/core/frame.py", line 3024, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/opt/scc/conda/software/Python/3.8.5-Anaconda3-2021.03/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3082, in get_loc
    raise KeyError(key) from err
KeyError: 'manifest_name'

ericearl commented 2 years ago

@milicmil If I am understanding correctly, you are setting up to download by making a subset of the datastructure_manifest.txt file. You do not want to do that. You should instead take a look at the https://github.com/ABCD-STUDY/nda-abcd-s3-downloader/blob/master/data_subsets.txt file and adjust it accordingly in your downloaded copy of the repository.

For instance, if you want just the MID task motion files, I would suggest you keep the following line in the data_subsets.txt file and discard the other lines as you see fit:

https://github.com/ABCD-STUDY/nda-abcd-s3-downloader/blob/master/data_subsets.txt#L54

Does that help/make sense?

milicmil commented 2 years ago

Hi,

It is currently set up like that: milos_subset.txt contains only the "derivatives.func.motion_task-MID" line. That is literally it.

arueter1 commented 2 years ago

Hi @milicmil - thanks for reaching out, and also thanks for your patience! Our lab is slammed right now. I've added some folks to this issue and we will hopefully be able to respond to it in the next few weeks after the holiday.

ericfeczko commented 2 years ago

Hey everyone! We met today to troubleshoot. There were some issues with formatting that we've resolved :) The documentation on the page could be more specific about how the datastructure_manifest.txt file should be formatted. We should make some small changes to the nda-s3-downloader documentation to clarify :)

milicmil commented 2 years ago

Dr. Feczko helped me out to resolve the issue.

Basically, in the end the problem was that derivatives_func_motion_task-MID2.csv was a comma-separated CSV file, not a tab-delimited text file. download.py expects a tab-delimited file for the -i argument, even though the help blurb states it wants a .csv file:

        help=("Path to the .csv file downloaded from the NDA containing s3 links "
              "for all subjects and their derivatives.")

download.py reads the file into manifest_df as tab-delimited, as indicated in line 118:

manifest_df = read_csv(args.s3_file, sep='\t')
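
For anyone who hits the same KeyError, here is a minimal sketch of the failure and the fix using pandas. The manifest rows, the example s3 paths, and the choice to filter on the manifest_name column are made up for illustration; they are assumptions, not the real contents of datastructure_manifest.txt. The point is only that a comma-separated file read with sep='\t' collapses into a single column, so 'manifest_name' is never found, whereas a file written with sep='\t' reads back correctly.

```python
import io

import pandas as pd

# Hypothetical stand-in for datastructure_manifest.txt (tab-delimited).
# Column names follow the thread; the values are invented for illustration.
manifest_text = (
    "submission_id\tdataset_id\tmanifest_name\tassociated_file\n"
    "1\t10\tsub-A.derivatives.func.motion_task-MID.manifest.json\ts3://example-bucket/a.tsv\n"
    "2\t11\tsub-B.inputs.anat.T1w.manifest.json\ts3://example-bucket/b.nii.gz\n"
)
manifest_df = pd.read_csv(io.StringIO(manifest_text), sep="\t")

# Keep only the rows for one data subset (plain substring match, no regex).
wanted = "derivatives.func.motion_task-MID"
subset = manifest_df[manifest_df["manifest_name"].str.contains(wanted, regex=False)]

# Saving with the default comma separator and reading back with sep='\t'
# (as download.py does) leaves the whole header in one column.
comma_text = subset.to_csv(index=False)
bad = pd.read_csv(io.StringIO(comma_text), sep="\t")
print(len(bad.columns), "manifest_name" in bad.columns)

# Saving with sep='\t' keeps the columns separate when read the same way.
tab_text = subset.to_csv(index=False, sep="\t")
good = pd.read_csv(io.StringIO(tab_text), sep="\t")
print(len(good.columns), "manifest_name" in good.columns)
```

To write the subset to disk instead of a string, pass a file path as the first argument to to_csv and keep sep="\t" and index=False.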

ABCD-STUDY / nda-abcd-collection-3165

Question about using S3 downloader as suggested by your site #23
