aces / Loris-MRI

The set of scripts that preprocess and insert MRI data into the database.
10 stars 50 forks source link

[EEG] extract archive script #970

Closed laemtl closed 1 year ago

laemtl commented 1 year ago

This PR requires the mysql patch in LORIS #8685

Test cases:

laemtl commented 1 year ago

what happens if people do not have the Config EEGS3DataPath? Maybe you could add an if statement early in the script to check if the config exists, if not, exit with error?

The script checks if the config exists and is set otherwise {dataDirBasepath}/assembly_bids is used.

cmadjar commented 1 year ago

@laemtl I encountered a few issues when testing. See below:

(loris-mri-py39) lorisadmin@cecile-dev:/opt/loris/bin/mri/python$ python extract_eeg_bids_archive.py -p database_config.py -u 120
Downloading s3://eegarchive/PIUMN0013_967379_V03_bids.tar.gz from https://s3.msi.umn.edu/loris_demo_test_cmadjar to /data/tmp/extract_eeg_bids_archive_2023-05-05_12h50m52s_6w8_g9w6/PIUMN0013_967379_V03_bids.tar.gz

[ERROR   ] s3://eegarchive/PIUMN0013_967379_V03_bids.tar.gz could not be downloaded from S3 bucket. Error was
s3://eegarchive/PIUMN0013_967379_V03_bids.tar.gz download failure = An error occurred (404) when calling the HeadObject operation: Not Found

However, the file exists in my s3:

(loris-mri-py39) lorisadmin@cecile-dev:/opt/loris/bin/mri/python$ s3cmd -c ~/s3config_files/msi_cecile_sandbox_s3.s3cfg ls s3://eegarchive/PIUMN0013_967379_V03_bids.tar.gz
2023-05-05 12:45 725.6496305465698M  s3://eegarchive/PIUMN0013_967379_V03_bids.tar.gz

Could it be because the default bucket is pointing to the S3 bucket with the assembly_bids files? Do I need to add a config in my database_config.py file?

Is there a patch somewhere to add the enum? Also, in general, we try to avoid enum in LORIS so we are not depending on doing a major or minor release for a data change. Instead, it might be best to create a separate table with the list of status and use the ID of the statuses in electrophysiology_uploader. Then later on, if you want to add a new status, this is just an INSERT statement instead of an ALTER TABLE.

Executing query:
    UPDATE electrophysiology_uploader SET Status = 'Extracted' WHERE UploadLocation = %s
With arguments:
    ('PIUMN0013_967379_V03_bids.tar.gz',)

Traceback (most recent call last):
  File "/opt/loris/bin/mri/python/lib/database.py", line 213, in update
    cursor.execute(query, args)
  File "/home/lorisadmin/miniconda3/envs/loris-mri-py39/lib/python3.9/site-packages/MySQLdb/cursors.py", line 206, in execute
    res = self._query(query)
  File "/home/lorisadmin/miniconda3/envs/loris-mri-py39/lib/python3.9/site-packages/MySQLdb/cursors.py", line 319, in _query
    db.query(q)
  File "/home/lorisadmin/miniconda3/envs/loris-mri-py39/lib/python3.9/site-packages/MySQLdb/connections.py", line 254, in query
    _mysql.connection.query(self, query)
MySQLdb.DataError: (1265, "Data truncated for column 'Status' at row 120")

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/loris/bin/mri/python/extract_eeg_bids_archive.py", line 217, in <module>
    main()
  File "/opt/loris/bin/mri/python/extract_eeg_bids_archive.py", line 210, in main
    db.update(
  File "/opt/loris/bin/mri/python/lib/database.py", line 216, in update
    raise Exception("Update query failure: " + format(err))
Exception: Update query failure: (1265, "Data truncated for column 'Status' at row 120")

Once I modified the ENUM status column, I rerun the script and the extraction happens, however, it extracts it with the PSCID instead of the CandID which will be an issue for HBCD as all data are under CandID. Is there a config to use the CandID when extracting instead of the PSCID? If so, need to document that for deployment

The script still prints:

Uploading s3://loris_demo_f/sub-PIUMN0013/ses-V03/eeg/sub-PIUMN0013_ses-V03_task-MMN_acq-eeg_eeg.set to https://s3.msi.umn.edu/loris_demo_test_cmadjar
Uploading s3://loris_demo_f/sub-PIUMN0013/ses-V03/eeg/sub-PIUMN0013_ses-V03_task-MMN_acq-eeg_channels.tsv to https://s3.msi.umn.edu/loris_demo_test_cmadjar

However, nothing get uploaded since s3://loris_demo_f/ does not exist.

Also, seems like the message Uploading <wrong bucket name> to <default bucket name> is misleading since in theory, it uploads the content of the tmp directory where the data was extracted into the S3 in this case.

cmadjar commented 1 year ago

FYI: list of tests performed:

cmadjar commented 1 year ago

@laemtl I just remembered about the permissions of the extract script. See point 5 in https://github.com/aces/Loris-MRI/pull/970#issuecomment-1536244577

laemtl commented 1 year ago

Once I modified the ENUM status column, I rerun the script and the extraction happens, however, it extracts it with the PSCID instead of the CandID which will be an issue for HBCD as all data are under CandID. Is there a config to use the CandID when extracting instead of the PSCID? If so, need to document that for deployment

Extracting using the CandID will be possible if the BIDS dataset is set to use the CandId as the BIDs subject_id. The dataset is probably not setup this way. To cover the conversion of subject-id for the previously uploaded dataset, another script can be created.

laemtl commented 1 year ago

Is there a patch somewhere to add the enum? Also, in general, we try to avoid enum in LORIS so we are not depending on doing a major or minor release for a data change. Instead, it might be best to create a separate table with the list of status and use the ID of the statuses in electrophysiology_uploader. Then later on, if you want to add a new status, this is just an INSERT statement instead of an ALTER TABLE.

Yes, here is the link to the PR for the patch: https://github.com/aces/HBCD/pull/895 I will keep this at an enum for now since it makes it easier to handle DEFAULT values. We can rediscuss later about that :)