PennLINC / xcp_d

Post-processing of fMRIPrep, NiBabies, and HCP outputs
https://xcp-d.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Add integration tests with mocked/scrambled HCP and DCAN/ABCD-HCP derivatives #915

Open tsalo opened 1 year ago

tsalo commented 1 year ago

Summary

We recently discussed the issue of testing our HCP/DCAN ingression code outside of intermittent full runs on real data on clusters. We decided to try taking some HCP and DCAN data and making them usable as test data. This process involves (1) anonymizing any metadata, (2) scrambling or replacing the actual imaging data with random values, (3) reducing the size of the datasets by only including the files XCP-D needs, and (4) reducing the volume-wise data to only retain about 60 volumes.
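
For the volumetric files, a minimal sketch of steps (2) and (4) might look like the following, assuming nibabel and numpy; the file names are hypothetical, and a real script would also need to handle CIFTIs and the text-based regressor files.

```python
import nibabel as nib
import numpy as np

N_VOLS = 60  # number of volumes to retain

# Hypothetical preprocessed BOLD file from the HCP/DCAN derivatives
img = nib.load("sub-01_task-rest_bold.nii.gz")
shape = img.shape[:3] + (min(N_VOLS, img.shape[3]),)

# Replace the real signal with random values so no subject data is distributed
rng = np.random.default_rng(seed=0)
scrambled = rng.standard_normal(shape).astype(np.float32)

out = nib.Nifti1Image(scrambled, affine=img.affine, header=img.header)
out.set_data_dtype(np.float32)  # store as float32 to keep the file small
out.to_filename("sub-01_task-rest_desc-scrambled_bold.nii.gz")
```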

Next steps

  1. Find a usable HCP-YA preprocessed subject and a usable ABCD-HCP preprocessed subject.
  2. Anonymize any metadata, including subject IDs and session IDs.
  3. Remove any files we don't use for XCP-D. These datasets typically have ~50-60 GB of data per session, which is far too much for CircleCI or Box.
  4. Reduce the volume-wise data to only retain about 60 volumes. We may also need to reduce the resolution of the imaging data (at least the NIfTIs).
    1. Preprocessed NIfTIs
    2. Preprocessed CIFTIs
    3. Movement regressor files
    4. Any other volume-wise files?
  5. Bundle the datasets into tar.gz files and upload them to Box (see the sketch after this list).
  6. Write integration tests for the new datasets.
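
For the bundling step, Python's built-in tarfile module should be enough; here's a rough sketch with a hypothetical directory name (the Box upload would still happen separately):

```python
import tarfile
from pathlib import Path

# Hypothetical pared-down, scrambled dataset directory
dataset_dir = Path("hcp_ya_scrambled")

with tarfile.open(f"{dataset_dir.name}.tar.gz", "w:gz") as tar:
    # Store paths relative to the dataset root so the archive unpacks cleanly
    tar.add(dataset_dir, arcname=dataset_dir.name)
```
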
mattcieslak commented 1 year ago

@tsalo I had an idea for this: what if we took one of the other testing input datasets and resampled it into the coordinates/format of HCP/DCAN/ABCD? Then we'd have an additional sanity check that the Pearson correlation coefficients should be very similar.
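
Something like this sketch is roughly what I have in mind, using nilearn for the resampling and parcel extraction; all of the file names and the atlas are placeholders:

```python
import numpy as np
from nilearn.image import resample_to_img
from nilearn.maskers import NiftiLabelsMasker

bold = "sub-01_space-MNI152NLin2009cAsym_desc-preproc_bold.nii.gz"  # placeholder
reference = "hcp_space_reference.nii.gz"  # placeholder HCP/DCAN-grid image
atlas = "atlas.nii.gz"  # placeholder parcellation

# Resample the existing test run onto the HCP/DCAN reference grid
resampled = resample_to_img(bold, reference, interpolation="continuous")

# Extract parcel time series and compute Pearson connectivity for both versions
masker = NiftiLabelsMasker(labels_img=atlas)
orig_conn = np.corrcoef(masker.fit_transform(bold).T)
resamp_conn = np.corrcoef(masker.fit_transform(resampled).T)

# Sanity check: connectivity estimates should be nearly identical across formats
triu = np.triu_indices_from(orig_conn, k=1)
assert np.corrcoef(orig_conn[triu], resamp_conn[triu])[0, 1] > 0.99
```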

tsalo commented 1 year ago

I like that idea, but the other testing datasets have really low-quality data, so if we go that route I think we should replace them entirely.

tsalo commented 1 year ago

At minimum, I'd want to replace the ds001419 test dataset, which has fMRIPrep NIfTI and CIFTI derivatives generated and shared by the OpenNeuro team. They didn't do any QC on the data, and I didn't realize that the normalization (at minimum) was really crappy until after I had added it to the test suite.

We can replace that dataset with a PNC subject.

mattcieslak commented 1 year ago

Maybe create a repo for mocking up the test data.

tsalo commented 1 year ago

I started working on this in https://github.com/PennLINC/xcp_d_test_data.