jcohenadad opened this issue 2 years ago:

Where do these data come from? I see the JSON files mention UNF. I'm just concerned about our rights to make them publicly available.
Good point, I will check. It is NOT of UNF origin, for sure. The JSONs were fabricated from the data_testing sub-unf01 JSON data, with minor adjustments for protocol only (see below: compare MS01 vs. UNF01 for T1w (same copy) and T2w (merely a name change)).
I need to check the NIfTI hashes to ascertain the exact source, but it was certainly not from UNF or DataLad. ~I believe it is the publicly available MS2015 competition data that I downloaded, but I need to double check.~
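(For what it's worth, a quick sketch of how that hash comparison could be run; both file paths below are placeholders, not the actual layout:)

```python
import hashlib
from pathlib import Path

def sha256sum(path):
    """Return the SHA-256 hex digest of a file, read in 1 MB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# Both paths are placeholders: one file from the test dataset,
# one candidate source file from the downloaded archive.
print(sha256sum(Path("data_ms_testing/sub-ms01/anat/sub-ms01_T1w.nii.gz")))
print(sha256sum(Path("training_final_v4/candidate_source.nii")))
```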
Confirmed to be the MS2015 competition training data set, publicly available at https://smart-stats-tools.org/lesion-challenge-2015 (source: https://smart-stats-tools.org/sites/default/files/lesion_challenge/training_final_v4.zip, description: https://smart-stats-tools.org/sites/default/files/lesion_challenge/Training_data_description.pdf).
I had to fabricate the JSON sidecars and curate the data into BIDS format, as the original data set was not organized that way; the metadata JSONs were manufactured to get it working through the ivadomed pipeline.
> Confirmed to be the MS2015 competition training data set, publicly available at https://smart-stats-tools.org/lesion-challenge-2015 (source: https://smart-stats-tools.org/sites/default/files/lesion_challenge/training_final_v4.zip, description: https://smart-stats-tools.org/sites/default/files/lesion_challenge/Training_data_description.pdf).
OK, in that case the provenance of the data needs to be clearly indicated alongside the data repos (e.g., in the README).
> I had to fabricate the JSON sidecars and curate the data into BIDS format, as the original data set was not organized that way; the metadata JSONs were manufactured to get it working through the ivadomed pipeline.
I strongly advise against doing this, because it will cause confusion later on. E.g.: someone finds these data, then uses the metadata to do something with them (e.g., test a segmentation pipeline), and that would be problematic because these metadata are wrong. Two options:

1) If the data are "not usable" (e.g., only a black volume), then I guess it's OK to have this JSON file (although I'm still not a big fan of this idea).
2) You could create a JSON from scratch, filling only the entries you absolutely need (e.g., T1w), so as to minimize wrong information; see the sketch below.
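A minimal sketch of option 2, assuming a standard BIDS layout (the path and field values below are placeholders, not real acquisition parameters):

```python
import json
from pathlib import Path

# Keep only the entries the pipeline actually needs; everything else is
# omitted rather than copied from an unrelated scan (values are placeholders).
sidecar = {
    "Modality": "MR",
    "ProtocolName": "T1w",
}

out = Path("sub-ms01/anat/sub-ms01_T1w.json")  # placeholder path
out.parent.mkdir(parents=True, exist_ok=True)
out.write_text(json.dumps(sidecar, indent=4) + "\n")
```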
Another thing: if the purpose of this dataset is to run unit tests across various scenarios (e.g., data from one session is missing), then there is no need for such a large dataset. You can use volumes that are 4x4x4 voxels and still run those tests. That will save a lot of time during CI (and bandwidth, etc.).
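For example, a throwaway 4x4x4 volume can be generated with nibabel (the output file name is just an example):

```python
import numpy as np
import nibabel as nib

# A 4x4x4 random volume is enough to exercise IO and multi-session logic
# in unit tests without bloating the repository or CI runtime.
data = np.random.rand(4, 4, 4).astype(np.float32)
img = nib.Nifti1Image(data, affine=np.eye(4))
nib.save(img, "sub-ms01_T1w.nii.gz")  # example file name
```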
Great idea. 👍 I actually didn't give much thought to these data being used in scenarios other than testing IO for the ivadomed multi-session/multi-contrast unit tests.
I will shortly hand-craft all the JSON files with the appropriate information I can glean from the training data set description file, and remove all incorrect information to bring the JSONs up to standard.
This way, should we decide to use the dataset for any other purpose, it will be fully documented.
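(Concretely, the clean-up could be as simple as stripping every sidecar down to a whitelist of fields that the description PDF can back up; the root directory and field list below are illustrative:)

```python
import json
from pathlib import Path

# Fields we can actually justify from the MS2015 description document;
# anything else is dropped (this whitelist is illustrative).
KEEP = {"Modality", "MagneticFieldStrength", "ProtocolName"}

for sidecar in Path("data_ms_testing").rglob("*.json"):  # placeholder root
    meta = json.loads(sidecar.read_text())
    cleaned = {k: v for k, v in meta.items() if k in KEEP}
    sidecar.write_text(json.dumps(cleaned, indent=4) + "\n")
```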
I will also update the README.md to include the correct information about the source data this set is derived from.
Thanks for bringing this to my attention; I will ensure all future training/test data are held to a similar standard.
Thanks. My main point was about coming up with a lightweight dataset for CI (a few MB). The dataset you put together is overkill if the purpose is just unit testing.