bids-standard / bids-specification

Brain Imaging Data Structure (BIDS) Specification
https://bids-specification.readthedocs.io/
Creative Commons Attribution 4.0 International
276 stars 159 forks source link

acquisition date anonymization #538

Open guiomar opened 4 years ago

guiomar commented 4 years ago

Hi!

After discussion at: https://github.com/mne-tools/mne-study-template/pull/160#

From the BIDS specs:

"For anonymization purposes all dates within one subject should be shifted by a randomly chosen (but consistent across all runs etc.) number of days. This way relative timing would be preserved, but chances of identifying a person based on the date and time of their scan would be decreased. Dates that are shifted for anonymization purposes should be set to a year 1925 or earlier to clearly distinguish them from unmodified data. Shifting dates is RECOMMENDED, but not required."

Summarizing:

Since currently the spec remains a bit open, and people could still use earlier than 1925 dates, we should at least warn in the specs, that this may produce errors when working with FIF files (e.g. for M/EEG).

Alternatively, talking to Marc he suggested to use an additional field which indicate wether a dated was randomized. This will potentially end these discussions around the acq dates issue. I think this is an elegant solution. I wonder if there is a reason why it has not been already implemented?

The possible solutions I see are: 1) Impose 1925 (remove "or earlier" from the specs) 2) Add a warning on the specs to indicate that earlier dates may crash FIF files when working with M/EEG 3) Add an additional field to indicate whether a date was randomized and keep usual acq date ranges

What do you think?

@hoechenberger @jasmainak @agramfort @sappelhoff @alexrockhill @robertoostenveld @Moo-Marc

Moo-Marc commented 4 years ago

As Guio mentioned, my suggestion is to remove the specific date recommendation and simply add a flag like "AnonymizedAcqTime = true" in dataset_description.json.

jasmainak commented 4 years ago

yes, I think that's a nice solution. I think it's also worth mentioning that fif files do form a big market in the MEG community (not sure what percentage), not just a format that needs to be fixed. I believe there are hurdles for that to happen any time soon.

sappelhoff commented 4 years ago

simply add a flag like "AnonymizedAcqTime = true" in dataset_description.json.

Can somebody explain in detail how that would help, please?

jasmainak commented 4 years ago

Can somebody explain in detail how that would help, please?

The problem is fif files cannot have dates before 1901. If you remove the restriction of 1925, it makes it much easier to anonymize. You only need to anonymize up to a year rather than figure out a crazy big number that is not too big.

Moo-Marc commented 4 years ago

You no longer have to look at the year to know if it's a real date or not. And it removes the unnecessary restriction. You can now have a real or anonymized date of 2000-01-01 for example.

sappelhoff commented 4 years ago

Ahhh okay thanks! Let me rephrase:

Sounds good to me in principle. Let's collect more opinions and if we converge, we can do a small PR!

effigies commented 4 years ago

tl;dr This discussion seems to have focused on a technical solution to a communication problem. Of the three options presented by @guiomar, only # 2 (Add a warning on the specs to indicate that earlier dates may crash FIF files when working with M/EEG) seems to address the problem.


So, if I'm reading the original thread correctly, the problem is that a dataset might anonymize to a pre-1901 date. If the data needs to be converted to FIF (or FIF-encoded M/EEG data added in a later session), then it may be impossible to correctly encode dates across the dataset without changing all of the dates.

If this is correct, the solution does not seem to match the problem. Asserting anonymization occurred doesn't modify the need to shift the anonymization dates post-1901 if there's any risk a dataset or its derivatives might include FIF data. And for datasets yet to be collected or curated, the thing that needs to happen is that curators need to encode dates post-1901, regardless of any anonymization assertion flag.

It has always been possible to shift dates into some other range, and anonymization is simply something the dataset user has to trust. The idea with the pre-1925 dates is that it is impossible to have collected the data on those dates, and therefore it is unambiguous and directly validatable. Putting AnonymizedAcqTime or similar into the dataset_description.json would simply be encoding "trust me". This doesn't have obvious value to me.

jasmainak commented 4 years ago

Thanks for your thoughts @effigies !

I disagree that # 2 is necessarily the best solution. This is because it assumes that everyone who works with BIDS data reads the specification. More often than not, people use some software for anonymization, thus keeping this weird 1925 date means every MEG software that does anonymization for the purposes of sharing with the BIDS standard will have to add warnings and checks to prevent users from using pre-1901 dates.

Note that HIPAA does not require year to be anonymized and for this reason many packages implement the anonymization in terms of number of days to shift rather than via a specific date. This in turn means that we have to calculate the number of days until 1925, or alternatively change how anonymization is done. All this seems too much of a hassle when we should be in fact making it easier for folks to share data.

My final argument is that group studies can have a range of dates and it may be necessary to keep the relative times. In that case, one has to figure out the number of days that takes you back up to 1925 but not further than 1901 for each recording and then take their minimum/maximum.

Note that dates are not the only thing one has to anonymize, thus we still have to go the "trust me" route even with the 1925 criteria.

Moo-Marc commented 4 years ago

Putting AnonymizedAcqTime or similar into the dataset_description.json would simply be encoding "trust me".

Anonymization does not imply shifting dates, it is not required by all ethics boards, and it is only recommended in BIDS. So it has value to me to specify this. By your logic, having the year shifted very early would also be of no value.

Of course, you are correct that to directly address the issue, a warning could be added wherever we talk about shifting dates.

sappelhoff commented 4 years ago

I pushed a commit in https://github.com/bids-standard/bids-specification/pull/546 that should take care of a proper "warning".

Please review

Moo-Marc commented 4 years ago

While the warning is taken care of, any objection to going ahead with an explicit flag instead of an early date recommendation?
Should it be REQUIRED when dates are shifted? (seems logical to me, but also not essential) Any better name suggestions? Not sure "AnonymizedAcqTime" is that clear. "ShiftedDates" or "DateShifting" maybe?

robertoostenveld commented 4 years ago

Both the name of the "key" in the JSON but also the potential values need to be considered. I see multiple options:

There are multiple ways that the date and/or time can be modified:

I don't think that these cases allow themselves to be described properly with a single key-value pair like AnonymizedAcqTime=true/false

robertoostenveld commented 4 years ago

Oh, and I don't think that dataset_description.json is the right place to store this. According to https://bids-specification.readthedocs.io/en/stable/03-modality-agnostic-files.html#scans-file the acq_time information goes into scans.tsv, which means that this should be documented in a corresponding scans.json data dictionary.

There can be a separate scans.json for each subject (theoretically allowing for the acquisition time of some subjects to be anonymized, while that of others not) or using the inheritance principle there can be a scans.json at the top level.

Moo-Marc commented 4 years ago

There are multiple ways that the date and/or time can be modified

Sure, but I wouldn't go into too much detail. I think for most cases a simple note that the dates are shifted will alert the users not to conclude too much from them. I see little use in knowing date vs time details for example. Of course, within subject relative timing can be important as is already noted in the spec. We could have a few values: AcqTimeShift: None, Global, PerSubject, PerScan. I guess hyperscanning is one case where shifting globally might be preferred.

Regarding where to put the key, can it be in scans.json? I thought those files can only contain a description of the columns. It could be mentioned in the acq_time Description field, but that wouldn't be standardized. Is it possible to have the following?

"acq_time" : {
    "Description" : "Date/time of the scan",
    "Shift" : "PerSubject"
},

If so, I like that solution.

hoechenberger commented 4 years ago

@robertoostenveld

There can be a separate scans.json for each subject (theoretically allowing for the acquisition time of some subjects to be anonymized, while that of others not) or using the inheritance principle there can be a scans.json at the top level.

I very much like this idea. This would also enable us to lift the "pre-1925" limitation, as by looking at scans.json it will become clear whether the acq_time in scans.tsv was anonymized in some way or not.

@Moo-Marc

We could have a few values: AcqTimeShift: None, Global, PerSubject, PerScan.

I'm a little confused. What would "Global" imply, and why do you think it would only be relevant in hyperscanning? What would be the difference between PerSubject and PerScan?

Regarding where to put the key, can it be in scans.json? I thought those files can only contain a description of the columns. It could be mentioned in the acq_time Description field, but that wouldn't be standardized. Is it possible to have the following?

The current spec says:

Note however, when a JSON file is used as an accompanying sidecar file for a TSV file, the keys linking a TSV column with their description in the JSON file need to follow the exact formatting as in the TSV file.

Therefore,

Is it possible to have the following?

"acq_time" : {
  "Description" : "Date/time of the scan",
  "Shift" : "PerSubject"
},

this should very well be possible.

I like this solution very much (albeit I still need to understand the meaning of the value @Moo-Marc proposed!), and adopting the spec accordingly would still be backward-compatible with existing anonymized datasets (i.e. if scans.json doesn't exist, one could check whether acq_time is pre-1925 to conclude the dataset is anonymized).

What do you all think about this? If we can reach a consensus, I would start drafting a proposal for a BIDS standard amendment.

robertoostenveld commented 4 years ago

You can include additional columns in the scans.tsv as I believe is the case in every other BIDS tsv file. From the specification:

Additional fields can include ...

where you should interpret "fields" as "columns". Column headings must be in snake_case (whereas in the json files BIDS uses CamelCase) and the columns can be explained in more detail in a json file that accompanies the tsv.

So the _scans.tsv could be

filename                                          acq_time            acq_time_shift
anat/sub-04_ses-01_T1w.nii.gz                     1800-05-21T11:18:59 true
func/sub-04_ses-01_task-nback_run-01_bold.nii.gz  1800-05-21T11:23:59 true
func/sub-04_ses-01_task-nback_run-02_bold.nii.gz  1800-05-21T11:38:59 true
func/sub-04_ses-01_task-rest_bold.nii.gz          1800-05-21T11:53:59 true

and the corresponding _scans.json

{
  "filename": {
      "Description": "filename corresponding to the scan"
  },
  "acq_time": {
      "Description": "time of acquisition of the scan, formatted according to RFC3339"
  },
  "acq_time_shift": {
      "Description": "whether the acquisition time has been shifted for anonymization purposes",
      "Levels": {
        "true",
        "false"
    }
  } 
}

Although you could also drop the acq_time_shift column from the tsv and document the columns in _scans.json like this

{
  "filename": {
      "Description": "filename corresponding to the scan"
  },
  "acq_time": {
      "Description": "date and time of acquisition, formatted according to RFC3339 and shifted for anonymization purposes"
  }
}

Using the inheritance principle, you could choose to put a single _scans.json at the top level.

hoechenberger commented 4 years ago

@robertoostenveld

So the _scans.tsv could be

filename                                          acq_time            acq_time_shift
anat/sub-04_ses-01_T1w.nii.gz                     1800-05-21T11:18:59 true
func/sub-04_ses-01_task-nback_run-01_bold.nii.gz  1800-05-21T11:23:59 true
func/sub-04_ses-01_task-nback_run-02_bold.nii.gz  1800-05-21T11:38:59 true
func/sub-04_ses-01_task-rest_bold.nii.gz          1800-05-21T11:53:59 true

and the corresponding _scans.json

{
  "filename": {
      "Description": "filename corresponding to the scan"
  },
  "acq_time": {
      "Description": "time of acquisition of the scan, formatted according to RFC3339"
  },
  "acq_time_shift": {
      "Description": "whether the acquisition time has been shifted for anonymization purposes",
      "Levels": {
        "true",
        "false"
    }
  } 
}

I like this. 👌

agramfort commented 4 years ago

solution via _scans.tsv is ok for me. 2 remarks:

sappelhoff commented 4 years ago

the time should be possibly up to microsecond precision.

that will be possible starting with the next release, see latest spec:

Date time information MUST be expressed in the following format YYYY-MM-DDThh:mm:ss[.000000][Z] (year, month, day, hour (24h), minute, second, optional fractional seconds, and optional UTC time indicator). This is almost equivalent to the RFC3339 "date-time" format, with the exception that UTC indicator Z is optional and non-zero UTC offsets are not indicated. If Z is not indicated, time zone is always assumed to be the local time of the dataset viewer. No specific precision is required for fractional seconds, but the precision SHOULD be consistent across the dataset. For example 2009-06-15T13:45:30.

The validator has already been adjusted

hoechenberger commented 4 years ago

@agramfort

  • it really means that we "trust" the authors of the data to have actually anonymized if they scans say so. It means it becomes impossible to guarantee from the files

The current specs say:

Dates that are shifted for anonymization purposes should be set to a year 1925 or earlier to clearly distinguish them from unmodified data. Shifting dates is RECOMMENDED, but not required.

So it's already only a SHOULD, not a MUST. I believe it's only an implementation detail in MNE-BIDS that we enforce pre-1925 dates for anonymization.

Therefore, I believe we already have to trust users to do the right thing 👍 and the changes proposed here would actually improve the situation.

I think we could just keep this sentence, capitalize "SHOULD", and your issue would be addressed, right?

Dates that are shifted for anonymization purposes SHOULD be set to a year 1925 or earlier ...

agramfort commented 4 years ago

all good then on my side

sappelhoff commented 4 years ago

I think we could just keep this sentence, capitalize "SHOULD", and your issue would be addressed, right?

see: 9b1ae95b51c1921ec58023c057b016a35a6d37d8

hoechenberger commented 4 years ago

see: 9b1ae95

Great, thanks!

Moo-Marc commented 4 years ago

We could have a few values: AcqTimeShift: None, Global, PerSubject, PerScan.

I'm a little confused. What would "Global" imply, and why do you think it would only be relevant in hyperscanning? What would be the difference between PerSubject and PerScan?

@hoechenberger These were meant to refer to the shift value itself, not simply whether they are shifted or not. Global means the shift is the same throughout the dataset, per subject means the same shift is applied within a subject, per scan means the shift is different for each file. So for one global constant shift, I see this as being useful only in hyperscanning where you may want to be able to confirm which participants were scanned together. In any other case, I don't see the use of knowing the relative dates between participants.

While the extra column idea of @robertoostenveld works, I think most cases won't have a mix of anonymized and non-anonymized scans. I therefore prefer a way to specify it globally, and something standardized, not just written in the acq_time field description.

adam2392 commented 3 years ago

So the _scans.tsv could be

filename                                          acq_time            acq_time_shift
anat/sub-04_ses-01_T1w.nii.gz                     1800-05-21T11:18:59 true
func/sub-04_ses-01_task-nback_run-01_bold.nii.gz  1800-05-21T11:23:59 true
func/sub-04_ses-01_task-nback_run-02_bold.nii.gz  1800-05-21T11:38:59 true
func/sub-04_ses-01_task-rest_bold.nii.gz          1800-05-21T11:53:59 true

and the corresponding _scans.json

{
  "filename": {
      "Description": "filename corresponding to the scan"
  },
  "acq_time": {
      "Description": "time of acquisition of the scan, formatted according to RFC3339"
  },
  "acq_time_shift": {
      "Description": "whether the acquisition time has been shifted for anonymization purposes",
      "Levels": {
        "true",
        "false"
    }
  } 
}

Adding to the chorus here that a solution along these lines is desirable.

Just FYI there is an issue with the 1925 also in EDF (not just FIF). Raised in #698, this issue also came up with EDF files, where the file format itself prevents anyone from shifting dates to < 1985.

Link here related chat with Teuniz (maintainer of EDFBrowser): https://gitlab.com/Teuniz/EDFbrowser/-/issues/26https://gitlab.com/Teuniz/EDFbrowser/-/issues/26

guiomar commented 3 years ago

Hi everyone!

Shall we create a PR to deal with this following the changes suggested by @robertoostenveld ?

I also like a lot the values suggested by @Moo-Marc, and they could be easily integrated in the suggestion

None, Global, PerSubject, PerScan.

Moo-Marc commented 3 years ago

I second that though unfortunately I wouldn't have the time to do it soon. I much prefer being explicit about having shifted dates. And while the suggestion is already BIDS compliant (I think), it would be best to standardize the name of the added column in _scans.tsv and its possible values.