characteristics[age] and characteristics[sex] columns vs. anonymization

eisenachM commented 4 years ago

While annotating PXD009203 I nearly caused an accidental de-anonymization. In the paper the individuals are mostly reported as statistical aggregates, but supplemental table 1 characterizes the study cohort in detail. Combined with internal knowledge (which we have in the institute) of the mapping between raw files and supplemental sample numbers, the annotation would de-anonymize the raw files to individual study members (age, sex, smoker, previousBC), which they have not agreed to in their informed consent or data privacy forms. So I filled the term "AC=ICO:0000113;NT=anonymized information content entity" into all the cells to mark that the information is known but cannot be listed.
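As a rough illustration of that workaround, here is a minimal Python sketch that overwrites the sensitive columns of a tab-delimited SDRF table with the placeholder term. The function name `mask_columns`, the sample row, and the file name `run01.raw` are made up for the example; only the placeholder string and the two column names come from this issue.

```python
import csv
import io

# Placeholder term quoted in this issue.
ANONYMIZED = "AC=ICO:0000113;NT=anonymized information content entity"
# Columns to mask; extend as needed (assumption: names are matched case-insensitively).
SENSITIVE = ("characteristics[age]", "characteristics[sex]")


def mask_columns(sdrf_text, columns=SENSITIVE, placeholder=ANONYMIZED):
    """Return the SDRF text with every cell of the given columns replaced."""
    rows = list(csv.reader(io.StringIO(sdrf_text), delimiter="\t"))
    header, body = rows[0], rows[1:]
    # Work with column indices so repeated column names are handled too.
    targets = [i for i, name in enumerate(header) if name.strip().lower() in columns]
    for row in body:
        for i in targets:
            if i < len(row):
                row[i] = placeholder
    out = io.StringIO()
    csv.writer(out, delimiter="\t", lineterminator="\n").writerows([header] + body)
    return out.getvalue()


# Hypothetical one-row table, not taken from PXD009203.
sample = (
    "source name\tcharacteristics[age]\tcharacteristics[sex]\tcomment[data file]\n"
    "sample 1\t54Y\tfemale\trun01.raw\n"
)
masked = mask_columns(sample)
```

Note that this only hides the values in the SDRF file; it does nothing about information already published elsewhere, such as a supplemental table.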

Given the above, it is worth discussing whether age and sex should really be mandatory columns in the human template. If so, the documentation should include a hint about the danger of de-anonymization and about data privacy consent.

ypriverol commented 4 years ago

Hi @eisenachM thanks for your comments:

I understand the point that the data we want to collect can weaken anonymization. However, I see some issues here that we should discuss first.

The data we want to collect (sex, age, ethnicity) are captured for all individuals in most omics repositories (GEO, ArrayExpress, ENA). The institute should define protocols to ask for the data.
If you read most of the manuscripts out there, you will notice that the authors always provide this information to characterize the human samples. This metadata is crucial to understand the experimental design.
About the notice in the specification, I think defining what can and cannot lead to de-anonymization is more a responsibility of the archives and repositories than of the format specification. I know multiple efforts are running within ProteomeXchange to define those guidelines.

Input would be nice here @mvaudel @jgriss @levitsky @bigbio/collaborators .

anjaf commented 4 years ago

Yes, I agree that it is important to signal that wherever it is possible to publicize these donor annotations, they absolutely should be annotated.

For ArrayExpress submissions, we've seen an increase in people providing the information at the sample level after we marked the categories as mandatory in the submission template. We do, however, allow the annotations to be left blank, so we are not enforcing it. If the submitters are not allowed to share the details, they can still submit without major hurdles.

We also provide a disclaimer that it is the submitter's responsibility to ensure that consent is granted for sharing data and metadata openly. This is indeed not next to the submission template but in the general archive submission guidelines.

jgriss commented 4 years ago

Hi,

We conduct (and publish) quite a few biomedical studies. As mentioned by @ypriverol it is generally required to publish the cohort's properties. I do not agree with @eisenachM that this simplifies de-anonymisation. With internal knowledge (in our case access to the hospital's patient management system) it is often possible to de-anonymize patients. Therefore, everyone with access to this privileged information is bound by a confidentiality agreement.

In your case, @eisenachM, it's about matching data from a supplementary table to the raw files. Is that correct? Thereby, this data is already publicly available. If you argue that this is not possible, then it's questionable whether the public deposition of the raw data is helpful since you cannot relate it to the samples.

Kind regards, Johannes

eisenachM commented 4 years ago

Hi!

I agree that the cohort’s properties have to be described in an article, but classically that is done in an aggregated, anonymized manner (mean age in the experimental groups, counts per gender, etc.). (Under the GDPR, individual non-anonymized publication of health data is not part of the allowed “research” purpose, only in absolute exceptions with strong preconditions fulfilled.)

Our need for individual information comes, of course, from our wish to reanalyze data for aims other than the originally intended research aim (often more on a meta-level). I know this conflict from my work in an ethics committee and as a data privacy officer.

In the supplementary table of the article for the dataset I annotated (PXD009203, PubMed: 30770125) we characterized the cohort in more than an aggregated manner, but did not assign raw file names to the properties (in fact, the sample index numbers are shuffled relative to the sample name indices of the raw files). If I had filled in the SDRF table completely (with my knowledge as co-author), that would have been the first time this mapping was made public.

A raw file or its results may contain “genetic” information (“single amino acid polymorphisms”, SAPs) that the patient has not agreed to submit. At the time of consent it may not be known what impact an SAP may have. It may even reveal genetics of relatives (not only children) (https://dl.acm.org/doi/10.1145/2508859.2516707).

Not only health workers may breach anonymity: relatives, friends, or Facebook followers of the patient who know about his or her participation in a study can de-anonymize given non-aggregated knowledge (age, gender).

I do not want to block the columns in the format, but I want submitters to be made aware of the risk, and I do not want the columns to be mandatory. anjaf reported a kind of compromise for the latter: allowing mandatory fields to be left blank. I went even further in my annotation of PXD009203, stating “anonymized information content entity”.

As Yasset said, I also agree that the hint need not come from the format but from the repository.

Bye

Martin

javizca commented 4 years ago

Hi all,

In my view, this is not an issue related to the format. Some of the columns need to be mandatory in order to make the annotation useful and meaningful.

We should make it even clearer, as Anja mentioned, that it is the submitter's responsibility to ensure that consent is granted for sharing data and metadata openly. This is indeed not next to the submission template but in the general archive submission guidelines.

My guess is that what will happen is that these datasets will never get annotated to the same level of detail, and that's ok, because they are (or will be) subjected to different legal requirements we need to comply with. A second option is that the metadata (and possibly all the produced data) is only available in a controlled-access manner. But again, this is independent from the file format itself.

Thanks,

Juan

jgriss commented 4 years ago

Hi,

Just two recent landmark proteomics studies as examples where this direct link between raw files and patient metadata is provided:

Jiang et al., Nature 2019
Harel et al., Cell 2019 - annotated as SDRF here

This is just to highlight that this level of annotation is available for several studies.

As pointed out by @javizca it is the responsibility of the submitter to know what kind of information can be shared.

But from the data format's point of view, this is definitely the level of detail we want and need.

Kind regards, Johannes

mvaudel commented 4 years ago

Hi,

Like @jgriss we work with patient and cohort data and often face such situations where we are worried that the data we need to share to ensure scientific transparency and reproducibility compromises the privacy of participants in the study. Then, it is important to consider what is a reasonable risk, and whose responsibility it is. This is different for every cohort, and it is extremely important to check these things upfront and not on the day of the paper submission. As a general rule, you need the agreement of the cohort owner before disclosing anything.

Here are my two cents on the examples discussed:

What if someone with deep access to cohort data uses this knowledge to cross-compare datasets? While cohorts generate random identifiers for each delivery, it is extremely easy to match patients using genotypes, phenotypes, imaging data, etc. But when you get access to the data you sign a binding agreement that you will not cross-reference these data and will not attempt to de-anonymize the samples. So you can assume that researchers with access to intimate data will not use it to de-anonymize or cross-compare datasets. If they do, it is a crime and you are not the one to blame.
What if a participant shares information that allows identifying their data based on the variables that you shared? Here again, you are not the one disclosing the "key" to the identification, and I would argue that the cohort did a pretty bad job at raising awareness of data privacy if a participant brags online about the uniqueness of their data.
What if age/sex/ancestry are variables that should not be shared according to the cohort guidelines/consent forms? As you point out, these should only be shared in the form of summary statistics. A handy tip to maintain the usefulness of these variables while ensuring anonymity is to adjust the granularity of your summary statistics. For example, you can use categories like "age 20-30, 30-40, 40-50, 50+" so that a patient is never singled out. Then, you can still use this variable, albeit with reduced power. And here again, the level of granularity of the summary statistics disclosed is something that the cohort owners should decide, not the researchers :-)
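To make the granularity tip concrete, here is a minimal Python sketch of such age binning. The band edges follow the example categories above; treating each band as half-open (so that 30 falls into "30-40"), the extra "<20" band, and the helper names `age_band` and `smallest_group` are assumptions for illustration. The ages are made up.

```python
from collections import Counter


def age_band(age, edges=(20, 30, 40, 50)):
    """Map an exact age to a coarse band such as '30-40' or '50+'."""
    if age < edges[0]:
        return f"<{edges[0]}"
    # Half-open intervals: lower edge included, upper edge excluded.
    for lower, upper in zip(edges, edges[1:]):
        if lower <= age < upper:
            return f"{lower}-{upper}"
    return f"{edges[-1]}+"


def smallest_group(bands):
    """Size of the rarest band; a value of 1 means someone is singled out."""
    return min(Counter(bands).values())


ages = [23, 27, 37, 31, 41, 48, 68, 52]  # made-up example values
bands = [age_band(a) for a in ages]
```

If `smallest_group(bands)` comes out too small for the cohort's rules, the bands can be widened by passing coarser `edges`; where that threshold lies is for the cohort owners to decide.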

Hope it helps,

Marc

ypriverol commented 4 years ago

Most of the topics were related to ProteomeXchange issues rather than to the format. I will close the issue for now.

bigbio / proteomics-sample-metadata

characteristics[age] and characteristics[sex] columns vs. anonymization #409