Closed eisenachM closed 4 years ago
Hi @eisenachM, thanks for your comments.
I understand the point that the data we want to collect can weaken anonymization. However, I see some issues here that we should discuss first.
The data we want to collect (sex, age, ethnicity) are captured for all individuals in most of the omics repositories (GEO, ArrayExpress, ENA). The institute should define protocols to ask for the data.
If you read most of the manuscripts out there, you will notice that the authors always provide this information to characterize the human samples. This metadata is crucial to understanding the experimental design.
About the notice in the specification: I think it is the responsibility of the archives and repositories, not of the format specification, to define what can and cannot lead to de-anonymization. I know multiple efforts are running within ProteomeXchange to define those guidelines.
Input would be nice here @mvaudel @jgriss @levitsky @bigbio/collaborators .
Yes, I agree that it is important to signal that wherever it is possible to publicize these donor annotations, they absolutely should be annotated.
For ArrayExpress submissions, we've seen an increase in people providing the information at the sample level after we marked the categories as mandatory in the submission template. We do, however, allow the annotations to be left blank, so we are not enforcing it. If the submitters are not allowed to share the details, they can still submit without major hurdles.
We also provide a disclaimer that it is the submitter's responsibility to ensure that consent is granted for sharing data and metadata openly. This is indeed not next to the submission template but in the general archive submission guidelines.
Hi,
We conduct (and publish) quite a few biomedical studies. As mentioned by @ypriverol it is generally required to publish the cohort's properties. I do not agree with @eisenachM that this simplifies de-anonymisation. With internal knowledge (in our case access to the hospital's patient management system) it is often possible to de-anonymize patients. Therefore, everyone with access to this privileged information is bound by a confidentiality agreement.
In your case, @eisenachM, it's about matching data from a supplementary table to the raw files. Is that correct? Thereby, this data is already publicly available. If you argue that this is not possible, then it's questionable whether the public deposition of the raw data is helpful since you cannot relate it to the samples.
Kind regards, Johannes
Hi!
I agree that the cohort's properties have to be described in an article, but classically that is done in an aggregated, anonymized manner (mean age in the experimental groups, counts per gender, etc.).
(Under the GDPR, individual non-anonymized publication of health data is not part of the allowed “research” purpose; it is permitted only in absolute exceptions with strong preconditions fulfilled.)
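The aggregated reporting described above (mean age per experimental group, counts per gender) can be sketched as follows. This is only a minimal illustration: the donor records, the group names, and the `aggregate` helper are hypothetical and not taken from any dataset discussed in this thread.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical per-donor records; all values are made up for illustration.
donors = [
    {"group": "case", "age": 61, "sex": "female"},
    {"group": "case", "age": 55, "sex": "male"},
    {"group": "control", "age": 48, "sex": "female"},
    {"group": "control", "age": 52, "sex": "female"},
]


def aggregate(records):
    """Return per-group size, mean age, and sex counts (no per-donor rows)."""
    by_group = defaultdict(list)
    for rec in records:
        by_group[rec["group"]].append(rec)

    summary = {}
    for group, recs in by_group.items():
        summary[group] = {
            "n": len(recs),
            "mean_age": mean(r["age"] for r in recs),
            "sex_counts": dict(Counter(r["sex"] for r in recs)),
        }
    return summary


summary = aggregate(donors)
for group, stats in summary.items():
    print(group, stats)
```

Note that very small groups can still be identifying even after aggregation, so a summary like this reduces the risk rather than removing it.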
Our need for individual information comes, of course, from our wish to reanalyze data for research aims other than the originally intended one (often more on a meta-level). I know this conflict from my work on an ethics committee and as a data privacy officer.
In the supplementary table of the article for the dataset I annotated (PXD009203, PubMed: 30770125) we characterized the cohort in more than an aggregated manner, but did not assign raw file names to the properties (in fact, the sample index numbers are shuffled relative to the sample name indices of the raw files). If I had filled in the SDRF table completely (with my knowledge as co-author), I would have made that assignment public for the first time.
A raw file and its results may contain “genetic” information (“single amino acid polymorphisms”, SAPs) that the patient has not agreed to submit. At the time of consent it may not be known what impact an SAP may have. It may even reveal the genetics of relatives (not only children) (https://dl.acm.org/doi/10.1145/2508859.2516707).
Not only health workers may breach anonymity: relatives, friends, or Facebook followers of the patient who know of his or her participation in a study can also de-anonymize the data using non-aggregated knowledge (age, gender).
I do not want to block the columns in the format, but I want submitters to be made aware of the risk, and I do not want the columns to be mandatory. anjaf reported a kind of compromise for the latter: allowing mandatory fields to be left blank. I went even further in my annotation of PXD009203, stating “anonymized information content entity”.
As Yasset said, I also agree that the hint should come from the repository, not from the format.
Bye
Martin
From: Johannes Griss notifications@github.com Sent: Thursday, 13 August 2020 20:36 To: bigbio/proteomics-metadata-standard proteomics-metadata-standard@noreply.github.com Cc: eisenachM martin.eisenacher@rub.de; Mention mention@noreply.github.com Subject: Re: [bigbio/proteomics-metadata-standard] characteristics[age] and characteristics[sex] columns vs. anonymization (#409)
Hi all,
In my view, this is not an issue related to the format. Some of the columns need to be mandatory in order to make the annotation useful and meaningful.
We should make it even clearer, as Anja mentioned, that it is the submitter's responsibility to ensure that consent is granted for sharing data and metadata openly. This is indeed not next to the submission template but in the general archive submission guidelines.
My guess is that what will happen is that these datasets will never get annotated to the same level of detail, and that's ok, because they are (or will be) subjected to different legal requirements we need to comply with. A second option is that the metadata (and possibly all the produced data) is only available in a controlled-access manner. But again, this is independent from the file format itself.
Thanks,
Juan
Hi,
Just two recent landmark proteomics studies as examples where this direct link between raw files and patient metadata is provided:
This is just to highlight that this level of annotation is available for several studies.
As pointed out by @javizca, it is the responsibility of the submitter to know what kind of information can be shared.
But from the data format's point of view, this is definitely the level of detail we want and need.
Kind regards, Johannes
Hi,
Like @jgriss we work with patient and cohort data and often face such situations where we are worried that the data we need to share to ensure scientific transparency and reproducibility compromises the privacy of participants in the study. Then, it is important to consider what is a reasonable risk, and whose responsibility it is. This is different for every cohort, and it is extremely important to check these things upfront and not on the day of the paper submission. As a general rule, you need the agreement of the cohort owner before disclosing anything.
Here are my two cents on the examples discussed:
Hope it helps,
Marc
Most of the topics were related to ProteomeXchange issues rather than the format. I will close the issue for now.
While annotating PXD009203 I nearly caused an accidental de-anonymization. In the paper the individuals are mostly statistically aggregated; nevertheless, supplemental table 1 characterizes the study cohort in detail. With internal knowledge (which we have in the institute) of the mapping between raw file and supplemental sample number, the raw files would be de-anonymized to study members (age, sex, smoker, previousBC), which the participants have not agreed to in their informed consent or data privacy forms. So I filled the term "AC=ICO:0000113;NT=anonymized information content entity" into all the cells to mark that the value is known but cannot be listed.
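As a rough sketch of the workaround described above, the snippet below overwrites selected donor columns of a tab-separated SDRF table with the anonymized term. The term string is the one quoted above. The set of columns to mask, the `mask_sdrf` helper, and the example row are assumptions for illustration only, not part of the specification.

```python
import csv
import io

# Term quoted in the comment above.
ANONYMIZED = "AC=ICO:0000113;NT=anonymized information content entity"

# Assumption: which columns count as sensitive is decided by the submitter.
SENSITIVE = {"characteristics[age]", "characteristics[sex]"}


def mask_sdrf(text, sensitive=SENSITIVE, term=ANONYMIZED):
    """Return the SDRF text with every cell of the sensitive columns masked."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    header = rows[0]
    # Work with column positions, since SDRF headers may repeat a name.
    idx = [i for i, name in enumerate(header)
           if name.strip().lower() in sensitive]

    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    for row in rows[1:]:
        for i in idx:
            if i < len(row):  # tolerate short rows
                row[i] = term
        writer.writerow(row)
    return out.getvalue()


# Hypothetical one-sample table, made up for illustration.
example = (
    "source name\tcharacteristics[age]\tcharacteristics[sex]\tcomment[data file]\n"
    "sample 1\t61Y\tfemale\trun1.raw\n"
)
masked = mask_sdrf(example)
print(masked)
```

Masking keeps the columns present, so a template that requires them can still be filled in, while signalling that the values exist but are withheld.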
Given the above, it could be discussed whether it is really good to have the age and sex columns mandatory in the human template. If so, the documentation should include a hint about the danger of de-anonymization and about data privacy consent.