Summarize data sharing requirements

surchs commented 3 years ago

Get a checklist for data sharing of HCP derivatives

Since all data to be shared are derived from HCP data, we must abide by their regulations. Let's first identify all the steps we need to take to be allowed to share the data.

What we want is a checklist for sharing data derived from the HCP dataset that covers all necessary steps. So the task is to either:

[ ] find a checklist with all required steps to share derivative HCP data (maybe somebody in the lab has already done that / it exists on their website?)

OR, if that doesn't exist, to create this HCP specific checklist:

[x] find and roughly summarize all relevant documentation (e.g. here, and here) for HCP data sharing terms
[x] find any requirements on additional de-identification / exclusion of data for privacy protection, e.g. here
[x] find any requirements on the availability or licensing of HCP derivative data (do they require specific licensing terms or acknowledgements?)
[ ] compile that info into a checklist that somebody without your knowledge could follow
[ ] show this checklist to someone in the lab to give feedback on

Hopefully, all of this should be a very straightforward and easy process. And when completed, others in the lab will be happy to save some time and gain clarity, and that's nice too. If anything about this turns out to be tricky, please write that below!

Optional stuff that would also be great

Now since you are already doing this and since you want to submit to eLife (I think), it would make sense to take a look at the eLife policies for the "Availability of Data, Software, and Research Materials" and

[x] check if anything in there isn't covered by our data sharing plan so far
[x] if so, add it to the checklist under a subsection eLife / separate checklist ...

corinnerobert commented 3 years ago

I have a question about "institution specific regulations". I saw in the HCP open access data use terms that we need to comply with our institution specific regulations (ethics committee and so on), but I don't know where to find this information. As I recall from the Éthique de la recherche avec des êtres humains (EPTC 2) course from Canada's government, it's not super clear whether open access data like the HCP's requires ethics committee approval, and I don't know if my institution requires it for this type of data either.

surchs commented 3 years ago

That's an important point. The data usage agreement (that you have linked) has already been submitted (most likely by Mallar) and signed (by Mallar and probably also you). That's how you got access to the data, particularly the restricted data. As part of this data usage agreement, there is guaranteed to have been some type of Douglas/McGill IRB approval that your research project falls under. This is still a good thing to clarify because you will have to reference it pretty much everywhere along the publication process (e.g. in the methods section of your paper, during the submission process, for the data sharing...).

So, short answer: The ethics/IRB approval is guaranteed to already be done. It's a good idea to get the details (e.g. a reference number?). Most likely Mallar would have the answer to this.

corinnerobert commented 3 years ago

I have another question. So if I understand correctly, we cannot share subjects' IDs, so we need to make study-specific ones. Although what we want to share (the MRI derived maps) is not part of the restricted data, we would still make subject specific IDs since the other stuff (input matrices) uses restricted data, right? Also, except for family structure, we cannot share any restricted data, including exact age (or age within a five-year range), so I was wondering: as the subjects are ordered by age in the input matrices, is that a problem?

edit: it should be fine if we don't specifically identify the subjects with our study-specific IDs in the matrices right?

surchs commented 3 years ago

we would still make subject specific IDs

Yes, I think so. To me the use of the study specific IDs is:

for the "general data release": we name the subject level files with our study specific IDs and then create this mapping to the real HCP IDs that's only accessible to other HCP users. This ensures authorized users can actually link our data with other things like phenotypic info they have
for the "paper data release": we don't really need the IDs here for most steps because at this point there aren't any subject-specific files (i.e. no one-file-per-subject). But we still need to share the ordered list of our study specific IDs as they are ordered in these group-level files so readers could theoretically get the restricted behavioural data and run the analysis step if they want.
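The mapping described above could be generated along these lines. This is only a minimal sketch under assumptions: the `sub-0001` naming scheme, the function names, and the CSV column names are hypothetical, and the example IDs are placeholders rather than real HCP subject IDs. The resulting table is the part that should only be accessible to authorized HCP users.

```python
import csv
import io
import random


def make_study_ids(hcp_ids, seed=None, prefix="sub-"):
    """Assign each HCP ID a study-specific ID (hypothetical naming scheme)."""
    if len(set(hcp_ids)) != len(hcp_ids):
        raise ValueError("HCP IDs must be unique")
    # Shuffle a copy so the study ID numbering does not follow the input order.
    shuffled = list(hcp_ids)
    random.Random(seed).shuffle(shuffled)
    width = max(4, len(str(len(shuffled))))
    return {hcp: f"{prefix}{i:0{width}d}" for i, hcp in enumerate(shuffled, start=1)}


def write_mapping(mapping, handle):
    """Write the HCP-ID-to-study-ID table as CSV (to be kept access-restricted)."""
    writer = csv.writer(handle)
    writer.writerow(["hcp_id", "study_id"])
    for hcp_id, study_id in sorted(mapping.items()):
        writer.writerow([hcp_id, study_id])


# Placeholder IDs for illustration only; these are not real HCP subject IDs.
example_ids = ["AAA", "BBB", "CCC"]
mapping = make_study_ids(example_ids, seed=0)

buffer = io.StringIO()
write_mapping(mapping, buffer)
```

Subject-level files would then be named with `mapping[hcp_id]`, and only the ordered list of study-specific IDs would go into the "paper data release".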

This reminds me: we actually need to make sure that all intermediate / aggregate files in our "paper data release" are ordered the same way. I guess that's both a "data curation" and a "check the code" problem. Maybe we should add that to issue #6
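A quick way to check the ordering could look like the sketch below. It assumes each intermediate / aggregate file can be reduced to the list of study-specific IDs in the order its rows or columns appear; the file names and the function name are hypothetical.

```python
def find_order_mismatches(reference_ids, id_lists):
    """Return the names of files whose subject order differs from the reference."""
    reference = list(reference_ids)
    return [name for name, ids in id_lists.items() if list(ids) != reference]


# Reference order of study-specific IDs (placeholder values).
reference = ["sub-0002", "sub-0001", "sub-0003"]

# Subject order as read from each group-level file (hypothetical file names).
id_lists = {
    "input_matrix": ["sub-0002", "sub-0001", "sub-0003"],
    "component_weights": ["sub-0001", "sub-0002", "sub-0003"],
}

mismatches = find_order_mismatches(reference, id_lists)
```

An empty `mismatches` list would mean every file follows the reference order; here `component_weights` would be flagged.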

since the other stuff (input matrices) uses restricted data, right?

No, I think the input matrices do not contain restricted data. See here.

so I was wondering as the subjects are ordered by age in the input matrices, is that a problem?

No, you aren't reporting exact age - which would be restricted information. The fact that they are ordered by age isn't a problem. The imaging data itself is open access and properly defaced but you still ~~need to make sure what restrictions HCP has for sharing derivative data~~ (let me know if that isn't easily found out from their site, then we'll have to think again).

edit: it should be fine if we don't specifically identify the subjects with our study-specific IDs in the matrices right?

No, you can share the study specific IDs - that's what they are there for (see here). Because of the mapping you will create, only people who already have access to HCP will be able to link your study specific IDs to the "HCP IDs", for example to run the behavioural association analysis.

edit: sorry, I lost track of what issue I am replying to... so:

you are not going to be sharing any restricted data under the current data sharing plan
once we know what constraints HCP gives us for sharing the derivative data (derived from the open-access imaging data), we'll know how to handle that part too.

all good!

corinnerobert commented 3 years ago

edit: it should be fine if we don't specifically identify the subjects with our study-specific IDs in the matrices right?

I was asking this because here it says that: "If I publish data analyzed using Additional Restricted Data elements (including handedness, exact age, ethnicity, race, body weight, and all other types listed in section A.2), each reported analysis must be based on at least 3 subjects, and the presentation of the data must not reveal the study-specific subject ID associated with any particular data point or value."

But as our matrices don't contain restricted data, it is fine to associate the study-specific IDs with each column vector. Hypothetically, if each vector in the input matrices were associated with an exact age, then we wouldn't be allowed to also attach the study-specific IDs to the vectors.

surchs commented 3 years ago

Yeah ok, I see the point. I don't really get this section either. But since you don't release any information on individuals other than the imaging data which is open access, I think you are good.

The way I read this section is: if my published results (e.g. a table with values or values in the text) are based on restricted data elements, then what I report must be group results across at least 3 subjects and I cannot have my study-specific IDs linked to an individual data point.

So I think they want to say that you can't do something like: "Here I have reported on 3 subjects with low IQ, let's call them A, B, and C. Also, subject A is male and between 10 and 15 years old". This would presumably allow someone to deduce who subject A is, based only on the open data. Not super clear either.

But at the very least, they don't prevent you from sharing your "study-specific IDs" linked to the imaging data - because that would pretty much defeat the entire point of study specific IDs, I think.
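The "at least 3 subjects" part of the quoted rule is the one piece that can be checked mechanically before publishing anything based on restricted elements. A minimal sketch, assuming each reported analysis can be described by one group label per subject (the labels and the function name are hypothetical, and this does not cover the part of the rule about study-specific IDs):

```python
from collections import Counter


def groups_below_minimum(group_labels, min_subjects=3):
    """Return groups that have fewer subjects than the required minimum."""
    counts = Counter(group_labels)
    return {group: n for group, n in counts.items() if n < min_subjects}


# One made-up group label per subject, for illustration only.
labels = ["left", "right", "right", "right", "left", "ambidextrous"]
too_small = groups_below_minimum(labels)
```

Any group that shows up in `too_small` would need to be merged with another group or left out of the reported results.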

The shortcut to all this is: have Mallar OK the data release plan once you are happy with it. His OK is your due diligence!

corinnerobert commented 3 years ago

The shortcut to all this is: have Mallar OK the data release plan once you are happy with it. His OK is your due diligence!

I talked to Mallar about this, and we decided to email the HCP to get some clarification about the study-specific IDs. So I'm just waiting on their reply, and then we should be good to go.

corinnerobert / striatum_micro_nmf

Summarize data sharing requirements #5
