Open WValenti opened 1 year ago
Actually, I can once again generate the REVID database pair which is completely de-identified, and perhaps it can be used for browsing given CAPS team approval. I'm doing the DIVER regeneration work right now anyhow, so it probably won't be difficult.
We're managing this by setting a policy such that no answers to any questions are made available until download time. Previously, cohort summary/review could reveal that info, but we fixed that with #118 - and got rid of the offending API endpoint entirely. There are no longer any other API endpoints intended to make PHI available.
There are two known exceptions - one current, one upcoming:
Barring that, tho, requiring a login to do a download (or view a pedigree drawing) should neatly solve the problem. The intention is to verify that once #166 is finished and only then close this bug; from that point forward, tracking potential PHI leaks is going to be less about a specific tracked bug and more a continued process of diligence w/r/t our API design.
Per discussions during 20240506 meeting, @WValenti has this taken care of now. We still don't have login-free browsing, but when we do it'll be all OK. So, closing.
Turns out there is still work to do (by weird circumstances, equivalence_groups summary fill slipped through), so I had some questions:

Bill: So...regarding PHI and DIVER for HPO. I'm starting to feel that we have 3 categories of users:

1) People walking in off the street with no vetting who get summary-level access. Does this level even exist?
2) People with a valid NRGR account but nothing more, so they get summary access to everything, but no detail access?
3) People with a valid NRGR account and authorized access to some or all collections, so they get summary access to everything, and to authorized collections in detail.

If category #1 exists, I feel like they should only get access to the re-deidentified database (REVID). That's all we allowed Digital Wave, after all.

Bill (3:36 PM): To rephrase, I guess I need a mapping of the types of authenticated access to what they are allowed to see, and the only types of authenticated access I believe might exist are: 1) nothing, not even an authenticated NRGR account, 2) authenticated NRGR account, but no collection access, and 3) authenticated NRGR account and some or all collection access.

VV: 2 and 3, yes as you described. I don’t think the distinction between 1 and 2 has anything to do with access to summary data. If there is a distinction, it would be between the ability to save one’s work and return to it in another session (category 2), or not (category 1). (And Jo I think you told me this had nothing to do with logins…) But frankly I don’t think we need to have a category 1, at least not right now. If you and Jim decide for a category 1 at a later date that will be fine with me but I still think it would be access to the same summary data as for category 2.
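For reference, the mapping Bill asks for could be sketched roughly as below. This is only an illustration of the three categories as discussed, not what DIVER actually does: the function name `allowed_views`, its parameters, and the `"summary"`/`"detail"` labels are all made up here, and it assumes category 1 (no NRGR account) gets nothing for now, since that tier is not needed right now per VV's reply.

```python
def allowed_views(has_nrgr_account, authorized_collections, collection):
    """Return the set of views a user may see for one collection.

    Hypothetical sketch of the categories discussed above:
      1) no NRGR account         -> nothing (tier not in use for now)
      2) NRGR account only       -> summary access to everything
      3) account + authorization -> summary everywhere, detail where authorized
    """
    if not has_nrgr_account:
        # Category 1: assumed to get no access until a decision is made.
        return set()
    views = {"summary"}  # Categories 2 and 3 both get summary access.
    if collection in authorized_collections:
        views.add("detail")  # Category 3, for authorized collections only.
    return views
```

If a category 1 is added later, only the first branch would need to change (for example, to point at REVID-backed summary data).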
This may have already been taken care of or not (would have to verify with @WValenti as I know he's done a lot of work on that recently), but it looks like we're tabling the anon browsing requirement altogether, so this can be considered done.
I'm going to re-open this as there needs to be an alternative solution. Currently I have added a flag field (potential_text_PHI) to the variables table that indicates if the variable is potentially PHI. The "text" bit was because I thought at first that only text variables would be a problem due to their unconstrained content, but the designation has been slightly expanded to include any information that Rutgers deems should be hidden from people without collection access. The flag will be added to the equivalence_groups table rows so the GUI can reference it. See MathematicalMedicine/CAPSDB#49 for progress.
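A minimal sketch of how the GUI side might consult the flag, assuming each equivalence_groups row arrives as a dict. Only `potential_text_PHI` comes from the description above; the `summary` key, the function name, and the `has_collection_access` parameter are hypothetical. Treating a missing flag as PHI is a fail-closed assumption of this sketch, not a documented behavior.

```python
def visible_summary(row, has_collection_access):
    """Return the summary fill for a row, or None if it must be hidden.

    `row` is a dict standing in for an equivalence_groups row. A missing
    potential_text_PHI flag is treated as True (hide by default).
    """
    flagged = bool(row.get("potential_text_PHI", True))
    if flagged and not has_collection_access:
        return None  # Hide potentially-PHI content from this user.
    return row.get("summary")
```

Defaulting to hidden means a row that somehow lacks the flag errs on the side of not leaking anything.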
I confess I are confuse. The prerequisite here is "if we do not require login, this is needed". We do not require login. Therefore that suggests it's not needed. Am I missing something?
In order to end my confusion every time I triage the bug list and see this, I'm repurposing this as a general PHI Display question issue. It is under active discussion, in fact, because some of the "masking" decisions made here are being looked at for use with regeneration of NRGR distribution files, and that may lead to some changes as to our policy since we're trying to keep all these policies in sync.
All text (answer_id 9) fields are PHI unless KNOWN to not be. All date fields need to be "fuzzed".
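The two rules above could look something like the following sketch. The thread does not say how dates should be fuzzed, so the random shift of up to 30 days is purely an assumption for illustration, as are the allowlist `KNOWN_SAFE_TEXT_VARIABLES`, its placeholder entry, and the function names. Only `answer_id 9` meaning "text" is taken from the rule itself.

```python
import datetime
import random

TEXT_ANSWER_ID = 9  # From the rule above: answer_id 9 marks text fields.

# Hypothetical allowlist of text variables KNOWN not to contain PHI.
KNOWN_SAFE_TEXT_VARIABLES = {"example_safe_variable"}


def fuzz_date(value, rng, max_days=30):
    """Shift a date by a random non-zero offset of up to max_days either way.

    The shifting scheme is an assumption; the actual fuzzing policy is not
    specified in this thread.
    """
    offsets = [d for d in range(-max_days, max_days + 1) if d != 0]
    return value + datetime.timedelta(days=rng.choice(offsets))


def mask_answer(variable_name, answer_id, value, rng):
    """Apply the masking rules to one answer value."""
    if isinstance(value, datetime.date):
        return fuzz_date(value, rng)  # All date fields get fuzzed.
    if answer_id == TEXT_ANSWER_ID and variable_name not in KNOWN_SAFE_TEXT_VARIABLES:
        return None  # Text is PHI unless known not to be.
    return value
```

Passing in the `random.Random` instance keeps the fuzzing reproducible when a fixed seed is wanted, e.g. for regenerating the same distribution files.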