Open Viqsi opened 4 months ago
Our long-term plan is to make it OK for there to be duplicate ind_ids in cohorts because we want to support multiple differentiated values for any given variable for each individual. This would allow us to support longitudinal studies. Adding this support will, unfortunately take more time than we currently have available, so I'm looking into a short-term "fix" that avoids the problem without adversely affecting our long-term plans.
The first step is to be certainly that our current duplicates issues is what we suspect it is. Thus far, we have:
...so 10 are NOT equal and NOT easy NRGR_ fixes. The rest are either easy or equal so no rush. Note this list includes V64, which is the first member of AGE, which is the variable that was reported for this issue.
Just so it's on the record somewhere other than my work diary - the "unable to preview download" error Jake's been running into related to this issue is because V2239 (one of thise identified above) is part of the variable set he's trying to download. So resolution of that would make his test pass.
DATABASE UPDATES for this problem:
UPDATE equivalence_groups SET first_member = 'ANSYMMANY' WHERE first_member = 'V2239' AND variable_name <> 'V2239'; -- NOTE THAT changing variable_name_equivalence_pairs is not a simple process anymore because of things like this, so I want to say that we no longer generate equivalence_groups from variable_name_equivalence_pairs.
Changing the milestone on this one since while we can't solve the whole problem right away, we're trying to at least get substantial parts of it before final release since it's bitten our release candidate testers.
Switching this back to Second Public Release, as the immediate stuff (or at least the biggest step towards same) is tracked in #259.
To sum up: per discussion with @WValenti (and he can feel free to clarify further as needed) there are three classes of issue here, described here as best I can recollect: 1) Situations in which we end up with multiple instances of an individual with identical values for each instance. (This is being addressed in #259). 2) Situations in which we have harmonization problems, where variables that should not have been harmonized were done so in error, and in some cases lead to multiple instances of an individual with distinct (or distinct-appearing) values. 3) Situations that involve genuine longitudinal data, in which there are multiple instances of an individual with distinct values, and both are valid for different times and/or situations.
For example, the "V2239" problem that Jake ran into was an example of case 2 above. There were two values that were actually semantically the same but encoded differently; Bill had created a "NRGR_V2239" to fix the encoding issue so that V2239's information could still be found in the equivalence group/Fully Harmonized Variable, but then the original "raw" V2239 (with the original, different encoding) was also inadvertently inclued in the equivalence group. He removed it and All Was Well - but for that variable only, and these things have to be gone through one-at-a-time, and not all of them are going to be that "easily" resolved.
Case 3 above, in particular, is a really thorny problem, and the "ideal fix" - support for longitudinal data - is potentially years away. It's an enhancement we absolutely want to add someday, of course, but for now we don't have it, and any decision on which value should ultimately "win" is arguably technically a group discussion, potentially on a variable-by-variable basis. And that by itself could possibly take weeks.
So arguably what we need to do is get a complete handle on the scope of the problem, and then decide how to proceed for the release. For the time being, we're dealing with the common case (case 1 above) in #259, as we believe the number of variables in cases 2 and 3 are relatively few.
(For the record - and I've also noted this in #259 - @WValenti was in favor of an approach that would mask all these cases in the meantime, so things would continue to appear to work without any crashes, with some known documented caveats. We didn't do that as I objected because of concerns about "side effects" that he - and I hope/assume I'm representing his perspective on this accurately - considered extremely unlikely. Said approach might be revisited if this turns out to be even nastier, but I'd want to have group buy-in first.)
Forgot to mention in the above - @WValenti is planning on going through those variables and fixing them where they can be quickly fixed, so we're not just "leaving it at that". :)
SUMMARY, ESTIMATE
There are around 27 fully-harmonized variables which have multiple values for the same individual. This causes DIVER to behave abnormally (crash) and prevents the creation of cohorts, downloads, etc. These duplications come from a variety of sources and require differing approaches and amounts of time to repair. One of them (thus far) needs group input (INPUT REQUIRED!), as it is a case in all DIGS variants -- global suspiciousness i17611 (sec. M, 0-6) and i17636 (sec. W, 0-4). These variables can be split apart so they are not harmonized, but this is probably not what the group wants to do.
I estimate having these resolved in all development and testing databases and ready for testing by Oct 28th.
Earlier I omitted looking at name-harmonized variables because I was "sure" that I'd gotten all of the harmonization kinks worked out of those when I did standardization, but...no. This is a scripted survey as well and here are the problems:
Many of these are in the meta sections and are therefore not a concern. I'll annotate them.
I believe all of the situations where the duplicate values were NOT EQUAL have been handled now, so for the time being, we can rely upon the SELECT DISTINCT to protect DIVER from the others (EQUAL ones).
I STILL have to reconstruct counts, but don't want to do that until Rutgers agrees that the duplicates issue is fixed.
Have verified that the "known troublemakers" in Rutgers RC testing (V64, V2239, I17447 per #263) all now let us create AECs just fine with @WValenti's db_edit 72 (currently on grace):
(Screenshot for I17447 is in a comment on that issue).
I think this is largely fixed and potentially could be closed once db_edit 72 is made available for collaborator testing, although will let that be @WValenti's call. (I just really really like putting closed issue tracker numbers on the Just Deployed On DIVER doc... ;) )
Jo blesses this, so into DIVER it goes.
Actually, I shouldn't close this until we've done the regeneration of counts, which should only happen when the changes have been vetted by Jake as well.
db_edit-72 has been performed on d.m.o.
Stumbled across this entirely by chance. This makes the variable unusable in cohorts (or anything that uses cohorts internally), because we then get a "primary key duplicate" error trying to add the results to cohortInds.
@WValenti's initial working theory is that it has to do with some kind of "overlap" between interview variables and distribution variables - and since the fix for something like that involves a clinical decision, that would have to be discussed with collaborators.