V64 getting duplicate ind_ids when building kibbleset

Viqsi commented 4 months ago

Stumbled across this entirely by chance. This makes the variable unusable in cohorts (or anything that uses cohorts internally), because we then get a "primary key duplicate" error trying to add the results to cohortInds.

@WValenti's initial working theory is that it has to do with some kind of "overlap" between interview variables and distribution variables - and since the fix for something like that involves a clinical decision, that would have to be discussed with collaborators.

WValenti commented 1 month ago

Our long-term plan is to make it OK for there to be duplicate ind_ids in cohorts because we want to support multiple differentiated values for any given variable for each individual. This would allow us to support longitudinal studies. Adding this support will, unfortunately take more time than we currently have available, so I'm looking into a short-term "fix" that avoids the problem without adversely affecting our long-term plans.

The first step is to be certainly that our current duplicates issues is what we suspect it is. Thus far, we have:

study variables that have intrinsic duplicates, such as the study that provided multiple versions of the interview instrument for a few individuals depending upon their level of progress and corrections made by the interviewers. These have theoretically been removed by use of an "interview_to_use" flag in DIGS_interviews, but it should be checked. CHECKED and we're good, no duplicates here.
name-harmonized variables where study variable duplication propagates out. This tier of harmonization will not introduce new problems by definition. WRONG - SEX should show up here. I should have found this problem when I originally standardized DI-PAD, as same-name-different-range warrants RENAMING in the raw.
equivalence grouped variables where name-harmonized variable duplication propagates out, OR where the list of variable_names equivalenced includes multiple representations of the same information for some ind_ids. There are currently 27 of those:

[ ] I10410 (=value) - components 3.0r7 DCOCTHINKCLEAR / PSMR both from tsid 16, sects J & K, pgs 79 / 96, PSMR should not be in group. variable_name_equivalence_pairs shows the association is a typo, so need to move PSMR to PSYCHOSIS_W__MANIA.
[x] I14720 (<>value) - components ADMIT_MED_HOSP_AFTER_NUM1 and i14720 are both from tsid 109. i14720 is ANY hospital type, so MED hosp variants need their own equivalence group, just like PSYCH hosp variants apparently already have. Maybe. Example is 24-110-0601027-01. This is 3.0/I. It's more nuanced...see the edit.
[ ] I1510 (=value) - two ind_ids from tsid 5 have both V185 and V216, so that's wrong. Holy crap. They answered both the in-person and telephone sections for the question. The harmonization is good. Need guidance.
[x] I1520 (<>value) - same as i1510, but this time they didn't score the same! See MathematicalMedicine/CAPSDB/issues/63.
[x] I17447 (<>value) - tsid 5 variables v2432 and v2433 are both in the same group. ASPD and ADHD. Since i17447 is ASPD, v2433 is in the wrong group and should be moved.
[x] I17611 (<>value) - global suspiciousness i17611 (sec. M, 0-6) and i17636 (sec. W, 0-4) are both legit and in different sections of the same interviews. Sigh. Guidance says split them from each other and maybe annotate somehow.
[x] I20070 (<>value) - NRGR_I20070 and NRGR_V101 are both present for many ind_ids. i20070 is head injury and v101 is angina or MI, so v101 and NRGR_V101 get split-out.
[x] I20370 (=value) - same as i14720, but with V191 and V224. See MathematicalMedicine/CAPSDB/issues/63.
[x] I20750 (<>value) - everyone has i20750 AND SEV_MANIA_NUM_MIXED_EPS in tsid 109. i20750 is severe mania mixed number of SYMPTOMS, while SEV_MANIA_NUM_MIXED_EPS is severe mania mixed number of EPISODES. Replace with SEV_MANIA___MIXED_SX_S in I20750 group.
[ ] I3420 (=value)
[x] I4540 (=value)
[x] I4550 (<>value)
[x] I8480 (<>value)
[x] I8760 (<>value)
[ ] NRGRVERSION (=value) NRGR
[x] SEX (<>value) NRGR_DIPAD03. Both are TSID 121, so DI-PAD SEX is redundant and not standardized values, so simply rename it out. Part of this will be in a different patch due to being in DIGS_raw.
[x] V1537 (<>value) NRGR_
[x] V1541 (<>value) NRGR_
[x] V1543 (<>value) NRGR_
[x] V2196 (<>value) NRGR_
[x] V2239 (<>value) NRGR_ - Jake's issue. Has both V2239 and NRGR_V2239 and should only have one. NRGR_V2239 was created to convert the true/false values of V2239 to the 0/1 values of all other (DIGS V3-related) variables in the group, so V2239 should be excluded (not just moved elsewhere).
[x] V535 (<>value)
[ ] V64 (=value)
[x] V701 (<>value)
[x] V838 (<>value) NRGR_
[ ] V9 (=value)
[x] V973 (<>value) NRGR_

...so 10 are NOT equal and NOT easy NRGR_ fixes. The rest are either easy or equal so no rush. Note this list includes V64, which is the first member of AGE, which is the variable that was reported for this issue.

DIVER UNIONs of variable_names that combine multiple variables representing the same underlying information for some ind_ids.

Viqsi commented 1 month ago

Just so it's on the record somewhere other than my work diary - the "unable to preview download" error Jake's been running into related to this issue is because V2239 (one of thise identified above) is part of the variable set he's trying to download. So resolution of that would make his test pass.

WValenti commented 1 month ago

DATABASE UPDATES for this problem:

UPDATE equivalence_groups SET first_member = 'ANSYMMANY' WHERE first_member = 'V2239' AND variable_name <> 'V2239'; -- NOTE THAT changing variable_name_equivalence_pairs is not a simple process anymore because of things like this, so I want to say that we no longer generate equivalence_groups from variable_name_equivalence_pairs.

Viqsi commented 1 month ago

Changing the milestone on this one since while we can't solve the whole problem right away, we're trying to at least get substantial parts of it before final release since it's bitten our release candidate testers.

Viqsi commented 1 month ago

Switching this back to Second Public Release, as the immediate stuff (or at least the biggest step towards same) is tracked in #259.

To sum up: per discussion with @WValenti (and he can feel free to clarify further as needed) there are three classes of issue here, described here as best I can recollect: 1) Situations in which we end up with multiple instances of an individual with identical values for each instance. (This is being addressed in #259). 2) Situations in which we have harmonization problems, where variables that should not have been harmonized were done so in error, and in some cases lead to multiple instances of an individual with distinct (or distinct-appearing) values. 3) Situations that involve genuine longitudinal data, in which there are multiple instances of an individual with distinct values, and both are valid for different times and/or situations.

For example, the "V2239" problem that Jake ran into was an example of case 2 above. There were two values that were actually semantically the same but encoded differently; Bill had created a "NRGR_V2239" to fix the encoding issue so that V2239's information could still be found in the equivalence group/Fully Harmonized Variable, but then the original "raw" V2239 (with the original, different encoding) was also inadvertently inclued in the equivalence group. He removed it and All Was Well - but for that variable only, and these things have to be gone through one-at-a-time, and not all of them are going to be that "easily" resolved.

Case 3 above, in particular, is a really thorny problem, and the "ideal fix" - support for longitudinal data - is potentially years away. It's an enhancement we absolutely want to add someday, of course, but for now we don't have it, and any decision on which value should ultimately "win" is arguably technically a group discussion, potentially on a variable-by-variable basis. And that by itself could possibly take weeks.

So arguably what we need to do is get a complete handle on the scope of the problem, and then decide how to proceed for the release. For the time being, we're dealing with the common case (case 1 above) in #259, as we believe the number of variables in cases 2 and 3 are relatively few.

(For the record - and I've also noted this in #259 - @WValenti was in favor of an approach that would mask all these cases in the meantime, so things would continue to appear to work without any crashes, with some known documented caveats. We didn't do that as I objected because of concerns about "side effects" that he - and I hope/assume I'm representing his perspective on this accurately - considered extremely unlikely. Said approach might be revisited if this turns out to be even nastier, but I'd want to have group buy-in first.)

Viqsi commented 1 month ago

Forgot to mention in the above - @WValenti is planning on going through those variables and fixing them where they can be quickly fixed, so we're not just "leaving it at that". :)

WValenti commented 1 month ago

SUMMARY, ESTIMATE

There are around 27 fully-harmonized variables which have multiple values for the same individual. This causes DIVER to behave abnormally (crash) and prevents the creation of cohorts, downloads, etc. These duplications come from a variety of sources and require differing approaches and amounts of time to repair. One of them (thus far) needs group input (INPUT REQUIRED!), as it is a case in all DIGS variants -- global suspiciousness i17611 (sec. M, 0-6) and i17636 (sec. W, 0-4). These variables can be split apart so they are not harmonized, but this is probably not what the group wants to do.

I estimate having these resolved in all development and testing databases and ready for testing by Oct 28th.

WValenti commented 1 month ago

Earlier I omitted looking at name-harmonized variables because I was "sure" that I'd gotten all of the harmonization kinks worked out of those when I did standardization, but...no. This is a scripted survey as well and here are the problems:

[x] NRGR_RECENTLY_MANICDAYS - horrible case of custom NRGR script days.sh having bad input value for TSID 2.
[ ] NRGR_I3420 - from i3420 above, so not really separate.
[x] OTH_FUNCTIONING_AFFECTED - 6 ind_id pairs from TSID 19 all with the same values. E.g. 52-280-3075-50135.
[x] SEV_FUNCTIONING_AFFECTED - 150 ind_id pairs from TSID 19 MANY WITH DIFFERENT VALUES?!! E.g. 52-280-3015-50042.
[x] IND_ID_NUM - remove from equivalence_groups (and place in equivalence_groups_PTPHI) - in db_edit-XX.sh
[ ] SITE_ID - 361K ind_id pairs from TSIDs 1 and 2 and beyond, all duplicated values are the same. E.g. 1-24-240523-10265.
[ ] V3 - 7,072 ind_id pairs from TSID 5 with identical value of 'DIGS' (instrument) - e.g. 40-170-1035-001. Junk variable.
[x] SEV_FUNCTIONING_COMMENTED - 146 like SEV_FUNCTIONING_AFFECTED, and same example. Prolly identical but fewer answered. (V535)
[x] AGE - This is a component of V64 earlier. 3940 ind_id pairs from TSID 18 values are the same. E.g. 7-80-00701-20030.
[x] OTH_FUNCTIONING_COMMENTED 12 IND_ID PAIRS FROM TSID 19 with different values, same example as OTH_FUNCTIONING_AFFECTED. (v701).
[ ] DATE_OF_INTERVIEW - 3430 IND_ID PAIRS from TSID 1 with same values, e.g. 1-29-290289-10948. (V9)
[ ] VISDAY - TSIDs 137 and 138, same values, e.g. 17-76-023056-023056. (This is because of SCID-I and SCID-II redundancy)

Many of these are in the meta sections and are therefore not a concern. I'll annotate them.

WValenti commented 1 month ago

I believe all of the situations where the duplicate values were NOT EQUAL have been handled now, so for the time being, we can rely upon the SELECT DISTINCT to protect DIVER from the others (EQUAL ones).

I STILL have to reconstruct counts, but don't want to do that until Rutgers agrees that the duplicates issue is fixed.

Viqsi commented 1 month ago

Have verified that the "known troublemakers" in Rutgers RC testing (V64, V2239, I17447 per #263) all now let us create AECs just fine with @WValenti's db_edit 72 (currently on grace):

(Screenshot for I17447 is in a comment on that issue).

I think this is largely fixed and potentially could be closed once db_edit 72 is made available for collaborator testing, although will let that be @WValenti's call. (I just really really like putting closed issue tracker numbers on the Just Deployed On DIVER doc... ;) )

WValenti commented 1 month ago

Jo blesses this, so into DIVER it goes.

WValenti commented 1 month ago

Actually, I shouldn't close this until we've done the regeneration of counts, which should only happen when the changes have been vetted by Jake as well.

WValenti commented 1 month ago

db_edit-72 has been performed on d.m.o.

MathematicalMedicine / diver-issues

V64 getting duplicate ind_ids when building kibbleset #248