Variable categorization expansion/review/revision needed

Viqsi commented 2 months ago

During use case testing, Jake ran into an issue in which the "greater than" and "less than" constriants were not available for a numeric range variable (I16580) when he looked into downcoding it. It turned out that the reason for this is that said variable is categorized as "null" rather than "numberrange" and so it got a different list of constraint options.

On review, about a third of the Fully Harmonized Variables in our Standard Variable View are "null categorized". Many of these, however (including I16580) have useful validtion regexes, and @WValenti believes that the bulk of these can be handled by classifying according to the description of those regexes.

HOWEVER, as part of unrelated work, we've also found out that the distribution files and AutoQC system have their own scheme of variable type categorization that is similar to but not the same as ours, and so it might be worthwhile to, when making a mass-correction, switch over to that. This would require some changes to diverweb (though not particularly extensive ones).

So while this is marked as a "Public Launch" item, it could end up going beyond that depending on work investment involved.

WValenti commented 2 months ago

ALTER TABLE answers ADD type ENUM('CATEGORICAL', 'DICHOTOMOUS', 'DATE', 'MULTICATEGORY', 'NUMBERRANGE', 'TEXT'); ALTER TABLE answers ADD response_list TEXT; ALTER TABLE answers ADD mapping_list TEXT; UPDATE answers SET type = 'CATEGORICAL' WHERE VRMed = 'N' AND interpretation LIKE '%choice of%'; SELECT answer_id, SUBSTRING(interpretation,1,60) as interpretation, VRMed, type, response_list, mapping_list FROM answers WHERE VRMed = 'N' AND interpretation LIKE '%choice of%'; SELECT answer_id, SUBSTRING(interpretation,1,60) as interpretation, VRMed, type, response_list, mapping_list FROM answers WHERE VRMed = 'N' AND interpretation LIKE '%choice of%' AND response_list IS NULL; UPDATE answers SET response_list = '["1","2","3","4","[UNK]"]' WHERE answer_id = 16; UPDATE answers SET response_list = '["0","1","2","3","[UNK]"]' WHERE answer_id = 18; UPDATE answers SET response_list = '["1","2","3","4","5","6","7","8","9","[UNK]"]' WHERE answer_id = 21; ... still working...

WValenti commented 2 months ago

The previous comment gives "optional" SQL because we plan on transferring the table around. This comment has A BUG FIX that needs to be applied explicitly: UPDATE DIGS_questions SET explicit_context_id = 1 WHERE explicit_context_id IS NULL AND version = 'SCID-5 CV'; UPDATE DIGS_questions SET question_id = 1 WHERE question_id IS NULL;

WValenti commented 2 months ago

More for the "process" that will produce tables we can propagate: CREATE TEMPORARY TABLE answers_from_VRMs AS SELECT DISTINCT a.answer_id, c.response_list, c.mapping_list FROM answers a, DIGS_instruments b, variable_response_mappings c WHERE a.answer_id = b.answer_id AND b.variable_name = c.variable_name AND a.response_list IS NULL; DELETE FROM answers_from_VRMs where answer_id = 77 AND response_list = '["F","M"]'; DELETE FROM answers_from_VRMs where answer_id = 9; Screenshot 2024-09-19 at 6 06 05 PM DELETE FROM answers_from_VRMs WHERE answer_id = 170 AND response_list = '["N","Y"]'; -- 101 is a more degenerate case and hard to work due to hundreds of variables of that answer_id, so for now... DELETE FROM answers_from_VRMs WHERE answer_id = 101; DELETE FROM answers_from_VRMs WHERE answer_id = 5; -- Overlooked it earlier, got it now UPDATE answers a, answers_from_VRMs b SET a.type = 'CATEGORICAL', a.response_list = b.response_list, a.mapping_list = b.mapping_list WHERE a.answer_id = b.answer_id; UPDATE answers SET mapping_list = response_list WHERE type = 'CATEGORICAL' AND mapping_list IS NULL; ...and the answers table on Debbie.DIGS seems ready to go.

WValenti commented 2 months ago

For variable_response_mappings later on... INSERT IGNORE INTO variable_response_mappings SELECT DISTINCT variable_name, type, response_list, mapping_list FROM answers a, DIGS_instruments b

For variables NOW: ALTER TABLE variables ADD answer_id INT UNSIGNED AFTER metavalue_id; UPDATE variables a, DIGS_table_sets_X_DIGS_versions b, DIGS_instruments c SET a.answer_id = c.answer_id WHERE a.table_set_id = b.table_set_id AND b.version = c.version AND a.variable_name = c.variable_name;

WValenti commented 2 months ago

Another MUST DO on all DIGS_raw databases: Turns out there are 7 rows that have never gotten dist_ind_ids (at least in my old download) and I need them for regen.

UPDATE study24_01AUG2008_scid2_073108_sas7bdat SET dist_ind_id = '110-06010-2302' WHERE id = '06-01-023-02'; UPDATE study24_01AUG2008_scid2_073108_sas7bdat SET dist_ind_id = '110-06020-1501' WHERE id = '06-02-015-01'; UPDATE study24_01AUG2008_scid2_073108_sas7bdat SET dist_ind_id = '110-06020-1502' WHERE id = '06-02-015-02'; UPDATE study24_01AUG2008_scid2_073108_sas7bdat SET dist_ind_id = '110-06040-4010' WHERE id = '06-04-040-10'; UPDATE study24_01AUG2008_scid2_073108_sas7bdat SET dist_ind_id = '110-06100-1802' WHERE id = '06-10-018-02'; UPDATE study24_01AUG2008_scid2_073108_sas7bdat SET dist_ind_id = '110-06130-0202' WHERE id = '06-13-002-02'; UPDATE study24_01AUG2008_scid2_073108_sas7bdat SET dist_ind_id = '110-06130-1902' WHERE id = '06-13-019-02';

WValenti commented 2 months ago

Oh, and: UPDATE study24_01AUG2008_scid_073108_sas7bdat SET dist_ind_id = '110-06010-2302' WHERE id = '06-01-023-02'; UPDATE study24_01AUG2008_scid_073108_sas7bdat SET dist_ind_id = '110-06020-1501' WHERE id = '06-02-015-01'; UPDATE study24_01AUG2008_scid_073108_sas7bdat SET dist_ind_id = '110-06020-1502' WHERE id = '06-02-015-02'; UPDATE study24_01AUG2008_scid_073108_sas7bdat SET dist_ind_id = '110-06040-4010' WHERE id = '06-04-040-10'; UPDATE study24_01AUG2008_scid_073108_sas7bdat SET dist_ind_id = '110-06100-1802' WHERE id = '06-10-018-02'; UPDATE study24_01AUG2008_scid_073108_sas7bdat SET dist_ind_id = '110-06130-0202' WHERE id = '06-13-002-02'; UPDATE study24_01AUG2008_scid_073108_sas7bdat SET dist_ind_id = '110-06130-1902' WHERE id = '06-13-019-02';

WValenti commented 2 months ago

Whoops, didn't mean to close...I'm getting there tho!

WValenti commented 2 months ago

More table monkeying for regeneration and therefore categorization... ALTER TABLE DIGS_table_sets ADD last_regeneration_date DATETIME DEFAULT NULL AFTER regenerated; UPDATE variables SET width = 5 WHERE variable_name = 'SEV_DEP_FUNCTION' AND table_id = 31; UPDATE variables SET width = 5 WHERE variable_name = 'AUDITORY_HALNS_EVER' AND table_id = 49; UPDATE variables SET width = 5 WHERE width = 1 AND table_id = 49;

WValenti commented 2 months ago

For the raw file for 76/77 (076_20140724_study76_77_724_sas7bdat), 101 individual_name values are not the same as their dist_ind_id values. The best example is 275-20167 which is in there twice as a dist_ind_id probably due to matching ruid MH0157661. One of the two has individual_id 275-20165 and the other 275-20167. Since processing is driven by dist_ind_id, there is NO KIBBLE for dist_ind_id 275-20165 (77-275-20167-20167) but there is for 275-20167 (77-275-20167-20167). There are 11 total records like this which effectively cause duplicate dist_ind_ids when trying to regenerate cleaned: Screenshot 2024-09-21 at 10 07 37 AM Checking each: 269-10333-02 duplicates, like those for 275-20167 have same RUID and no entry for 269-10331-02, so needs to have dist_ind_id set to individual_name Here's a general query showing duplication of RUID: SELECT individual_name, dist_ind_id, alternate_id, ruid, ruid_duplicate_1, ruid_duplicate_2 from 076_20140724_study76_77_724_sas7bdat where dist_ind_id IN (Select dist_ind_id from 076_20140724_study76_77_724_sas7bdat group by dist_ind_id HAVING COUNT(*) >1) order by dist_ind_id; Screenshot 2024-09-21 at 10 20 32 AM So we can see that in every case the duplicate that doesn't match it's individual_name needs to be changed to do so. This: SELECT dist_ind_id, COUNT(*) FROM DIGS_cleaned.kibble WHERE dist_ind_id IN (SELECT individual_name from 076_20140724_study76_77_724_sas7bdat where dist_ind_id IN (Select dist_ind_id from 076_20140724_study76_77_724_sas7bdat group by dist_ind_id HAVING COUNT(*) >1) AND individual_name <> dist_ind_id order by dist_ind_id); Screenshot 2024-09-21 at 10 24 11 AM WEIRDLY shows that of the 11, only 262-14330-03 made it into cleaned kibble, so maybe its duplicate didn't? NO, FSCK, they're both in there. Really not clear on why. OK, someone hand-fixed it (and it's gotta be me) because there is no record in the raw with dist_ind_id 262-14330-03. SO, I can fix these all in raw with: UPDATE 076_20140724_study76_77_724_sas7bdat SET dist_ind_id = individual_name where dist_ind_id IN (Select dist_ind_id from 076_20140724_study76_77_724_sas7bdat group by dist_ind_id HAVING COUNT(*) >1) AND individual_name <> dist_ind_id; Fixed 11 rows. Searched for dups again and found that correcting one from that list of 11 created a new one: Screenshot 2024-09-21 at 10 42 35 AM ...so: UPDATE 076_20140724_study76_77_724_sas7bdat SET dist_ind_id = individual_name WHERE individual_name = '262-14384';

AAAAND THAT TABLE LOADED!

WValenti commented 2 months ago

76_ALL_GPC_SZ_PROBAND_AND_FAMILY_2016_2_1v3_phen_UTF8 doesn't have dist_ind_id, but it does have ind_id, so I'm just adding it: ALTER TABLE 76_ALL_GPC_SZ_PROBAND_AND_FAMILY_2016_2_1v3_phen_UTF8 ADD dist_ind_id VARCHAR(15); UPDATE 76_ALL_GPC_SZ_PROBAND_AND_FAMILY_2016_2_1v3_phen_UTF8 SET dist_ind_id = ind_id; and ALTER TABLE study77_phen_v2_UTF8 ADD dist_ind_id VARCHAR(15); UPDATE study77_phen_v2_UTF8 SET dist_ind_id = ind_id; and ALTER TABLE ct_s76s77_aux_UTF8 ADD dist_ind_id VARCHAR(32); UPDATE ct_s76s77_aux_UTF8 SET dist_ind_id = ind_id;

UPDATE variables SET width = 6 WHERE variable_name = 'NUMBER_GRANDCHILDREN'; # No clue how this was missed UPDATE variables SET width = 10 WHERE variable_name = 'DRAW_DATE';

WValenti commented 2 months ago

Seems like every one is done that can be done without table size problems, so: ALTER TABLE DIGS_table_sets ADD comments text;

UPDATE variables SET width = 6 WHERE variable_name = 'i21580'; UPDATE variables SET width = 5 WHERE variable_name = 'i13810';

WValenti commented 1 month ago

CREATE TABLE test_variable_response_mappings LIKE variable_response_mappings; ALTER TABLE test_variable_response_mappings ADD answer_id int(10);

PROBLEM - multiple answer_ids for the same variable_name in DIGS_questions.

CREATE TEMPORARY TABLE DUPCHECK AS SELECT DISTINCT a.variable_name, b.type, SUBSTR(b.response_list,1,30) responses, SUBSTR(b.mapping_list,1,30) mapping, b.answer_id FROM DIGS_instruments a, answers b WHERE a.answer_id = b.answer_id ORDER BY a.variable_name; -- Cases where we have different answer_ids for the same variable_name: 191 CREATE TEMPORARY TABLE FIXAID AS SELECT COUNT(), variable_name FROM DUPCHECK GROUP BY variable_name HAVING COUNT() > 1 ORDER BY COUNT() DESC; ALTER TABLE FIXAID ADD answer_ids varchar(32); CREATE TEMPORARY TABLE AIDPAIRS AS SELECT a.variable_name, GROUP_CONCAT(DISTINCT b.answer_id) answers FROM FIXAID a, DIGS_instruments b WHERE a.variable_name = b.variable_name GROUP BY a.variable_name ORDER BY answers; UPDATE FIXAID a, AIDPAIRS b SET a.answer_ids = b.answers WHERE a.variable_name = b.variable_name; SELECT COUNT(), answer_ids from FIXAID GROUP BY answer_ids ORDER BY COUNT() DESC; CREATE TEMPORARY TABLE PREFERAID AS SELECT COUNT(), answer_ids from FIXAID GROUP BY answer_ids ORDER BY COUNT() DESC; -- Pairs of answer_ids to resolve in favor of one or the other: 33 ALTER TABLE PREFERAID ADD preferred_aid INT(10); -- Fixing each one. For example 45,147 answer_ids are integer/integer or decimal minutes, and all the affected variables are in minutes, so 147 is preferred: SELECT version, variable_name FROM DIGS_instruments WHERE variable_name IN (SELECT variable_name FROM AIDPAIRS WHERE answers = '45,147'); UPDATE PREFERAID SET preferred_aid = 147 WHERE answer_ids = '45,147'; 45,147 is 147 45,76 is 76 10,155 is 155 4,45 is 4 101,156 is complicated because DIGS 2.1 is "any mark" while all 3.0 variants are 0=no,1=recalls,2=over 3 mo. This "downcodes" to "any mark", but loses info, so this needs to be a separate issue for repair. In the meantime, is 101, which is our only dichotomous. So UPDATE answers SET type = 'DICHOTOMOUS' WHERE answer_id = 101; 39,131 is 131 Safety first, making the results table permanent: CREATE TABLE preferred_answer_ids AS SELECT FROM PREFERAID; 37,142 is 142 (not categorical) 11,145 is 145 (dumb numberrange not categorical) 10,144 is 144 (same as above) 6,8 doesn't matter as they're both fuzzydate 10,153 is 10 (meaningfully categorical) 1,145 is 145 (dumb numberrange not categorical) 16,132 is 132 (meaningfully categorical) 7,8 doesn't matter - fuzzydates 18,143 is 143 (numberrange, not categorical) 18,152 is 152 (categorical) 1,9 is 1 (categorical), and find why SEV_FUNCTIONING_AFFECTED_FOR_G and MJ_USE_PSYCH_PROBLEMS aren't findalble in DIVER 1,45 is 45 (actual count) find why SEDS_W_D___OF_SEIZURES has TWO questions! 10,130 is 130 (categorical) 1,4 is 4 (1 is just plain wrong) 45,89 is 45 (limit of 8 is meh) 36,147 is 147 (mentions minutes) 1,106 is 106 (no Y/N aspect) 17,146 is 17 (no sense in restriction) 99,100 is 99 (definitely has 0=diag) 1,101 is 1 (not sure how 101 got in there!) 45,94 is 45 (not weeks, but days) 13,88 is 13 (just a small count of not significant limit) 36,45 is 45 (not sure what a fractional suicide is all about) 16,45 is 45 45,88 is 45 just a count 17,151 is 151 because the question specifically limits to 41 43,77 is ignored because it's a harmonization problem in which DIPAD is the legit outlier. This needs it's own issue. -- MAKE the source (AIDPAIRS) permanent for history: CREATE TABLE answer_id_pairs AS SELECT * FROM AIDPAIRS; -- AND fix answer_ids: (check first) SELECT a.answer_id, c.preferred_aid, a.variable_name FROM DIGS_instruments a, answer_id_pairs b, preferred_answer_ids c WHERE a.variable_name = b.variable_name AND b.answers = c.answer_ids; (yup, then fix): UPDATE DIGS_instruments a, answer_id_pairs b, preferred_answer_ids c SET a.answer_id = c.preferred_aid WHERE a.variable_name = b.variable_name AND b.answers = c.answer_ids AND c.preferred_aid IS NOT NULL;

THIS MAKES ALL OF THE answer_ids for variables of the same name consistent (except where more work is needed as indicated within)

INSERT IGNORE INTO test_variable_response_mappings SELECT DISTINCT variable_name, type, response_list, mapping_list FROM answers a, DIGS_instruments b;

WValenti commented 1 month ago

All of the issues arising from the previous comment's work have been added, so now it is time to bring variables up-to-date and do that INSERT into test_variable_response_mappings.

We start with a happy answers table (done - has type and mappings and is even in production), and a happy DIGS_instruments table (done - has correct answer_id, BUT IS NOT IN PRODUCTION).

First make a happy variables table just like we did for DIGS_instruments...

CREATE TABLE test_variables AS SELECT * FROM variables;
UPDATE test_variables a, answer_id_pairs b, preferred_answer_ids c SET a.answer_id = c.preferred_aid WHERE a.source = 'interview' AND a.variable_name = b.variable_name AND b.answers = c.answer_ids AND c.preferred_aid IS NOT NULL;
mysql (192.168.43.28 DIGS)>SELECT DISTINCT a.variable_name, a.answer_id, b.answer_id FROM test_variables a, test_variables b WHERE a.source = 'interview' AND b.source = 'interview' AND a.variable_name = b.variable_name AND a.answer_id > b.answer_id;
+-------------------+-----------+-----------+
| variable_name     | answer_id | answer_id |
+-------------------+-----------+-----------+
| SEX               |        77 |        43 |
| I3740             |         8 |         6 |
| I4770             |         8 |         6 |
| I4960             |         8 |         6 |
| I5240             |         8 |         6 |
| I17432            |         8 |         6 |
| I20490            |         8 |         6 |
| DATE_OF_INTERVIEW |         8 |         7 |
| ENTERED_DATE      |         8 |         7 |
| LOGIC_CHECK_DATE  |         8 |         7 |
+-------------------+-----------+-----------+
10 rows in set (0.58 sec)

So we have a new issue about the 43,77 SEX bit, and the 6,8 / 7,8 bits are different date formats which have been fully value harmonized by standardization already. So all date answer_ids should be 157! UPDATE variables SET answer_id = 157 WHERE answer_id IN (6,7,8,15,20); UPDATE DIGS_instruments SET answer_id = 157 WHERE answer_id IN (6,7,8,15,20);

CREATE TABLE test_variable_response_mappings LIKE variable_response_mappings;
ALTER TABLE test_variable_response_mappings CHANGE variable_type variable_type enum('CATEGORICAL','DICHOTOMOUS','DATE','FUZZYDATE','MULTICATEGORY','NUMBERRANGE','TEXT');
INSERT IGNORE INTO test_variable_response_mappings SELECT DISTINCT a.variable_name, b.type, b.response_list, b.mapping_list FROM variables a, answers b WHERE
a.answer_id = b.answer_id AND a.source = 'name_harmonized';

-- Testing it... SELECT a., b. FROM variable_response_mappings a, test_variable_response_mappings b WHERE a.variable_name = b.variable_name AND a.variable_type <> b.variable_type; AMAZING!! Screenshot 2024-10-03 at 12 57 16 PM The only differences are few and correct!

ON TO TESTING!

WValenti commented 1 month ago

Forgot to regenerate variable_response_mappings_expanded. That's done now, but a bit clunky to apply as an update as it needs population by a shell script.

Viqsi commented 1 month ago

This has now been deployed to all servers and so I'm calling it done; any further data type concerns can be new issues.

MathematicalMedicine / diver-issues

Variable categorization expansion/review/revision needed #253

PROBLEM - multiple answer_ids for the same variable_name in DIGS_questions.