MathematicalMedicine / diver-issues

Semipublic tracking of issues for the DIVER front end

dist_common's family_size column in the DIVER database doesn't match the number of individuals therein #265

Open Viqsi opened 1 month ago

Viqsi commented 1 month ago

This is something I ran into when trying to verify that I didn't break pedigree cohorts in the process of working on #236 and #261:

image

For the "minimum pedigree size" constraint, pedigree ascertainment checks the family_size column. Yet there are only four individuals present. Out of 69932 distinct fam_ids in dist_common, the overwhelming majority have the same number of individuals present that family_size would indicate; only 889 have a mismatch. Of those 889, only one family - 49-190-126 - has MORE individuals present; the rest have fewer. I was able to determine that with the following query: WITH baseline AS (SELECT fam_id, COUNT(ind_id) AS indcount, family_size FROM dist_common GROUP BY fam_id) SELECT * FROM baseline WHERE indcount != family_size;
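For anyone wanting to reproduce the check without access to the DIVER database, here is a minimal sketch of the same comparison run against an in-memory SQLite table. The table and column names (dist_common, fam_id, ind_id, family_size) come from the query above; everything else is an assumption. The rows are made-up toy data, the `find_mismatches` helper is hypothetical, and SQLite is only a stand-in, since the actual DIVER database engine may differ. The sketch also adds family_size to the GROUP BY, because selecting an ungrouped column is rejected by stricter SQL modes, and adds a `direction` column to separate "fewer" from "more".

```python
import sqlite3

# Hypothetical toy rows (fam_id, ind_id, family_size); NOT real DIVER data.
ROWS = [
    ("fam-A", "A1", 2), ("fam-A", "A2", 2),   # count matches family_size
    ("fam-B", "B1", 5), ("fam-B", "B2", 5),
    ("fam-B", "B3", 5), ("fam-B", "B4", 5),   # 4 present, 5 recorded
    ("fam-C", "C1", 1), ("fam-C", "C2", 1),   # 2 present, 1 recorded
]

# Same idea as the query in the issue, with family_size added to GROUP BY
# and a direction column to distinguish the two kinds of mismatch.
MISMATCH_SQL = """
WITH baseline AS (
    SELECT fam_id, COUNT(ind_id) AS indcount, family_size
    FROM dist_common
    GROUP BY fam_id, family_size
)
SELECT fam_id, indcount, family_size,
       CASE WHEN indcount < family_size THEN 'fewer' ELSE 'more' END AS direction
FROM baseline
WHERE indcount != family_size
ORDER BY fam_id;
"""


def find_mismatches(rows):
    """Load rows into a throwaway in-memory table and return mismatched families."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE dist_common (fam_id TEXT, ind_id TEXT, family_size INTEGER)"
        )
        conn.executemany("INSERT INTO dist_common VALUES (?, ?, ?)", rows)
        return conn.execute(MISMATCH_SQL).fetchall()
    finally:
        conn.close()


mismatches = find_mismatches(ROWS)
for row in mismatches:
    print(row)
```

One caveat with grouping on both columns: if a single fam_id ever carries more than one family_size value across its rows, it would be split into several groups and each would likely show up as a mismatch, which may itself be worth knowing about.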

@WValenti did some initial investigation and discovered that those family_size values are accurate in the DIGS database, but not in the DIVER database:

image

From here it evidently becomes a matter of rediscovering, in the DIVER DB generation scripts, why those individuals are left out going from DIGS to DIVER, and whether (and if so, how) family_size should be changed to match the actual total individual counts. That's in @WValenti's corner.