Closed by apriltuesday 1 year ago
@M-casado
I've checked that you added code to increase the granularity of counts from CMAT, but do we have a sample of the output?
The raw output is really messy, partly because there's now quite a lot of information being provided (most of which won't make it into the paper) and partly because I didn't spend any time making it pretty... Here are the results from the run I did on Friday, i.e. before the obsolete and child-of changes:
Overall counts (RCVs):
total 2975570
has_supported_measure 2972241
has_supported_trait 1913163
both_measure_and_trait 1909872
Gene annotations:
Total = 2972241
Category Count Percent F1 Score
exact_match 2559703 86.1% 1.00
cmat_superset 42035 1.4% 0.67
cmat_subset 273710 9.2% 0.63
divergent_match 3913 0.1% 0.46
mismatch 2978 0.1% 0.00
-->match 2879361 96.9% 0.96
-->both_present 2882339 97.0% 0.96
cv_missing 350 0.0% 0.00
cmat_missing 76160 2.6% 0.00
both_missing 13392 0.5% 0.00
Functional consequences:
Total = 2972241
Category Count Percent F1 Score
exact_match 2129268 71.6% 1.00
cmat_superset 0 0.0% 0.00
cmat_subset 753421 25.3% 0.65
divergent_match 0 0.0% 0.00
mismatch 0 0.0% 0.00
-->match 2882689 97.0% 0.91
-->both_present 2882689 97.0% 0.91
cv_missing 0 0.0% 0.00
cmat_missing 2788 0.1% 0.00
both_missing 86764 2.9% 0.00
By variant type:
Simple (genes):
Total = 2880084
Category Count Percent F1 Score
exact_match 2557424 88.8% 1.00
cmat_superset 42010 1.5% 0.67
cmat_subset 273438 9.5% 0.63
divergent_match 3909 0.1% 0.46
mismatch 2977 0.1% 0.00
-->match 2876781 99.9% 0.96
-->both_present 2879758 100.0% 0.96
cv_missing 326 0.0% 0.00
cmat_missing 0 0.0% 0.00
both_missing 0 0.0% 0.00
Simple (consequences):
Total = 2880084
Category Count Percent F1 Score
exact_match 2128158 73.9% 1.00
cmat_superset 0 0.0% 0.00
cmat_subset 751926 26.1% 0.65
divergent_match 0 0.0% 0.00
mismatch 0 0.0% 0.00
-->match 2880084 100.0% 0.91
-->both_present 2880084 100.0% 0.91
cv_missing 0 0.0% 0.00
cmat_missing 0 0.0% 0.00
both_missing 0 0.0% 0.00
Repeat (genes):
Total = 1657
Category Count Percent F1 Score
exact_match 1587 95.8% 1.00
cmat_superset 0 0.0% 0.00
cmat_subset 70 4.2% 0.64
divergent_match 0 0.0% 0.00
mismatch 0 0.0% 0.00
-->match 1657 100.0% 0.98
-->both_present 1657 100.0% 0.98
cv_missing 0 0.0% 0.00
cmat_missing 0 0.0% 0.00
both_missing 0 0.0% 0.00
Repeat (consequences):
Total = 1657
Category Count Percent F1 Score
exact_match 202 12.2% 1.00
cmat_superset 0 0.0% 0.00
cmat_subset 1455 87.8% 0.63
divergent_match 0 0.0% 0.00
mismatch 0 0.0% 0.00
-->match 1657 100.0% 0.68
-->both_present 1657 100.0% 0.68
cv_missing 0 0.0% 0.00
cmat_missing 0 0.0% 0.00
both_missing 0 0.0% 0.00
Complex (genes):
Total = 948
Category Count Percent F1 Score
exact_match 692 73.0% 1.00
cmat_superset 25 2.6% 0.67
cmat_subset 202 21.3% 0.66
divergent_match 4 0.4% 0.49
mismatch 1 0.1% 0.00
-->match 923 97.4% 0.91
-->both_present 924 97.5% 0.91
cv_missing 24 2.5% 0.00
cmat_missing 0 0.0% 0.00
both_missing 0 0.0% 0.00
Complex (consequences):
Total = 948
Category Count Percent F1 Score
exact_match 908 95.8% 1.00
cmat_superset 0 0.0% 0.00
cmat_subset 40 4.2% 0.56
divergent_match 0 0.0% 0.00
mismatch 0 0.0% 0.00
-->match 948 100.0% 0.98
-->both_present 948 100.0% 0.98
cv_missing 0 0.0% 0.00
cmat_missing 0 0.0% 0.00
both_missing 0 0.0% 0.00
Trait mappings:
Total = 2212567
Category Count Percent F1 Score
exact_match 995420 45.0% 1.00
cmat_superset 109 0.0% 0.80
cmat_subset 551990 24.9% 0.65
divergent_match 532 0.0% 0.50
mismatch 192012 8.7% 0.00
-->match 1548051 70.0% 0.88
-->both_present 1740063 78.6% 0.78
cv_missing 440068 19.9% 0.00
cmat_missing 13000 0.6% 0.00
both_missing 19436 0.9% 0.00
Obsolete terms:
cv_total 3368084
cmat_total 2405626
cv_obsolete 871376
cmat_obsolete 1579
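For readers trying to interpret the category names and F1 columns above, here is a minimal sketch of how two annotation sets for one record could be compared. It is not the code from this PR: the function names `f1_score` and `categorise`, and the reading of `cv_missing` as "ClinVar has no value", are assumptions inferred from the labels in the output.

```python
def f1_score(cv: set, cmat: set) -> float:
    """F1 between the ClinVar set and the CMAT set (hypothetical helper).

    F1 is symmetric in precision and recall, so the result does not
    depend on which side is treated as the reference.
    """
    overlap = len(cv & cmat)
    if overlap == 0:
        # Covers disjoint sets and the case where either set is empty.
        return 0.0
    precision = overlap / len(cmat)
    recall = overlap / len(cv)
    return 2 * precision * recall / (precision + recall)


def categorise(cv: set, cmat: set) -> str:
    """Assign one of the category labels seen in the output (assumed semantics)."""
    if not cv and not cmat:
        return "both_missing"
    if not cv:
        return "cv_missing"
    if not cmat:
        return "cmat_missing"
    if cv == cmat:
        return "exact_match"
    if cmat > cv:   # CMAT has everything ClinVar has, plus extras
        return "cmat_superset"
    if cmat < cv:   # CMAT has only part of what ClinVar has
        return "cmat_subset"
    if cv & cmat:   # some overlap, but each side has unique items
        return "divergent_match"
    return "mismatch"


cv_genes = {"BRCA1"}
cmat_genes = {"BRCA1", "BRCA2"}
print(categorise(cv_genes, cmat_genes), round(f1_score(cv_genes, cmat_genes), 2))
```

Under these assumptions a record where CMAT adds one extra gene to a single ClinVar gene lands in `cmat_superset` with F1 of about 0.67, which is at least consistent with the averages reported for that category.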
@apriltuesday, thanks for the sample metrics. I actually just wanted the raw output, so I'm glad you didn't spend time making it pretty; it's only for comparing the code against the output 👍
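When comparing the code against the output, one quick consistency check is to recompute the aggregate rows from the per-category rows. The sketch below uses the "Gene annotations" numbers from the run above. It assumes that `-->match` sums the four overlapping categories, that `-->both_present` adds `mismatch`, and that the aggregate F1 is a count-weighted mean of the per-category values; these are inferences from the numbers, not confirmed by the PR.

```python
# Counts and per-category F1 copied from the "Gene annotations" table.
counts = {
    "exact_match": 2559703,
    "cmat_superset": 42035,
    "cmat_subset": 273710,
    "divergent_match": 3913,
    "mismatch": 2978,
    "cv_missing": 350,
    "cmat_missing": 76160,
    "both_missing": 13392,
}
category_f1 = {
    "exact_match": 1.00,
    "cmat_superset": 0.67,
    "cmat_subset": 0.63,
    "divergent_match": 0.46,
    "mismatch": 0.00,
}

# Assumed definitions of the aggregate rows.
MATCH = ["exact_match", "cmat_superset", "cmat_subset", "divergent_match"]
BOTH_PRESENT = MATCH + ["mismatch"]

total = sum(counts.values())
match = sum(counts[c] for c in MATCH)
both_present = sum(counts[c] for c in BOTH_PRESENT)

# Count-weighted mean F1 over the categories in each aggregate.
weighted_f1_sum = sum(counts[c] * category_f1[c] for c in BOTH_PRESENT)
match_f1 = weighted_f1_sum / match            # mismatch contributes 0 to the sum
both_present_f1 = weighted_f1_sum / both_present

print(f"total        {total}")
print(f"match        {match} {100 * match / total:.1f}% {match_f1:.2f}")
print(f"both_present {both_present} {100 * both_present / total:.1f}% {both_present_f1:.2f}")
```

With these inputs the recomputed totals, percentages, and F1 values agree with the reported `-->match` and `-->both_present` rows to the precision shown, so the same check could be repeated for the other tables.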
Merge after #390. See here for the relevant diff (now rebased).
Includes counts for the following: