EBIvariation / CMAT

ClinVar Mapping and Annotation Toolkit
Apache License 2.0

Final counts for CMAT #394

Closed: apriltuesday closed this 1 year ago

apriltuesday commented 1 year ago

Merge after #390; see here for the relevant diff (now rebased).

Includes counts for the following:

apriltuesday commented 1 year ago

@M-casado

I've checked that you added code to increase the granularity of counts from CMAT, but do we have a sample of the output?

The raw output is really messy, partly because there's now quite a lot of information being provided (most of which won't make it into the paper) and partly because I didn't spend any time making it pretty... Here are the results from the run I did on Friday, i.e. before the obsolete and child-of changes:

Overall counts (RCVs):
total                  2975570
has_supported_measure  2972241
has_supported_trait    1913163
both_measure_and_trait 1909872

Gene annotations:
Total = 2972241
        Category    Count  Percent  F1 Score 
     exact_match  2559703    86.1%      1.00 
   cmat_superset    42035     1.4%      0.67 
     cmat_subset   273710     9.2%      0.63 
 divergent_match     3913     0.1%      0.46 
        mismatch     2978     0.1%      0.00 
        -->match  2879361    96.9%      0.96 
 -->both_present  2882339    97.0%      0.96 
      cv_missing      350     0.0%      0.00 
    cmat_missing    76160     2.6%      0.00 
    both_missing    13392     0.5%      0.00 
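
As a quick sanity check on the table above, the two `-->` rows look like running subtotals: `-->match` appears to be the sum of the four categories with any overlap, and `-->both_present` adds `mismatch` to that. This is inferred from the numbers, not from the CMAT code, so the sketch below is only an illustration under that assumption; the `percent` helper is a hypothetical name.

```python
# Counts copied from the "Gene annotations" table above.
counts = {
    "exact_match": 2559703,
    "cmat_superset": 42035,
    "cmat_subset": 273710,
    "divergent_match": 3913,
    "mismatch": 2978,
    "cv_missing": 350,
    "cmat_missing": 76160,
    "both_missing": 13392,
}
total = 2972241

# Assumption: "-->match" sums the categories where the two sources overlap.
match = sum(
    counts[k]
    for k in ("exact_match", "cmat_superset", "cmat_subset", "divergent_match")
)
# Assumption: "-->both_present" additionally includes outright mismatches.
both_present = match + counts["mismatch"]
# Everything else is a record where at least one source has no annotation.
missing = counts["cv_missing"] + counts["cmat_missing"] + counts["both_missing"]


def percent(n: int, denominator: int) -> str:
    """Format n as a percentage of denominator with one decimal place."""
    return f"{100 * n / denominator:.1f}%"


print("-->match       ", match, percent(match, total))
print("-->both_present", both_present, percent(both_present, total))
print("rows add up to total:", both_present + missing == total)
```

With these inputs the subtotals reproduce the 2879361 (96.9%) and 2882339 (97.0%) shown in the table, and all rows add up to the stated total.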

Functional consequences:
Total = 2972241
        Category    Count  Percent  F1 Score 
     exact_match  2129268    71.6%      1.00 
   cmat_superset        0     0.0%      0.00 
     cmat_subset   753421    25.3%      0.65 
 divergent_match        0     0.0%      0.00 
        mismatch        0     0.0%      0.00 
        -->match  2882689    97.0%      0.91 
 -->both_present  2882689    97.0%      0.91 
      cv_missing        0     0.0%      0.00 
    cmat_missing     2788     0.1%      0.00 
    both_missing    86764     2.9%      0.00 

By variant type:

    Simple (genes):
Total = 2880084
        Category    Count  Percent  F1 Score 
     exact_match  2557424    88.8%      1.00 
   cmat_superset    42010     1.5%      0.67 
     cmat_subset   273438     9.5%      0.63 
 divergent_match     3909     0.1%      0.46 
        mismatch     2977     0.1%      0.00 
        -->match  2876781    99.9%      0.96 
 -->both_present  2879758   100.0%      0.96 
      cv_missing      326     0.0%      0.00 
    cmat_missing        0     0.0%      0.00 
    both_missing        0     0.0%      0.00 

    Simple (consequences):
Total = 2880084
        Category    Count  Percent  F1 Score 
     exact_match  2128158    73.9%      1.00 
   cmat_superset        0     0.0%      0.00 
     cmat_subset   751926    26.1%      0.65 
 divergent_match        0     0.0%      0.00 
        mismatch        0     0.0%      0.00 
        -->match  2880084   100.0%      0.91 
 -->both_present  2880084   100.0%      0.91 
      cv_missing        0     0.0%      0.00 
    cmat_missing        0     0.0%      0.00 
    both_missing        0     0.0%      0.00 

    Repeat (genes):
Total = 1657
        Category  Count  Percent  F1 Score 
     exact_match   1587    95.8%      1.00 
   cmat_superset      0     0.0%      0.00 
     cmat_subset     70     4.2%      0.64 
 divergent_match      0     0.0%      0.00 
        mismatch      0     0.0%      0.00 
        -->match   1657   100.0%      0.98 
 -->both_present   1657   100.0%      0.98 
      cv_missing      0     0.0%      0.00 
    cmat_missing      0     0.0%      0.00 
    both_missing      0     0.0%      0.00 

    Repeat (consequences):
Total = 1657
        Category  Count  Percent  F1 Score 
     exact_match    202    12.2%      1.00 
   cmat_superset      0     0.0%      0.00 
     cmat_subset   1455    87.8%      0.63 
 divergent_match      0     0.0%      0.00 
        mismatch      0     0.0%      0.00 
        -->match   1657   100.0%      0.68 
 -->both_present   1657   100.0%      0.68 
      cv_missing      0     0.0%      0.00 
    cmat_missing      0     0.0%      0.00 
    both_missing      0     0.0%      0.00 

    Complex (genes):
Total = 948
        Category  Count  Percent  F1 Score 
     exact_match    692    73.0%      1.00 
   cmat_superset     25     2.6%      0.67 
     cmat_subset    202    21.3%      0.66 
 divergent_match      4     0.4%      0.49 
        mismatch      1     0.1%      0.00 
        -->match    923    97.4%      0.91 
 -->both_present    924    97.5%      0.91 
      cv_missing     24     2.5%      0.00 
    cmat_missing      0     0.0%      0.00 
    both_missing      0     0.0%      0.00 

    Complex (consequences):
Total = 948
        Category  Count  Percent  F1 Score 
     exact_match    908    95.8%      1.00 
   cmat_superset      0     0.0%      0.00 
     cmat_subset     40     4.2%      0.56 
 divergent_match      0     0.0%      0.00 
        mismatch      0     0.0%      0.00 
        -->match    948   100.0%      0.98 
 -->both_present    948   100.0%      0.98 
      cv_missing      0     0.0%      0.00 
    cmat_missing      0     0.0%      0.00 
    both_missing      0     0.0%      0.00 

Trait mappings:
Total = 2212567
        Category    Count  Percent  F1 Score 
     exact_match   995420    45.0%      1.00 
   cmat_superset      109     0.0%      0.80 
     cmat_subset   551990    24.9%      0.65 
 divergent_match      532     0.0%      0.50 
        mismatch   192012     8.7%      0.00 
        -->match  1548051    70.0%      0.88 
 -->both_present  1740063    78.6%      0.78 
      cv_missing   440068    19.9%      0.00 
    cmat_missing    13000     0.6%      0.00 
    both_missing    19436     0.9%      0.00 
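
For readers trying to relate the category names to the F1 column, here is a minimal sketch of how one record could be classified by comparing the ClinVar (`cv`) and CMAT (`cmat`) annotation sets. The category definitions are inferred from the names in the output, and treating ClinVar as the reference for precision and recall is an assumption; `classify` and `f1_score` are hypothetical names, not functions from the CMAT codebase.

```python
def classify(cv: set, cmat: set) -> str:
    """Label one record by comparing ClinVar (cv) and CMAT (cmat) sets."""
    if not cv and not cmat:
        return "both_missing"
    if not cv:
        return "cv_missing"
    if not cmat:
        return "cmat_missing"
    if cv == cmat:
        return "exact_match"
    if cmat > cv:  # proper superset: CMAT has everything in ClinVar and more
        return "cmat_superset"
    if cmat < cv:  # proper subset: CMAT has only part of what ClinVar has
        return "cmat_subset"
    if cv & cmat:  # some overlap, but each side has something extra
        return "divergent_match"
    return "mismatch"


def f1_score(cv: set, cmat: set) -> float:
    """F1 of cmat against cv, treating cv as the reference (an assumption)."""
    overlap = len(cv & cmat)
    if overlap == 0:
        return 0.0
    precision = overlap / len(cmat)
    recall = overlap / len(cv)
    return 2 * precision * recall / (precision + recall)


# Hypothetical term identifiers, for illustration only.
cv_terms = {"T1"}
cmat_terms = {"T1", "T2"}
print(classify(cv_terms, cmat_terms), round(f1_score(cv_terms, cmat_terms), 2))
# prints: cmat_superset 0.67
```

Under this reading, a per-category F1 in the tables would be an average over the records in that category, which would explain why `exact_match` is always 1.00 and `mismatch` is always 0.00.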

Obsolete terms:
cv_total      3368084
cmat_total    2405626
cv_obsolete   871376
cmat_obsolete 1579
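
The obsolete-term counts above are easier to compare as rates. A small sketch of that arithmetic, using only the four numbers in the block above (`obsolete_rate` is a hypothetical helper name):

```python
# Counts copied from the "Obsolete terms" block above.
cv_total, cv_obsolete = 3368084, 871376
cmat_total, cmat_obsolete = 2405626, 1579


def obsolete_rate(obsolete: int, total: int) -> float:
    """Return the share of obsolete terms as a percentage."""
    return 100 * obsolete / total


print(f"ClinVar: {obsolete_rate(cv_obsolete, cv_total):.2f}%")
print(f"CMAT:    {obsolete_rate(cmat_obsolete, cmat_total):.2f}%")
# prints:
# ClinVar: 25.87%
# CMAT:    0.07%
```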

M-casado commented 1 year ago

@apriltuesday, thanks for the sample metrics. I actually just wanted the raw output, so I'm glad you didn't spend time making it pretty; it's just to compare the code and the output 👍