VariantEffect / mavedb-api

MaveDB API
GNU Affero General Public License v3.0

MaveHGVS compliance for nonsense mutations #202

Closed jstone-uw closed 1 month ago

jstone-uw commented 1 month ago

I've confirmed that the current validation code rejects HGVS strings containing *.

jstone-uw commented 1 month ago

In the current database, there are 48,722 variants in 6 published score sets, plus 731 variants in 4 unpublished score sets, whose hgvs_pro string contains an asterisk.

select * from variants v, scoresets ss
where
  (v.hgvs_pro like '%*]%' or v.hgvs_pro like '%*;%' or v.hgvs_pro like '%*' or v.hgvs_pro like '%.*%')
  and v.scoreset_id=ss.id
  and ss.published_date is not null;

There is also one variant (urn:mavedb:00000062-a-1#107) that uses the asterisk in a different way: p.Asn234Thrfs*5. This looks invalid to me and may be a typo. The query below finds asterisk uses that do not match the patterns above.

select * from variants v, scoresets ss
where
  v.hgvs_pro like '%*%'
  and not (v.hgvs_pro like '%*]%' or v.hgvs_pro like '%*;%' or v.hgvs_pro like '%*' or v.hgvs_pro like '%.*%')
  and v.scoreset_id=ss.id
  and ss.published_date is not null;

I haven't spotted variants using single-character amino acid codes, but a full re-validation of existing variant strings might be worthwhile.

The rest of the score sets correctly use Ter in hgvs_pro strings. Valid asterisks are present in hgvs_nt and hgvs_splice strings.
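
As a starting point for the re-validation pass mentioned above, here is a minimal sketch in Python. It only flags hgvs_pro strings that contain an asterisk or what looks like a single-character amino acid code directly followed by a position. The function name `flag_hgvs_pro` and the regular expression are assumptions made for illustration. They are not taken from the mavedb-api validation code, and this is a rough screen rather than a full MaveHGVS check.

```python
import re

# Hypothetical heuristic: an uppercase letter that is not preceded by another
# letter and is immediately followed by a digit, e.g. the "Q" in "p.Q12*".
# Three-letter codes such as "Gln12" do not match, because the uppercase
# letter is followed by lowercase letters, not a digit.
SINGLE_LETTER_RE = re.compile(r"(?<![A-Za-z])[A-Z](?=\d)")


def flag_hgvs_pro(hgvs_pro: str) -> list[str]:
    """Return a list of reasons this hgvs_pro string deserves a closer look."""
    flags = []
    if "*" in hgvs_pro:
        flags.append("asterisk")
    if SINGLE_LETTER_RE.search(hgvs_pro):
        flags.append("single_letter_code")
    return flags


if __name__ == "__main__":
    for s in ["p.Gln12Ter", "p.Gln12*", "p.Q12*", "p.Asn234Thrfs*5"]:
        print(s, flag_hgvs_pro(s))
```

Anything flagged this way would still need to go through the real validator. The sketch is only meant to narrow down which rows to look at first.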

jstone-uw commented 1 month ago

I propose we correct this manually by running:

update variants v
set hgvs_pro=replace(hgvs_pro, '*', 'Ter')
where
  v.hgvs_pro like '%*]%' or v.hgvs_pro like '%*;%' or v.hgvs_pro like '%*' or v.hgvs_pro like '%.*%';

This has been run on the staging server and affected 49,453 rows as expected.
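
One way to make the "as expected" check explicit is to run the update inside a transaction and commit only if the affected row count matches. The sketch below does this in Python against an in-memory SQLite table with a few made-up rows. SQLite, the toy data, and the helper name `run_checked_update` are all assumptions for illustration; the real database and tooling may differ.

```python
import sqlite3

# Same statement as proposed above, without the table alias.
UPDATE_SQL = """
update variants
set hgvs_pro = replace(hgvs_pro, '*', 'Ter')
where
  hgvs_pro like '%*]%' or hgvs_pro like '%*;%'
  or hgvs_pro like '%*' or hgvs_pro like '%.*%'
"""


def run_checked_update(conn: sqlite3.Connection, expected_rows: int) -> bool:
    """Apply the update; commit only if the affected row count matches."""
    cur = conn.execute(UPDATE_SQL)
    if cur.rowcount != expected_rows:
        conn.rollback()  # leave the table untouched on a mismatch
        return False
    conn.commit()
    return True


# Hypothetical rows standing in for the real variants table.
conn = sqlite3.connect(":memory:")
conn.execute("create table variants (id integer primary key, hgvs_pro text)")
conn.executemany(
    "insert into variants (hgvs_pro) values (?)",
    [("p.Gln12*",), ("p.[Gln12*;Ala3Val]",), ("p.Asn234Thrfs*5",), ("p.Gln12Ter",)],
)
conn.commit()

ok = run_checked_update(conn, expected_rows=2)
rows = [r[0] for r in conn.execute("select hgvs_pro from variants order by id")]
print(ok, rows)
```

In this toy data the frameshift-style string p.Asn234Thrfs*5 is not matched by any of the four patterns, so it is left alone. Note that replace() rewrites every asterisk in a matched row, so it may be worth checking whether any matched multi-variant string also contains an asterisk used in some other way.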

We can then edit the odd variant (typo?) urn:mavedb:00000062-a-1#107 manually after determining what it should be.