Manually alter data ingestion with a blocklist

jsstevenson commented 1 year ago

Some source data may contain human curation errors, or may otherwise produce normalized records that we're unhappy with. We should have some sort of manually maintained blocklist or allowlist to define tweaks to certain concepts.

We did a pretty simple version of this in https://github.com/cancervariants/therapy-normalization/pull/300. It's not necessary to follow the same approach to a T, but any improvements we make here should also be reflected there.

This is coming up in the context of the NCBI entry for NRAS, which includes KRAS as an alias as of 2023/01/24.

The provenance is unclear, but this is also reflected in the NRAS entries in DGIdb v4 and CIViC (they are presumably sourcing their aliases from the NCBI entry). Per @ahwagner, we will want to make sure this is fixed on our end so that it isn't propagated further downstream.
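For illustration, here is a minimal sketch of what a manually maintained blocklist could look like at ingestion time. Everything in it is an assumption rather than existing gene-normalization code: the `BlockedAlias` class, the `apply_blocklist` function, and the dict-shaped record with `symbol` and `aliases` keys are hypothetical names, and the approach in the therapy-normalization PR may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockedAlias:
    """One manually curated blocklist entry (hypothetical structure)."""

    source: str  # data source the entry applies to, e.g. "NCBI"
    symbol: str  # primary symbol of the affected record
    alias: str   # alias to drop during ingestion
    reason: str  # free-text justification, kept for provenance


# Hypothetical blocklist holding the case described in this issue.
BLOCKLIST = [
    BlockedAlias(
        source="NCBI",
        symbol="NRAS",
        alias="KRAS",
        reason="Alias matches the primary symbol of another gene",
    ),
]


def apply_blocklist(record: dict, source: str, blocklist=BLOCKLIST) -> dict:
    """Return a copy of `record` with blocklisted aliases removed."""
    blocked = {
        entry.alias.upper()
        for entry in blocklist
        if entry.source == source and entry.symbol == record["symbol"]
    }
    kept = [a for a in record.get("aliases", []) if a.upper() not in blocked]
    return {**record, "aliases": kept}
```

Calling `apply_blocklist({"symbol": "NRAS", "aliases": ["N-ras", "KRAS"]}, "NCBI")` would drop `KRAS` and keep the other alias (the alias list here is illustrative only). Keeping a `reason` on each entry is one way to record why a tweak was made.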

ahwagner commented 1 year ago

Here is what I think we should focus on, generally:

1. Identifying aliases that match primary gene symbols of other genes. (@anastasiasmith1221 this is a great question for you)
2. Tagging these aliases in gene normalizer. Users should be warned somehow.
3. Providing a mechanism by which we can manually review and approve / reject these as valid aliases, with provenance.
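
The three points above could be sketched roughly as follows. This is only an illustration under stated assumptions: `find_alias_conflicts`, `tag_alias`, `ReviewDecision`, and the dict-shaped records with `symbol` and `aliases` keys are hypothetical names, not part of gene normalizer's actual API or schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewDecision:
    """Outcome of a manual review, with provenance (hypothetical structure)."""

    approved: bool
    reviewer: str
    reviewed_on: str  # ISO 8601 date string
    note: str


def find_alias_conflicts(records):
    """Return (symbol, alias) pairs where the alias is another gene's primary symbol."""
    primary_symbols = {r["symbol"].upper() for r in records}
    conflicts = set()
    for r in records:
        own = r["symbol"].upper()
        for alias in r.get("aliases", []):
            if alias.upper() != own and alias.upper() in primary_symbols:
                conflicts.add((r["symbol"], alias))
    return conflicts


def tag_alias(symbol, alias, conflicts, decisions):
    """Attach a status, a warning, and any review decision to one alias."""
    if (symbol, alias) not in conflicts:
        return {"alias": alias, "status": "ok", "warning": None, "review": None}
    decision = decisions.get((symbol, alias))
    if decision is None:
        status = "needs_review"
    elif decision.approved:
        status = "approved"
    else:
        status = "rejected"
    return {
        "alias": alias,
        "status": status,
        "warning": f"'{alias}' is also the primary symbol of another gene",
        "review": decision,
    }
```

With records for NRAS (carrying KRAS as an alias) and KRAS, `find_alias_conflicts` would report the pair `("NRAS", "KRAS")`, and `tag_alias` would mark it `needs_review` until a `ReviewDecision` is recorded for it. Downstream consumers could then decide per status whether to remove, flag, or ignore the alias.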

Wherever we use gene normalizer (e.g. DGIdb), we should make policy choices on whether to remove, flag, or ignore the warnings from step 2. I think for DGIdb we may wish to flag.

github-actions[bot] commented 3 months ago

This issue is stale because it has been open 180 days with no activity. This issue will be closed if no further activity occurs in 14 days.

cancervariants / gene-normalization

Manually alter data ingestion with a blocklist #165