SuLab / scheduled-bots

GeneWiki Scheduled Bots
MIT License

modify CIViC bot to add dbsnp rs IDs as statements #38

Closed andrewsu closed 4 years ago

andrewsu commented 5 years ago

For example, on this record https://www.wikidata.org/wiki/Q28420832 for a KRAS mutation, the rs id (RS61764370) is noted in the item label and alias. Better if we added it as a specific statement.

Surprisingly, there is no "dbSNP ID" property yet, so this will need a property proposal...

andrawaag commented 5 years ago

Property proposed: https://www.wikidata.org/wiki/Wikidata:Property_proposal/Natural_science#dbSNP_ID

andrawaag commented 5 years ago

Property proposal is accepted: https://www.wikidata.org/wiki/Property:P6861

andrawaag commented 4 years ago

I looked at the API output of CIViC (e.g. variant id 12), but it appears that the dbSNP ID is actually not sourced from CIViC. On the rendered output of a random CIViC page the dbSNP ID is mentioned, but it is sourced directly from myvariant.info. I could integrate that into the current CIViC bot, but that would basically add a second primary API to the bot (i.e. myvariant.info). I am wondering whether it would be better to create a dedicated bot that only synchronises with myvariant.info. That way we could add other identifiers to Wikidata as well, e.g. COSMIC, although that would probably require some additional property proposals.

andrewsu commented 4 years ago

Looks like the rsid is included under the variant_aliases key (at least for https://civicdb.org/api/variants/12). I agree that a myvariant-based bot would be best. But a quick and dirty option would be to scan variant_aliases with a regex rs\d+. Your call whether that is too quick and dirty...
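
The quick and dirty option could look roughly like the sketch below. It assumes `variant_aliases` is a list of strings in the CIViC API response; the function name and the sample aliases are hypothetical. It uses a full match instead of a bare search, so an alias that merely contains `rs` followed by digits is not picked up.

```python
import re

# Matches an alias that consists entirely of an rs ID, e.g. "rs61764370" or "RS61764370".
RSID_PATTERN = re.compile(r"rs\d+", re.IGNORECASE)


def extract_rsids(variant_aliases):
    """Return lowercase rs IDs found among the aliases, in order, without duplicates."""
    rsids = []
    for alias in variant_aliases:
        candidate = alias.strip()
        if RSID_PATTERN.fullmatch(candidate):
            normalized = candidate.lower()
            if normalized not in rsids:
                rsids.append(normalized)
    return rsids


# Hypothetical alias list; only the rs ID itself is taken from this issue.
sample_aliases = ["RS61764370", "LCS6", "rs61764370"]
print(extract_rsids(sample_aliases))
```

Note that this only recovers the identifier; it says nothing about where CIViC got it from, which is the provenance concern with this approach.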

andrawaag commented 4 years ago

I am a bit too uncomfortable scraping it that way, being quite pedantic at times about maintaining provenance. However, I created a draft version of a bot that sources myvariant.info and extends CIViC items with rsids and citations from dbSNP. Source: https://github.com/SuLab/scheduled-bots/tree/master/scheduled_bots/myvariant See https://civicdb.org/api/variants/12 for its results. For now, this is the only item being processed by this bot. The bot follows this schema: https://www.wikidata.org/wiki/EntitySchema:E103
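
For illustration only, and not the code of the draft bot: a minimal sketch of pulling the rsid out of a myvariant.info-style record and pairing it with the dbSNP property P6861 mentioned above. It assumes the record is already parsed JSON and that the rsid sits under a `dbsnp` object in an `rsid` field; the function name and the sample record are hypothetical.

```python
DBSNP_PROPERTY = "P6861"  # dbSNP ID property accepted earlier in this thread


def rsid_statement(record):
    """Return (property, rsid) for a myvariant.info-style record, or None if no rsid is present.

    Assumption: the rsid is stored as record["dbsnp"]["rsid"].
    """
    dbsnp = record.get("dbsnp")
    if not isinstance(dbsnp, dict):
        return None
    rsid = dbsnp.get("rsid")
    if isinstance(rsid, str) and rsid.lower().startswith("rs"):
        return (DBSNP_PROPERTY, rsid)
    return None


# Hypothetical record; the "_id" value is a placeholder, not a real variant identifier.
sample_record = {"_id": "placeholder-variant-id", "dbsnp": {"rsid": "rs61764370"}}
print(rsid_statement(sample_record))
```

Returning `None` when the field is missing lets a caller skip variants without a dbSNP entry instead of writing an empty statement.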

andrewsu commented 4 years ago

+1 on a dedicated bot. Personally, I'm not sure the "described by source" statements are necessary. Clearly there are sometimes a lot of them, and I'm not sure they add a huge amount of value... But having said that, I'm fine either way...

andrawaag commented 4 years ago

I am on the fence wrt described by source. I added it to have some more substance than only the dbSNP rsid. Also to add a more mature reference. Having a reference that a dbSNP statement is sourced from dbSNP does look rather redundant.

I will remove the described by source property from the bot (and schema), but leave the reference for discussion. Any preference for other myvariant properties to be added?

andrewsu commented 4 years ago

To be sure we're on the same page, I am looking at this diff/item: https://www.wikidata.org/w/index.php?title=Q21851559&type=revision&diff=991027254&oldid=990985893. When you say "leave the reference for discussion", what reference are you referring to?

I'm not seeing any other critical myvariant properties to add at this moment...

andrawaag commented 4 years ago

I mean the reference list I chose, [stated in, retrieved, dbSNP ID], in https://www.wikidata.org/wiki/Q21851559#P1343. But it really is a minor thing, and I am happy to park it and revisit it when further additions are needed.