Disease BridgeDb - GitHub Issues

Dear BridgeDb people, have you thought about a Disease BridgeDb? There are several databases and ontologies with disease IDs and ontology terms that cry out for integration into the BridgeDb world. The mapping files are already available, so we don't have to do the disease mapping ourselves. I used and tested them while preparing the gene-disease-provenance dataset paper. DisGeNET has a good repository of these, linking CUI, OMIM, ORPHA, MeSH, etc. I first tried to create a CyTargetLinker linkset (which worked without problems), but at the moment I can only link them to one ID system. Tina mentioned she could add a script that allows linking multiple ID systems, but the most elegant solution would be a Disease BridgeDb. Best regards, Freddie

Interesting idea Freddie, we have indeed discussed that in a few settings already. Can you write down some use cases that would actually use that BridgeDb database? That will make it easier to decide whether this is the right solution or not. I like the idea of starting this from DisGeNET if we decide to do it. But I think that should also mean we get Laura and Jante involved from the start. Somebody will probably bring up that we could also use the “improved DisGeNET” that will be in Wikidata. That might be an option, but I am not sure what the current organizational quality of Wikidata is (some of Denise’s SPARQL course examples did not make sense yet, like finding variants that really are effects on gene expression, not variants). Of course that also links to the fact that Nuria is now working on Wikidata, and she is one of the people who best understands the decisions made on disease mapping in DisGeNET. Could this also lead to some ideas on what the best primary ID would be if you can choose during FAIRification processes? That would be really useful for FAIRplus. (Note that the task itself could be relevant for both EJP RD and FAIRplus.) Best, Chris

One example use case: authors of the Blau book would like to extend the pathways (PWs) I'm currently making with disease linkouts (they add OMIM to their PWs, but links to other databases would be appreciated). Being able to connect phenotype information to these diseases would also be useful (but that would then be added from the linkset). I would also like to mention that the example I used for the workshop does make sense (from the SPARQL point of view). If the data behind it is not correct, Wikidata allows you to fix that yourself (which I do regularly for metabolites). If we could get some people on board who know more about this, I would be happy to explain Wikidata to them if needed. We could check which core resources are identified by ELIXIR regarding diseases? Kind regards, Denise Slenter

The concrete example we (Denise and I) discussed over lunch was indeed adding disease nodes to a pathway. One option would be to provide gene-disease associations in a CyTargetLinker linkset; with a Disease BridgeDb, the linkset could link to multiple disease resources. DisGeNET, by the way, also has variant-disease associations (I am planning to include them in the linksets, too). So the Disease BridgeDb would allow linking variants and genes to diseases from different databases and ontologies. To mention a few that I know off by heart and to which DisGeNET provides mappings: OMIM, CUI, ORPHA, ORDO, MeSH, DOID. @Denise: Recommended resources for ELIXIR rare diseases are DOID and Orphanet, if I remember correctly. @Chris: DisGeNET was mentioned in the survey, but not to a very large extent. In the current curated gene-disease association dataset from DisGeNET, about 50% of CUIs can be mapped to OMIM IDs and about 35% to ORPHA IDs. But of course OMIM is for Mendelian diseases and ORPHA for rare diseases, so I would not expect a very high overlap anyway. For my monogenic rare disease dataset, more than 95% of CUIs could be mapped to OMIM. Is it possible to load mapping information from a BridgeDb into Wikidata, or vice versa? With about 10 000 unique diseases (unique CUI identifiers) involved in 81 000 gene-disease associations and 160 000 variant-disease associations, an automated check and import would be good to have. Taking that one step further, diseases can be linked to phenotypes and disease superclasses (e.g. neuronal disorders), which might help in data clustering or subtyping. A use case would be, e.g., to extract from a dataset all genes that are downregulated and involved in neurological diseases. Best regards, Freddie
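
The coverage figures quoted above (the share of CUIs that map to OMIM or ORPHA) could be recomputed from a mapping table along the following lines. This is only a minimal sketch: the row layout (CUI, vocabulary, code), the vocabulary labels and all identifiers are placeholder assumptions, not a real DisGeNET extract.

```python
from collections import defaultdict

# Hypothetical rows in the shape (CUI, vocabulary, code); the identifiers
# below are placeholders for illustration, not a real DisGeNET extract.
MAPPINGS = [
    ("C0000001", "OMIM", "100001"),
    ("C0000001", "ORPHA", "11"),
    ("C0000002", "OMIM", "100002"),
    ("C0000003", "MSH", "D000003"),
    ("C0000004", "ORPHA", "44"),
]


def coverage(rows, all_cuis):
    """Return, per vocabulary, the fraction of CUIs with at least one mapping."""
    mapped = defaultdict(set)
    for cui, vocabulary, _code in rows:
        mapped[vocabulary].add(cui)
    total = len(all_cuis)
    return {vocab: len(cuis & all_cuis) / total for vocab, cuis in mapped.items()}


# The set of CUIs that occur in the (hypothetical) association dataset.
ALL_CUIS = {"C0000001", "C0000002", "C0000003", "C0000004"}
COVERAGE = coverage(MAPPINGS, ALL_CUIS)
for vocab, fraction in sorted(COVERAGE.items()):
    print(f"{vocab}: {fraction:.0%}")
```

Counting CUIs rather than mapping rows keeps a disease with several OMIM entries from being counted more than once.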

Thank you for the details ;) If we had the identifier information in Wikidata, I could create a diseases.bridge (since the code to create a metabolites.bridge also works with Wikidata). See for example the Wikidata page on Rett syndrome: https://www.wikidata.org/wiki/Q917357 . This doesn't have all the mappings that Freddie mentioned (OMIM, MeSH and DOID are in there; Orpha = Orphanet?, CUI = UMLS CUI?, ORDO I can't find...). If we use DisGeNET (or a combination of that and Wikidata), we need to attribute it to them and mention their version number (according to their license statement). That would be the same as for HMDB and ChEBI, right @Egon? Kind regards, Denise Slenter
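
To see which disease identifiers Wikidata holds for an item such as Rett syndrome, a SPARQL query could be assembled roughly as sketched here. The property IDs in the table are assumptions that should be checked on wikidata.org before use, and the helper name build_query is made up for this example. The query relies on the wd: and wdt: prefixes that the Wikidata Query Service predefines.

```python
# Wikidata property IDs for external disease identifiers. These are an
# assumption about the relevant properties; verify each on wikidata.org.
DISEASE_ID_PROPERTIES = {
    "omim": "P492",
    "mesh": "P486",
    "doid": "P699",
    "orphanet": "P1550",
    "umls_cui": "P2892",
}


def build_query(item_qid, properties=DISEASE_ID_PROPERTIES):
    """Build a SPARQL query listing the external IDs recorded for one item."""
    select_vars = " ".join(f"?{name}" for name in properties)
    # OPTIONAL keeps the row even when an identifier is missing on the item.
    optionals = "\n".join(
        f"  OPTIONAL {{ wd:{item_qid} wdt:{pid} ?{name} . }}"
        for name, pid in properties.items()
    )
    return f"SELECT {select_vars} WHERE {{\n{optionals}\n}}"


QUERY = build_query("Q917357")  # Rett syndrome, the item linked above
print(QUERY)
```

The printed query can be pasted into the Wikidata Query Service; unbound columns in the result would point to identifiers that are not yet recorded for that item.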

Whatever database we use, we should mention it and its version number, not just for attribution but also for provenance. For the linksets we use VoID headers for that purpose. I am not sure we really have a clear policy for the databases, but I think we should. Note that if you use Wikidata this need does not go away. While, at least according to Andra, Wikidata always captures the provenance, it is not always immediately clear where a specific link comes from. I got the impression that for acceptance of Wikidata it is really important to explicitly state where the information came from, and to mention Wikidata as the only source only if that is really the case (i.e. the edit was made on Wikidata itself without reference to another source, which should not happen). Oh, and just as important: if you do want to use Wikidata and DisGeNET has more link types that are not in Wikidata yet, I would definitely talk to Andra and Nuria. They may already have plans to integrate those, or maybe they are waiting for another round of curation, which we might then want to wait for too. About the pathways: I am not really sure. Of course you can add anything to pathways, but this is basically gene annotation (you can already map the whole pathway to a disease using the ontology tags). Annotations typically belong in the back pages of pathways, not in the pathways themselves. That does not exclude using a BridgeDb database (or actually two) to populate the back page: two because the first would link genes to diseases and the second would link different disease IDs. If you really want to explore gene/disease/phenotype relationships, I think you would rather do that in networks. But then again you would need a gene-disease linkset first, before you link diseases amongst themselves. Agree? Best, Chris

Having the mapping between the diseases in the linkset would be very nice though! We could of course also create a linkset for every database if you would want to show overlap between them. Best, Tina

I agree. But after thinking about it for a while, I also think the following. This will be a serious amount of work, because we will need to understand the logic behind the mappings that were made (these are not all “same as” and probably include a hierarchy or something similar, like we had with ChEBI). That will probably lead to multiple options, which means we will need to think about how to apply scientific lenses in this domain. A lot of work will also go into problems that we will have to fix through better curation. To make this really useful we also need disease-to-gene mappings. That might be somewhat simpler, since we could use DisGeNET for that. Note that we would then want to know what the central disease-gene mappings in DisGeNET really are (my memory says MeSH), and we would probably use these as the central reference column in the disease mapping database as well (which may or may not work against using Wikidata). Note that besides DisGeNET we could probably also use Orphanet. But we could do that in parallel, a bit like the linksets we provide for CyTargetLinker, where we offer multiple sets created from different sources in parallel. Best, Chris

The identifier mappings are already done. DisGeNET has mapping files available that map CUI to OMIM, Orpha, MeSH, HPO and a few more. So someone else did all the work and hard thinking about mapping one disease ID to another. Unless we want to rethink that (I won't exclude that at some point we will, if we find something that could be done better!), we can just collect them and bring them into the right format for a first version. And yes, some diseases would have multiple IDs.
Disease-to-gene mappings are also available. OMIM has one with a quite restrictive license (morbidmap); I used that as a starting point for my first try. DisGeNET also has one, with a very liberal license, as well as a variant-disease mapping. The gene-disease associations are usually NCIT (gene) - CUI (disease), but they provide additional information and mappings in MeSH, Orpha, OMIM and HPO. I have not checked Orphanet yet, but I would guess they have something like that, too. Providing one CyTargetLinker linkset per disease database/gene-disease mapping would work and should not be too much work; I already did that for the DisGeNET file (gene-disease and variant-disease). We just won't be able to use OMIM because of its restrictive license, unless we work around it by using the mapping link from DisGeNET. I presume one linkset for all diseases in the different formats/databases would be more... elegant. Best regards, Freddie
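
Bringing such mapping files "into the right format" with CUI as the central reference column could look roughly like the sketch below. It treats every mapping as a plain equivalence, which the thread itself questions (not all mappings are "same as"), and the rows, vocabulary labels and function names are placeholder assumptions for illustration only.

```python
from collections import defaultdict

# Hypothetical mapping rows (CUI, vocabulary, code); placeholder values.
ROWS = [
    ("C0000001", "OMIM", "100001"),
    ("C0000001", "OMIM", "100009"),
    ("C0000001", "ORPHA", "11"),
    ("C0000002", "MSH", "D000002"),
]


def build_hub(rows):
    """Group codes by CUI and vocabulary, keeping CUI as the central column."""
    hub = defaultdict(lambda: defaultdict(set))
    for cui, vocabulary, code in rows:
        hub[cui][vocabulary].add(code)
    return hub


def translate(hub, vocabulary, code, target):
    """Map a code to another vocabulary by going through the shared CUI."""
    results = set()
    for by_vocab in hub.values():
        if code in by_vocab.get(vocabulary, set()):
            results |= by_vocab.get(target, set())
    return results


HUB = build_hub(ROWS)
print(sorted(HUB["C0000001"]["OMIM"]))         # one disease, several OMIM IDs
print(translate(HUB, "ORPHA", "11", "OMIM"))   # ORPHA -> OMIM via the CUI
```

Storing sets per vocabulary reflects the point that some diseases carry multiple IDs in the same database.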

Another use case, at a link set level, is the "biomarker" project our intern Josien is currently working on, where we mostly use OMIM, but the BridgeDb IMS would surely make it more interoperable: https://github.com/egonw/biomarkers/blob/master/josien.ttl (We use mostly OMIM IRIs there) Egon

Yes, good example! But then a bridgedb mapping file would be the solution right, not a LinkSet? Kind regards, Denise Slenter

The Disease BridgeDb should link disease IDs from different databases and ontologies. Linking diseases with other information, such as gene-disease, variant-disease or metabolite-disease associations, would be done in (or become) linksets for CyTargetLinker. Best regards, Freddie

They are basically the same thing: BridgeDb mapping files come in two flavors: Derby and link sets. The only difference is that the first works with Source Code and Identifier, and the latter with IRIs. Egon
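
The difference between the two flavors can be illustrated with a small round trip between a (source, identifier) pair and an IRI. This is a sketch under stated assumptions: the IRI prefixes are invented example.org placeholders and the function names are hypothetical; none of this is the actual BridgeDb API or its registered data sources.

```python
# Hypothetical IRI prefixes per data source, for illustration only; a real
# mapping file would take these from the data source definitions in use.
IRI_PREFIXES = {
    "OMIM": "https://example.org/omim/",
    "Orphanet": "https://example.org/orphanet/",
}


def xref_to_iri(source, identifier):
    """Derby-style (source, identifier) pair -> link-set-style IRI."""
    return IRI_PREFIXES[source] + identifier


def iri_to_xref(iri):
    """Link-set-style IRI -> Derby-style (source, identifier) pair."""
    for source, prefix in IRI_PREFIXES.items():
        if iri.startswith(prefix):
            return source, iri[len(prefix):]
    raise ValueError(f"no known data source for {iri}")


IRI = xref_to_iri("OMIM", "100001")  # placeholder identifier
print(IRI)
print(iri_to_xref(IRI))
```

As long as each data source has one agreed IRI pattern, the same mapping content can be written out in either flavor.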

Yes, that's what I thought. I could rewrite the metabolomics.bridge code to create diseases.bridge from Wikidata (at least). @Freddie: In which format are the links from DisGeNET provided? TSV, CSV, XML, etc.? Or could I use the Google spreadsheet you sent a while back (that was derived from DisGeNET, right)? Kind regards, Denise Slenter

DisGeNET provides the information as RDF or CSV, with the CSV being the more up-to-date one. Ideally we would not use a self-constructed file with information copied together by hand, which is too much manual work; ideally we can use something like Wikidata as one resource. But I would wait with that until after the EJP annual retreat at the end of May. I just learned today on the telcon marathon that WP12, the FAIRification work package, is using disease mapping as a demonstrator for their mapping tools. I am also in this group now and will see if I can get the information from this output. Best regards, Freddie

Egon already explained that BridgeDb uses both a Derby database and linksets, and that they are basically the same thing for different applications. But I think what Denise meant below was that this is a BridgeDb example and not a CyTargetLinker linkset example. I think that is probably correct, although Josien could in principle also create a network that would contain, among other things, the biomarker metabolites and then link the diseases to that using CyTargetLinker. I think it is also a nice example showing that we would not necessarily need a linkset for genes to diseases first. But I still think that would be a good idea too; for most gene-related use cases we would need it. If I understand Josien’s project correctly, its outcome might actually be a parallel metabolite-disease linkset where the predicate is “is biomarker for”. I would talk with the DisGeNET team anyway before creating the links. I think you could ask them for an RDF update too, if that would be handier. Best, Chris

bridgedb / BridgeDb

Disease BridgeDb #101