bio2rdf / bio2rdf-scripts

Scripts that Bio2RDF users have created to generate RDF versions of scientific datasets
http://bio2rdf.org/

Modifying Drugbank RDF structure #435

Closed maulikkamdar closed 8 years ago

maulikkamdar commented 8 years ago

Hi,

I am tagging this as an enhancement; we could of course also have a script that does this after RDFization. In the case of DrugBank, some of the key (calculated/experimental) properties of a drug that are useful for querying and filtering can only be retrieved by hopping through an intermediate node and then filtering on the type of the retrieved resource as well as its value. For example, to get the mass and the formula of a given drug, I need to execute the following query:

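# The two prefix IRIs below are assumptions; adjust them to match the endpoint.
PREFIX drugbank: <http://bio2rdf.org/drugbank_vocabulary:>
PREFIX dc: <http://purl.org/dc/terms/>
SELECT * WHERE {
  <http://bio2rdf.org/drugbank:DB09041> drugbank:calculated-properties ?cap .
  ?cap a ?type ;
       dc:title ?title ;
       drugbank:value ?value .
}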

I can then filter over the type to allow only http://bio2rdf.org/drugbank_vocabulary:Molecular-Formula and http://bio2rdf.org/drugbank_vocabulary:Molecular-Weight.
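A minimal Python sketch of this hop-and-filter step, assuming an in-memory list of triples rather than a real endpoint. The blank-node names, the `logP` type, and the `FORMULA`/`WEIGHT`/`LOGP` values are placeholders for illustration, not DrugBank data.

```python
# Toy triples (subject, predicate, object). All node names and values are
# illustrative placeholders; only the two wanted type IRIs come from the text.
VOCAB = "http://bio2rdf.org/drugbank_vocabulary:"
DRUG = "http://bio2rdf.org/drugbank:DB09041"

triples = [
    (DRUG, VOCAB + "calculated-properties", "_:cap1"),
    (DRUG, VOCAB + "calculated-properties", "_:cap2"),
    (DRUG, VOCAB + "calculated-properties", "_:cap3"),
    ("_:cap1", "rdf:type", VOCAB + "Molecular-Formula"),
    ("_:cap1", VOCAB + "value", "FORMULA"),
    ("_:cap2", "rdf:type", VOCAB + "Molecular-Weight"),
    ("_:cap2", VOCAB + "value", "WEIGHT"),
    ("_:cap3", "rdf:type", VOCAB + "logP"),  # hypothetical extra property
    ("_:cap3", VOCAB + "value", "LOGP"),
]

WANTED = {VOCAB + "Molecular-Formula", VOCAB + "Molecular-Weight"}


def objects(subject, predicate):
    """Return all objects matching one (subject, predicate, ?o) pattern."""
    return [o for (s, p, o) in triples if s == subject and p == predicate]


def wanted_properties(drug):
    """Hop to every intermediate node, then keep only the wanted types."""
    result = {}
    for node in objects(drug, VOCAB + "calculated-properties"):
        for node_type in objects(node, "rdf:type"):
            if node_type in WANTED:
                result[node_type] = objects(node, VOCAB + "value")[0]
    return result
```

Note that every intermediate node has to be fetched and inspected before the unwanted ones can be discarded, which is the cost being discussed here.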

This is different in KEGG, where I can just request the Triple Pattern Fragments directly, as follows:

kegg:D10169 kegg_vocabulary:mol_weight ?weight.
kegg:D10169 kegg_vocabulary:formula ?formula.

The second approach seems much simpler: it reduces the complexity of the SPARQL query and greatly reduces the number of HTTP requests when I use a TPF server, as I do not need to retrieve all the calculated properties and then filter.

This would also help me a lot later with query federation, since I would not need source-specific SPARQL constructs for a query such as "Get me all drugs with molecular weight less than 500 and hydrophobicity less than 0" (the properties can be mapped directly to a universal representation).
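To make the federation point concrete, here is a rough Python sketch, assuming flat (direct) properties in each source. The `UNIVERSAL` mapping, every predicate other than `kegg_vocabulary:mol_weight`, the drug identifiers, and all numbers are hypothetical placeholders.

```python
# Hypothetical mapping from source-specific predicates to shared names.
UNIVERSAL = {
    "kegg_vocabulary:mol_weight": "weight",
    "drugbank:molecular-weight": "weight",
    "kegg_vocabulary:hydrophobicity": "hydrophobicity",
    "drugbank:hydrophobicity": "hydrophobicity",
}

# Direct triples from two sources; identifiers and numbers are made up.
triples = [
    ("kegg:X1", "kegg_vocabulary:mol_weight", 320.0),
    ("kegg:X1", "kegg_vocabulary:hydrophobicity", -1.2),
    ("drugbank:Y1", "drugbank:molecular-weight", 610.5),
    ("drugbank:Y1", "drugbank:hydrophobicity", -0.4),
    ("drugbank:Y2", "drugbank:molecular-weight", 450.0),
    ("drugbank:Y2", "drugbank:hydrophobicity", -2.0),
]


def normalize(triples):
    """Group values per drug under the shared property names."""
    table = {}
    for subject, predicate, value in triples:
        if predicate in UNIVERSAL:
            table.setdefault(subject, {})[UNIVERSAL[predicate]] = value
    return table


def select_drugs(table, max_weight=500.0, max_hydrophobicity=0.0):
    """Return drugs below both thresholds; a missing value excludes the drug."""
    inf = float("inf")
    return sorted(
        subject
        for subject, props in table.items()
        if props.get("weight", inf) < max_weight
        and props.get("hydrophobicity", inf) < max_hydrophobicity
    )
```

With direct properties the source-specific part shrinks to a lookup table, and the filter itself is written once.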

Let me know what you think!

I understand that the current DrugBank data model makes it possible to record whether a measure is "calculated" or "experimental", but that can also be done another way, such as:

<http://bio2rdf.org/drugbank:DB09041> drugbank:molecular-formula ?resource.
?resource rdf:type drugbank:Calculated-Property ;
drugbank:value ?value 

This structure would also significantly reduce the number of TPF requests (not as much as the KEGG-style approach, but still), as I can select on molecular-formula directly rather than retrieving all calculated properties.

micheldumontier commented 8 years ago

I believe that the DrugBank data also contains the _source_ of the experimental or calculated property, so either you would need to reify the property with this metadata, or you use the approach that was taken here: the n-ary object model.

While the former has been investigated [1], it is not widely used and is not truly compatible with ontology-based query answering [2]. It is also not compatible with the nanopublication publication model [3].

[1] http://ceur-ws.org/Vol-1546/paper_11.pdf
[2] http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3632999/
[3] http://link.springer.com/chapter/10.1007%2F978-3-319-25010-6_18

In any case, I would suggest that you create SPARQL queries to rewrite the data structure into something simpler for TPF.
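The same rewrite can be sketched in Python over in-memory triples, as a rough illustration of the transformation rather than the SPARQL itself. The `experimental-properties` predicate, the rule that derives a direct predicate by lowercasing the type's local name, and the sample node and value are assumptions. The flattened output also drops the calculated/experimental distinction and the source metadata.

```python
VOCAB = "http://bio2rdf.org/drugbank_vocabulary:"
RDF_TYPE = "rdf:type"
# Predicates that link a drug to an n-ary property node. The second one is an
# assumed name, mirroring "calculated-properties".
LINKS = (VOCAB + "calculated-properties", VOCAB + "experimental-properties")


def flatten(triples):
    """Rewrite drug -> node -> (type, value) into drug -> direct-predicate -> value."""
    types = {s: o for s, p, o in triples if p == RDF_TYPE}
    values = {s: o for s, p, o in triples if p == VOCAB + "value"}
    flat = []
    for drug, predicate, node in triples:
        if predicate not in LINKS or node not in types or node not in values:
            continue
        node_type = types[node]
        if not node_type.startswith(VOCAB):
            continue  # skip types outside the vocabulary namespace
        # Assumed naming rule: "Molecular-Weight" becomes "molecular-weight".
        local_name = node_type[len(VOCAB):].lower()
        flat.append((drug, VOCAB + local_name, values[node]))
    return flat


# Placeholder input: one n-ary node with a made-up value.
DRUG = "http://bio2rdf.org/drugbank:DB09041"
sample = [
    (DRUG, LINKS[0], "_:n1"),
    ("_:n1", RDF_TYPE, VOCAB + "Molecular-Weight"),
    ("_:n1", VOCAB + "value", "WEIGHT"),
]
flat_sample = flatten(sample)
```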

maulikkamdar commented 8 years ago

Thanks for sharing the papers!