enanomapper / data.enanomapper.net

Report issues or feature requests for the eNanoMapper search page at
http://search.data.enanomapper.net

JSON export #11

Closed ghost closed 8 years ago

ghost commented 8 years ago

This report is similar to issue #10. Here are my observations for JSON downloads.

Generally the JSON downloads look very good, basically what I would expect from the API specs. The links seem to be functional and the content makes sense (dataset-specific observations below). The only thing that confuses me is that the dataEntries are composed of compounds rather than substances, although the compound URI correctly resolves to a substance URI.
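To make the confusing part concrete, here is a minimal sketch of the dataset shape being described: entries keyed by a compound URI that is expected to resolve to a substance. The field names and URIs are illustrative assumptions, not the exact eNanoMapper schema.

```python
# Illustrative sketch of the dataset JSON shape discussed above.
# Field names and URIs are assumptions, NOT the exact eNanoMapper schema.
dataset = {
    "dataEntry": [
        {
            # The entry is keyed by a "compound" URI, even though it
            # actually identifies a substance (nanomaterial).
            "compound": {"URI": "https://example.org/compound/123"},
            "values": {"https://example.org/feature/1": 42.0},
        }
    ]
}

def entry_uris(ds):
    """Collect the compound URIs of all data entries."""
    return [e["compound"]["URI"] for e in ds["dataEntry"]]

print(entry_uris(dataset))  # ['https://example.org/compound/123']
```

In a real client, each of these compound URIs would be dereferenced to find the corresponding substance URI.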

Content-wise, protein corona and nanowiki seem to be OK at first glance, but I am confused by the Marina dataset. Like its CSV version it has only 6 substances, but many values for the same property (the property URIs give a 404 error, possibly an escape-sequence problem).

vedina commented 8 years ago

Thanks.

It is true that the dataset serialisation mimics the OpenTox Dataset API, but it is slightly different, as it has to work with substances, not compounds. Using the compound entry to carry substance URIs is a hack that keeps some compatibility. Maybe there are better approaches.

Note that the dataset serialisation is not the native one for the experimental data. The datasets are an attempt to make the experimental data tabular, and this is not straightforward, as there might be more than one way to do so. The native format is available via the substance and study API, e.g. https://apps.ideaconsult.net/enanomapper/bundle/4/substances

For example, it may be more instructive to look at the study details of the FP7 MARINA data: https://apps.ideaconsult.net/enanomapper/substance/XLSX-7011cea0-1011-3f8b-9e8a-b3289fed836a/study. This is a very typical example; most of the data we are aware of will be in a similar form.

Overall, IMHO the recommended approach is to retrieve the data through the substance API, build tabular datasets as needed, and decide what belongs in the same column(s). This is the approach taken by the NTUA conjoiner service described in the publication doi:10.3762/bjnano.6.165.
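The recommended workflow could be sketched roughly as follows. The records here are hardcoded stand-ins for what the substance/study API might return; the field and endpoint names are illustrative assumptions, not the actual service response.

```python
# Hedged sketch of "retrieve per-substance study data, then tabulate":
# stand-in records simulate a substance/study API response. Field and
# endpoint names are assumptions, not the real eNanoMapper schema.
records = [
    {"substance": "NM-100", "endpoint": "SIZE", "value": 21.0},
    {"substance": "NM-100", "endpoint": "ZETA_POTENTIAL", "value": -30.0},
    {"substance": "NM-101", "endpoint": "SIZE", "value": 7.0},
]

def tabulate(recs):
    """Pivot study records into one row per substance,
    one column per endpoint: {substance: {endpoint: value}}."""
    rows = {}
    for r in recs:
        rows.setdefault(r["substance"], {})[r["endpoint"]] = r["value"]
    return rows

table = tabulate(records)
print(table["NM-100"])  # {'SIZE': 21.0, 'ZETA_POTENTIAL': -30.0}
```

The interesting (and non-trivial) decision is which measurements belong in the same column, e.g. whether sizes measured by different protocols should be merged or kept apart.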

ghost commented 8 years ago

> It is true that the dataset serialisation mimics the OpenTox Dataset API, but it is slightly different, as it has to work with substances, not compounds. Using the compound entry to carry substance URIs is a hack that keeps some compatibility. Maybe there are better approaches.

Maybe redefine the OpenTox API to require substance entries, and make compound a subclass of substance to keep compatibility.

> For example, it may be more instructive to look at the study details of the FP7 MARINA data: https://apps.ideaconsult.net/enanomapper/substance/XLSX-7011cea0-1011-3f8b-9e8a-b3289fed836a/study. This is a very typical example; most of the data we are aware of will be in a similar form.

Ah, here are the concentrations, time points, replicates etc.! I think I am slowly getting it. The Nanowiki, Protein corona and Modena datasets are more like traditional OpenTox datasets (more or less tabular, with aggregated results, directly usable for modelling). Marina is more like a ToxBank investigation, with multiple results, the protocol and raw experimental data. To make the Marina data usable for modelling, read-across etc. it would be necessary to add nanoparticle characterisations and to calculate something like an EC50.

I am not sure how difficult it would be to provide a clearer distinction between these two types of datasets. Maybe use the ToxBank investigation class for Marina-type data, so it would not show up as a dataset (saving the hassles and inconsistencies of tabular serialisation). If someone wants to calculate, let's say, EC50s from the raw data, the results could be submitted as an OpenTox dataset.
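The "calculate something like an EC50 from the raw data" step could be sketched as below. This uses simple linear interpolation between the two doses bracketing 50% response; a real workflow would fit a Hill or log-logistic curve. The input points are invented example data.

```python
# Hedged sketch of deriving an EC50-like summary from raw dose-response
# points, as suggested for making MARINA-style data usable for modelling.
# Linear interpolation only; a real analysis would fit a Hill/log-logistic
# model. The data points are invented for illustration.
def ec50_interpolated(points):
    """points: iterable of (concentration, response_percent) with
    response increasing with dose. Returns the interpolated
    concentration at 50% response, or None if 50% is not bracketed."""
    pts = sorted(points)
    for (c1, r1), (c2, r2) in zip(pts, pts[1:]):
        if r1 <= 50.0 <= r2:
            frac = (50.0 - r1) / (r2 - r1)
            return c1 + frac * (c2 - c1)
    return None

ec50 = ec50_interpolated([(1, 10), (10, 40), (100, 80)])
print(ec50)  # 32.5
```

The computed value would then be one candidate for submission back as a row in an aggregated OpenTox-style dataset.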

vedina commented 8 years ago

Regarding the datasets: these are not the native way of representing the experimental data. The bundles API was introduced partly to make a distinction from OpenTox datasets while still offering a similar serialisation.
In fact the API (and the UI, to be exact) allows mixing and matching substances and endpoints from different datasets, so these substance datasets are not necessarily fixed entities, but more like collections of experimental data. For example, one can try to mix the NM-* physchem characterisation from nanoWiki with the MARINA NMs; all the info is already available, but it is not necessarily in the same dataset. Another bundle could contain subsets of both datasets.
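The mix-and-match idea can be illustrated with a small sketch: a bundle as a selection of substances and endpoints drawn from several source datasets rather than a fixed table. Dataset contents and endpoint names here are invented for illustration.

```python
# Illustrative sketch of a "bundle" as a mix-and-match collection over
# several source datasets. The datasets, substance IDs and endpoint
# names are invented examples, not actual eNanoMapper content.
nanowiki = {"NM-100": {"SIZE": 21.0}, "NM-101": {"SIZE": 7.0}}
marina = {"NM-101": {"CELL_VIABILITY": 85.0}}

def make_bundle(sources, substances, endpoints):
    """Select the chosen substances and endpoints across datasets,
    merging values for the same substance into one entry."""
    bundle = {}
    for ds in sources:
        for s, vals in ds.items():
            if s in substances:
                picked = {e: v for e, v in vals.items() if e in endpoints}
                bundle.setdefault(s, {}).update(picked)
    return bundle

b = make_bundle([nanowiki, marina], {"NM-101"}, {"SIZE", "CELL_VIABILITY"})
print(b)  # {'NM-101': {'SIZE': 7.0, 'CELL_VIABILITY': 85.0}}
```

This matches the point above: the same underlying experimental data can back many different bundles.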

IMHO the cleaner approach is to have a separate service that combines the experimental data in one way or another and feeds a modelling service. One might want to arrange the tables for modelling in different ways, or do dose-response modelling or other non-classic-QSAR modelling directly on the non-tabular data.

Overall, I think the OpenTox tabular datasets were designed and optimised for modelling, but they are too restrictive to describe the experimental data in all of its complexity. So the bundles are provided for convenience only; it is not mandatory to use them for modelling. Adding types for the bundles makes sense, but more from an information point of view than for restricting or hiding the data.

vedina commented 8 years ago

As AMBIT relies on the current JSON for visualisation, the UI and use cases other than eNanoMapper, large changes to the JSON are not really an option in the near future.

@helma As an alternative, I suggest putting some effort into the JSON-LD design in order to incorporate these thoughts. The RDF issue is #9 (which also covers JSON-LD).