enanomapper / data.enanomapper.net

Report issues or feature requests for the eNanoMapper search page at
http://search.data.enanomapper.net

Complete and nonredundant database dump #92

Closed ghost closed 7 years ago

ghost commented 7 years ago

I am having a hard time debugging the import of eNM data into our system, because (a) a lot of data is duplicated, e.g. between datasets, matrices, nanoparticles and studies, and (b) I cannot find a generic way to parse all data from all bundles. For example, after a lot of trial and error with different routes, we found that the only way to obtain the complete Protein Corona dataset (including proteomics, units, conditions, annotations etc.) is to follow the links bundle -> dataset -> nanoparticles -> studies, and this requires some workarounds (e.g. parsing JSON from text entries). Unfortunately the same procedure seems to give only incomplete results for the MODENA dataset. I understand that each bundle has a different structure, but figuring out the structure and content currently involves a lot of guesswork and reverse engineering (and I am also lost with the GUI for this purpose). For this reason I would appreciate a simple way to download a complete and non-redundant copy of the eNM database (or of individual bundles) in any format. Having the complete content in machine-readable form would help us a lot to get a better overview of the currently available eNM content and to debug query, download and visualisation strategies.
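For context, the bundle -> dataset -> nanoparticles -> studies traversal described above can be sketched as follows. This is a sketch only: the `fetch` callback stands in for the HTTP layer, and the key names used (`dataset`, `dataEntry`, `compound`, `study`) are assumptions about the JSON shape rather than confirmed AMBIT fields.

```python
def collect_studies(bundle_uri, fetch):
    """Follow bundle -> dataset -> nanoparticles -> studies.

    `fetch(uri)` must return the parsed JSON for `uri`; the key names
    used here are illustrative assumptions, not confirmed AMBIT fields.
    """
    bundle = fetch(bundle_uri)              # bundle -> dataset link
    dataset = fetch(bundle["dataset"])
    studies = []
    for entry in dataset["dataEntry"]:      # one entry per nanoparticle
        substance_uri = entry["compound"]["URI"]
        # study records hang off the substance resource
        studies.extend(fetch(substance_uri + "/study")["study"])
    return studies
```

Injecting `fetch` keeps the walk testable offline and makes it easy to add workarounds (e.g. a `fetch` wrapper that runs `json.loads` on JSON embedded in text entries).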

vedina commented 7 years ago

The RDF dump can be done at the substance level or at the bundle level. The datasets etc. intentionally provide different views of the same data; this is the typical way of serving different user needs. The bundles include the very same substances, and the RDF output of bundles and substances is the same.

All the bundles have the same structure, but of course the particular endpoints are different.

egonw commented 7 years ago

@vedina, can you give some guidance on how to download RDF for all data? It would make a useful two-page tutorial :)

(Related: I was trying to figure out which AMBIT Java code does which RDF export; see https://github.com/enanomapper/data.enanomapper.net/issues/88)

Also related is this issue: https://github.com/enanomapper/data.enanomapper.net/issues/80

vedina commented 7 years ago

@egonw - the substance/study/bundle RDF export is done by the class you are working on :)

The dataset RDF (implementing the old OpenTox RDF for datasets) is different. The same goes for the OpenTox feature RDF, etc. All are descendants of QueryRDFReporter.

egonw commented 7 years ago

Only that class? Mmm...

egonw commented 7 years ago

@vedina so, how is this class then called (at a Java method level) for the various entities in the OpenTox API (bundle, dataset, substance, compound)?

vedina commented 7 years ago

@egonw

api-docs

vedina commented 7 years ago

@helma - you should choose whichever of the serializations fits your needs.

e.g. retrieving the RDF of all bundles one by one should give a complete dump (the RDF dump has been considerably improved during the last few months by @egonw's efforts)
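A complete dump along these lines could be scripted; the sketch below just derives one request per bundle. The base URL, the `/substance` suffix, and RDF/XML content negotiation via an `Accept` header are assumptions about a typical AMBIT deployment, not confirmed endpoints.

```python
BASE = "https://data.enanomapper.net"  # assumed base URL

def bundle_rdf_requests(bundle_uris, mime="application/rdf+xml"):
    """Map each bundle URI to a (download URL, headers) pair for its
    RDF dump, using the assumed /substance sub-resource of each bundle
    and content negotiation via the Accept header."""
    return [(uri.rstrip("/") + "/substance", {"Accept": mime})
            for uri in bundle_uris]
```

Issuing these requests in a loop (with any HTTP client) and concatenating the responses would give the per-bundle dump suggested above.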

vedina commented 7 years ago

@helma - why do you say each bundle has a different structure?

ghost commented 7 years ago

@vedina I tried RDF (JSON-LD) a couple of months ago, and at that time the results were very incomplete (if I remember correctly, substance entries were missing), so I reverted to plain JSON (which is what I am talking about). With plain JSON I cannot get proteomics data at the substance level, which is why I have to use studies (if I remember correctly, proteomics is included in the matrix, but there either units or conditions were missing).

why do you say each bundle has different structure?

It was my impression from looking at the different JSON serialisations. For the Protein Corona bundle, studies provide the most detailed information, while they are rather uninformative for the MODENA bundle (features are named EC50, EC25, EC50 SLOPE and can be differentiated only by their source URIs, which lead to almost empty ontology entries). I am of course not sure about your internal structure ...

vedina commented 7 years ago

@helma The RDF is now quite different and much more complete than it was two months ago.

You are probably talking about the dataset RDF (at bundle/id/dataset). I would not recommend using it at all. The dataset is an attempt to squeeze the original study data, which is more complex than a spreadsheet, into a spreadsheet structure. This is not optimal (it results in data loss) and never could be. Please consider using bundle/id/substance instead; it contains the original study data and is the same as what is retrieved via '/substance'.
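In code, the recommendation amounts to choosing one bundle sub-resource over the other; a minimal sketch, with the URI layout assumed from the bundle/id/dataset and bundle/id/substance paths mentioned above:

```python
def preferred_view(bundle_uri):
    """Return the recommended and discouraged views of a bundle.

    Path suffixes follow the bundle/id/dataset and bundle/id/substance
    routes discussed in this thread; the exact layout is an assumption.
    """
    base = bundle_uri.rstrip("/")
    return {
        "avoid": base + "/dataset",     # spreadsheet-style view, lossy
        "prefer": base + "/substance",  # original study data, same as /substance
    }
```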

egonw commented 7 years ago

@helma the RDF in the latest AMBIT code is greatly improved... please check the RDF dumps of this server (I am not sure which software version is running on data.enanomapper.net): https://apps.ideaconsult.net/enmtest/

ghost commented 7 years ago

@vedina Is JSON-LD fully supported, or should I use another format? @egonw I intend to publish code and data in a Docker image together with the paper. For this reason I would prefer the official version, but I will give the dev version a try if I run into problems. I will forward the URI to Denis, who had trouble reproducing your SPARQL queries.

vedina commented 7 years ago

Yes, you can use JSON-LD; it is serialized from the same RDF data model.
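Because JSON-LD is plain JSON over the same RDF model, a standard JSON parser is enough to get started; a minimal sketch (the document below is invented for illustration and does not reflect real eNM vocabulary):

```python
import json

# JSON-LD parses with any JSON library; @context maps short keys to
# IRIs and @graph holds the nodes. This tiny document is made up for
# illustration only.
doc = json.loads("""
{
  "@context": {"name": "http://schema.org/name"},
  "@graph": [
    {"@id": "http://example.org/substance/1", "name": "NM-100"},
    {"@id": "http://example.org/substance/2", "name": "NM-101"}
  ]
}
""")

names = [node["name"] for node in doc["@graph"]]
```

For full RDF semantics (expansion, framing, conversion to triples) a dedicated JSON-LD processor would be needed, but plain JSON access like this often suffices for extracting values.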

vedina commented 7 years ago

@helma I'll update both versions shortly.

Help/About/Version shows the version (mousing over the menu item gives the build number and date)

ghost commented 7 years ago

Thanks a lot! Will try RDF after the update and report any problems.

egonw commented 7 years ago

@helma I just checked the version, and it's running an AMBIT version with the most recent RDF export code (commit 7933); not perfect, but I'll keep improving it. Christoph, please file each issue you run into separately :)

egonw commented 7 years ago

@helma oh, one more thing... data.enanomapper.net has an older NanoWiki version.... the latest of NanoWiki is loaded on /enmtest/

ghost commented 7 years ago

It seems that apps.ideaconsult.net/enmtest/ runs 3.0.3 r7944 and data.enanomapper.net runs 3.0.3 r7933. Will try both versions ...

egonw commented 7 years ago

@helma this is 7933: https://github.com/egonw/ambit2-all/commit/7bbd10936a5312dfa1a90258a2457bd0554704c5 (8 Nov)... no RDF patches since then.

gebele commented 7 years ago

@helma @egonw I do not see Protein Corona or MODENA via https://apps.ideaconsult.net/enmtest/bundles ?

egonw commented 7 years ago

@gebele I would suggest taking everything from data.enanomapper.net, except for NanoWiki, which should be taken from /enmtest/

vedina commented 7 years ago

Thanks @egonw, good suggestion. The current /enmtest content was intended for testing the new NanoWiki and caNanoLab only.

gebele commented 7 years ago

OK, I understand. For the SPARQL endpoint and examples I prefer a single official URI. When will you update the official NanoWiki?