Open mroos opened 8 months ago
This issue is to define a specific case that we can carry out, using a minimal number of data elements (including a minimal number of CDEs), to demonstrate the use of CARE-SM in queries.
Special request
It would be very useful to have a realistic and compelling query that includes the CARE-SM data elements related to Clinical measurements and Genetic assessment, together with a description of the sizes and characteristics of the datasets it would be run on and the expected results.
Perhaps something based on B1MG D4.1 - Secure cross-border data access roadmap - 1v0:
During the biohackathon on May 3rd, @wna-se and @andrawaag made this a place to collect example queries to be transformed into SPARQL.
@markwilkinson Added you here as discussed during today’s meeting. Please link to / add any queries that you can share here and / or to the ejp-rd-vp/DistributedAnalysis repository.
@andrawaag : @mroos said that you would be a great person to take the lead on this task. As we are preparing the synthetic data we have for the VP, it would be great to have some examples of genomics-related queries that we could prioritise mapping to, ideally a few queries based on @mroos’ mindmap (see reference in the description of #57) and using the CARE-SM and FAIR Genomes semantic schema.
Edit: @ericprud : I’m also tagging you here as discussed during today’s meeting. Some examples would be very helpful.
Update 10/6: @hbcesar, Annika and @andrawaag are working on example queries. @pabloalarconm asked to be provided with example data for the queries (CSV + conversion method).
@andrawaag : may need an intermediate step first (chicken-and-egg). We need to find a way to get to the RDF, whereas Wolmar (and colleagues) need help converting from data that works for Beacon.
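As a starting point for the "CSV + conversion method" request, here is a minimal sketch of what such a conversion could look like using only the Python standard library. The namespace `https://example.org/`, the column names and the one-triple-per-cell layout are all assumptions made for illustration; they are not the CARE-SM model and not the actual synthetic data columns.

```python
import csv
import io

# Hypothetical namespace and column names, for illustration only;
# they are not taken from CARE-SM or from the VP synthetic data.
BASE = "https://example.org/"


def escape_literal(value: str) -> str:
    """Escape a string so it can be written as an N-Triples literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def csv_to_ntriples(csv_text: str) -> list[str]:
    """Turn each CSV row into N-Triples lines, one triple per non-empty cell."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Assumes patient_id contains only characters that are safe in an IRI.
        subject = f"<{BASE}patient/{row['patient_id']}>"
        for column, value in row.items():
            if column == "patient_id" or not value:
                continue  # skip the key column and empty cells
            predicate = f"<{BASE}vocab/{column}>"
            triples.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return triples


example_csv = "patient_id,sex,body_weight_kg\np1,female,61.5\np2,male,\n"
for line in csv_to_ntriples(example_csv):
    print(line)
```

A real conversion would map each column to the corresponding CARE-SM structure rather than to a flat predicate, but a flat intermediate like this may be enough to get past the chicken-and-egg step and start testing queries.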
@NuriaQueralt For those working on Phenopackets and/or phenotypic data that could be present in case reports, the following datasets, consisting of published case reports translated to Phenopackets, may be useful:
- Peter N Robinson. (2021). Phenopackets for case reports of structural variants (Version 1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.5071267
- Peter N Robinson. (2020). 384 Phenopackets (1.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.3905420
Also, the JSON Schema validator from phenopackets / phenopacket-tools could perhaps be a useful resource to inspire the RDF/SHACL-mapped version. Notably, the folder with the JSON Schema gives an example of how they have mirrored the structure of the authoritative protobuf definitions. The validation rules could probably be translated into SHACL, and the choice of URIs used to reference the definitions could also be useful.
Is there an issue specifically for the Phenopacket work? Perhaps also relevant to @rosazwart ?
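To illustrate the idea of translating JSON Schema validation rules into SHACL, here is a rough sketch that handles only the `required` keyword, emitting one `sh:minCount 1` property constraint per required property. The `ex:` namespace, the class name and the schema fragment are made up for the example and are not copied from phenopacket-tools.

```python
import json

# Hypothetical namespace, for illustration only.
EX = "https://example.org/phenopackets#"


def required_to_shacl(class_name: str, schema: dict) -> str:
    """Emit a SHACL node shape (Turtle) with sh:minCount 1 per required property."""
    lines = [
        "@prefix sh: <http://www.w3.org/ns/shacl#> .",
        f"@prefix ex: <{EX}> .",
        "",
        f"ex:{class_name}Shape a sh:NodeShape ;",
        f"    sh:targetClass ex:{class_name}",
    ]
    # Assumes property names are valid Turtle local names.
    for prop in schema.get("required", []):
        lines[-1] += " ;"
        lines.append(f"    sh:property [ sh:path ex:{prop} ; sh:minCount 1 ]")
    lines[-1] += " ."
    return "\n".join(lines)


# Made-up schema fragment; not the actual Phenopacket JSON Schema.
schema = json.loads('{"type": "object", "required": ["id", "subject"]}')
print(required_to_shacl("Phenopacket", schema))
```

Other keywords (`type`, `enum`, `pattern`, and so on) would need their own mapping to SHACL constraints, and the same approach could target ShEx instead.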
@andrawaag Above are two references to collections of Phenopackets that represent published case reports and could be useful source materials to produce a realistic graph to query using @NuriaQueralt and @rosazwart mapping. The synthetic data that we have been working on in Sweden is a subset of files derived from the Rare Disease Synthetic Dataset available in full from the European Genome-Phenome Archive (EGA) through accession number EGAD00001008392, see example phenopacket, PDF describing the data and the full subset as well as derived files in NBISweden/ejprd-data/ .
The CARE-SM/beaconAPI4CARESM also contains some SPARQL templates that can be used to serve a Beacon endpoint.
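For anyone unfamiliar with the template approach, the general pattern is a parameterised SPARQL string that gets filled in per request. The sketch below shows that pattern with Python's `string.Template`; the `ex:` predicates, the query shape and the function name `build_count_query` are placeholders invented for this example and do not reflect the actual CARE-SM model or the templates in beaconAPI4CARESM.

```python
from string import Template

# Hypothetical query shape: the ex: predicates are placeholders and do not
# reflect the actual CARE-SM model or the beaconAPI4CARESM templates.
COUNT_TEMPLATE = Template("""\
PREFIX ex: <https://example.org/vocab/>
SELECT (COUNT(DISTINCT ?patient) AS ?n)
WHERE {
  ?patient ex:hasDiagnosis <$disease_iri> .
  ?patient ex:hasSex "$sex" .
}
""")


def build_count_query(disease_iri: str, sex: str) -> str:
    """Fill the template, rejecting values that could break out of the query."""
    if any(ch in disease_iri for ch in '<>"{}|^`\\ \n\t'):
        raise ValueError("unsafe character in IRI")
    if any(ch in sex for ch in '"\\\n\r'):
        raise ValueError("unsafe character in literal")
    return COUNT_TEMPLATE.substitute(disease_iri=disease_iri, sex=sex)


print(build_count_query("https://example.org/disease/D1", "female"))
```

Returning only a count, as here, matches the kind of aggregate answer a Beacon-style endpoint typically gives, but the real queries would need to follow the CARE-SM graph patterns.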
Hi @wna-se @mroos
Some of these tasks tag me in this conversation, but it's not clear what you need. As the main maintainer of CARE-SM nowadays, what exactly do you need me to contribute to your use case? (You probably discussed this in a meeting I wasn't involved in.)
ShEx files for schema validation are already included here.
SPARQL queries have always been here. There are two examples, but let me know if you need more cases added. The SPARQL queries from beaconAPI4CARESM are just fragments and hard to reuse at a first attempt, but let me know if you need help with that (I can join a meeting to discuss their implementation outside this API).
I will add exemplar RDF data to the CARE-SM implementation repo. Do you need it for every specific data element? Or a single example representation?
Bests, Pablo
Pasted from e-mail by @NuriaQueralt on 20 June:
Dear all,
I have finished the phenopackets RDF model, in ShEx. You can have a look on GitHub, in branch "v2". I modelled ONLY the elements required for the GDI use case. Rosa, I also modelled the Variant-related elements for the LUMC data, so you can start adapting your RDFization pipeline.
Good news! We have a bunch of phenopackets that follow the current schema version here: https://monarch-initiative.github.io/phenopacket-store/ I suggest using this set for our POC. I may refine the model by adding some RDF examples using these data, so I may make some changes to the model.
My apologies, I cannot make it to today's meeting due to a clash in my agenda.
With kind regards, Núria
- I will add exemplar RDF data to the CARE-SM implementation repo. Do you need it for every specific data element? Or a single example representation?
Ideally, we'd have a couple of nice examples that demonstrate the breadth of the expressions. These will serve as documentation and inspiration for schema and queries. Such examples could be cobbled together from multiple instances of the current JSON data.
Having all the data would also be handy as it would help us verify schema and queries and provide a corpus for tests. Would also be nice for demos.
Define SPARQL queries that can be used to answer the information needs described in the use case flash card and the mindmap linked from #57 by relying on the information models defined in the Virtual Platform Specification (VIPS)[^1] with extensions only where necessary.
List of models from VIPS used (add as necessary):
List of models not in VIPS used (add as necessary):
Queries to implement:
[^1]: See VIPS 2.0, page 17