Define specific queries on specific CDEs modelled by Care-SM

mroos commented 8 months ago

Define SPARQL queries that can be used to answer the information needs described in the use case flash card and the mindmap linked from #57 by relying on the information models defined in the Virtual Platform Specification (VIPS)[^1] with extensions only where necessary.

List of models from VIPS used (add as necessary):

EJP RD meta data model – findability of rare disease resources
Clinical And Registry Entries (CARE) Semantic Model – core data standard describing common data elements essential for RD research

List of models not in VIPS used (add as necessary):

FAIR Genomes metadata schema – semantic metadata schema to power reuse of NGS data

Queries to implement:

[ ] Query 1: …
[ ] Query 2: …

[^1]: See VIPS 2.0, page 17

mroos commented 8 months ago

This issue is to define a specific case that we can perform using a minimal number of data elements, including a minimal number of CDEs, so as to demonstrate the use of Care-SM in queries.

Special request

Can we do a rare disease case and an oncology case in parallel? This would help adoption at the local institutes. Marco will bring this up with Karolis for the LCCO project.

wna-se commented 7 months ago

It would be very useful to have a realistic and compelling query that includes the CARE-SM data elements related to Clinical measurements and Genetic assessment, with a description of the sizes and characteristics of the datasets it would be run on and the expected results.

Perhaps something based on B1MG D4.1 - Secure cross-border data access roadmap - 1v0: 3 example use cases as defined by WG8 and given to WP4 as example use cases

andrawaag commented 5 months ago

During the biohackathon on May 3rd, @wna-se and @andrawaag made this a place to collect query examples to be transformed into SPARQL.

wna-se commented 4 months ago

@markwilkinson Added you here as discussed during today’s meeting. Please link to / add any queries that you can share here and / or to the ejp-rd-vp/DistributedAnalysis repository.

wna-se commented 4 months ago

@NuriaQueralt For those working on Phenopackets and/or phenotypic data that could be present in case reports, the following datasets, consisting of published case reports translated to Phenopackets, may be useful:

Peter N Robinson. (2021). Phenopackets for case reports of structural variants (Version 1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.5071267
Peter N Robinson. (2020). 384 Phenopackets (1.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.3905420

Also, the JSON Schema validator from phenopackets / phenopacket-tools could perhaps be a useful resource to inspire the RDF/SHACL-mapped version. Notably, the folder with the JSON Schema gives an example of how they have mirrored the structure of the authoritative protobuf definitions, the validation rules could probably be translated into SHACL, and the choice of URIs used to reference the definitions could also be useful.

Is there an issue specifically for the Phenopacket work? Perhaps also relevant to @rosazwart ?

wna-se commented 4 months ago

@andrawaag : @mroos said that you would be a great person to take the lead on this task. As we are preparing the synthetic data we have for the VP, it would be great to have some examples of genomics-related queries that we could prioritise mapping to, ideally a few queries based on @mroos’ mindmap (see the reference in the description of #57) and using the CARE-SM and FAIR Genomes semantic schemas.

Edit: @ericprud : I’m also tagging you here as discussed during today’s meeting. It would be very helpful to have some examples.

mroos commented 3 months ago

Update 10/6: @hbcesar, Annika, and @andrawaag are working on example queries. @pabloalarconm was asked to provide example data for the queries (CSV + conversion method).

[ ] @ericprud asks to share the resulting RDF on GitHub (or a repo of choice) for others in this group to use @pabloalarconm

@andrawaag : may need an intermediate step first (chicken-and-egg). We need to find a way to get to the RDF, whereas Wolmar (and colleagues) need help converting from data that works for Beacon.
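
As a starting point for the "CSV + conversion method" step, a minimal sketch in Python (standard library only) could look like the one below. The `http://example.org/` namespace, the column names, and the one-triple-per-cell layout are assumptions made for illustration; they are not CARE-SM terms, and the real model is richer than this flat shape.

```python
import csv
import io

# Hypothetical namespace for illustration only; not a CARE-SM namespace.
BASE = "http://example.org/"


def escape_literal(value):
    """Escape backslashes, quotes and line breaks for an N-Triples literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def csv_to_ntriples(csv_text, id_column="patient_id"):
    """Emit one triple per non-empty cell: <row subject> <column predicate> "value" .

    Assumes identifiers and column names are already safe to use inside an IRI.
    """
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"<{BASE}patient/{row[id_column]}>"
        for column, value in row.items():
            if column == id_column or not value:
                continue  # skip the identifier column and empty cells
            predicate = f"<{BASE}{column}>"
            triples.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return triples


# Made-up example rows (hypothetical column names and values).
example_csv = "patient_id,diagnosis,birth_year\np1,ORPHA:558,1990\np2,,1985\n"
for line in csv_to_ntriples(example_csv):
    print(line)
```

The output is plain N-Triples, so it can be loaded into any triple store and then reshaped towards the target model with SPARQL CONSTRUCT or a dedicated mapping tool.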

wna-se commented 3 months ago

@andrawaag Above are two references to collections of Phenopackets that represent published case reports and could be useful source material for producing a realistic graph to query using the mapping by @NuriaQueralt and @rosazwart. The synthetic data that we have been working on in Sweden is a subset of files derived from the Rare Disease Synthetic Dataset, which is available in full from the European Genome-Phenome Archive (EGA) under accession number EGAD00001008392; see the example phenopacket, the PDF describing the data, and the full subset as well as derived files in NBISweden/ejprd-data/ .
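
To get a feel for what a query over such Phenopackets would need to return, here is a small sketch in Python that lists the observed phenotypic features of one packet. The JSON below is hand-written for illustration and is not taken from any of the datasets above; the field names (`subject`, `phenotypicFeatures`, `type`, `excluded`) follow the Phenopacket Schema v2 JSON layout as I understand it, so please check them against the actual files.

```python
import json

# Hand-written, minimal Phenopacket-like document (illustrative only).
example = """
{
  "id": "example-1",
  "subject": {"id": "patient-1", "sex": "FEMALE"},
  "phenotypicFeatures": [
    {"type": {"id": "HP:0001250", "label": "Seizure"}},
    {"type": {"id": "HP:0001166", "label": "Arachnodactyly"}, "excluded": true}
  ]
}
"""


def observed_features(phenopacket_json):
    """Return (subject id, term id, label) for each feature not marked as excluded."""
    packet = json.loads(phenopacket_json)
    subject_id = packet.get("subject", {}).get("id")
    rows = []
    for feature in packet.get("phenotypicFeatures", []):
        if feature.get("excluded", False):
            continue  # explicitly excluded features are not observations
        term = feature.get("type", {})
        rows.append((subject_id, term.get("id"), term.get("label")))
    return rows


print(observed_features(example))
```

The same rows are what a SPARQL query over the RDF version should yield, which makes this a cheap cross-check for the mapping.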

wna-se commented 3 months ago

The CARE-SM/beaconAPI4CARESM also contains some SPARQL templates that can be used to serve a Beacon endpoint.
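
To illustrate the general idea of a SPARQL template that is filled in from a request parameter, here is a minimal sketch in Python. It is not code from beaconAPI4CARESM: the `ex:` namespace and `ex:hasPhenotype` are hypothetical placeholders, and the actual CARE-SM graph pattern would be considerably longer.

```python
import re
from string import Template

# Hypothetical template: ex: and ex:hasPhenotype are placeholders,
# not terms from CARE-SM or beaconAPI4CARESM.
COUNT_TEMPLATE = Template("""\
PREFIX ex: <http://example.org/>
PREFIX obo: <http://purl.obolibrary.org/obo/>
SELECT (COUNT(DISTINCT ?patient) AS ?n)
WHERE {
  ?patient ex:hasPhenotype obo:$term .
}
""")

# Accept only well-formed HPO identifiers so that user input
# cannot inject arbitrary text into the query.
HPO_ID = re.compile(r"HP:\d{7}")


def build_count_query(hpo_id):
    """Fill the template with an HPO term given as a CURIE such as 'HP:0001250'."""
    if not HPO_ID.fullmatch(hpo_id):
        raise ValueError(f"not an HPO identifier: {hpo_id!r}")
    # OBO PURLs use an underscore where the CURIE has a colon.
    return COUNT_TEMPLATE.substitute(term=hpo_id.replace(":", "_"))


print(build_count_query("HP:0001250"))
```

Validating the parameter before substitution matters more than the templating mechanism itself, since the query text is assembled from external input.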

pabloalarconm commented 3 months ago

Hi @wna-se @mroos

Some of these tasks tag me in this conversation, but it's not clear what you need. As the main maintainer of CARE-SM nowadays, what exactly do you need from me for your use case? (You probably discussed this in a meeting I wasn't involved in.)

ShEx files for schema validation are already included here.
SPARQL queries have always been here. There are two examples, but let me know if you need more cases added. The SPARQL query fragments from beaconAPI4CARESM are just fragments and hard to reuse at a first attempt, but let me know if you need help with that (I can join a meeting to discuss their implementation outside of this API).
I will add exemplar RDF data to the CARE-SM implementation repo. Do you need one for every specific data element, or a single example representation?

Bests, Pablo

wna-se commented 3 months ago

Pasted from e-mail by @NuriaQueralt on 20 June:

Dear all,

I have finished the phenopackets RDF model, in ShEx. You can have a look on GitHub, in branch “v2”. I modelled ONLY the elements required for the GDI use case. Rosa, I also modelled the Variant-related elements for the LUMC data, so you can start adapting your RDFization pipeline.

Good news! We have a bunch of phenopackets that follow the current schema version here: https://monarch-initiative.github.io/phenopacket-store/ I suggest using this set for our POC. I may refine the model by adding some RDF examples using these data, so I may make some changes to the model.

My apologies, I cannot make it to today's meeting due to a clash in my agenda.

With kind regards, Núria

mroos commented 2 months ago

[ ] @markwilkinson among others will have to do this anyway

ericprud commented 2 months ago

> I will add exemplar RDF data to the CARE-SM implementation repo. Do you need one for every specific data element, or a single example representation?

Ideally, we'd have a couple of nice examples that demonstrate the breadth of the expressions. These will serve as documentation and inspiration for schema and queries. Such examples could be cobbled together from multiple instances of the current JSON data.

Having all the data would also be handy as it would help us verify schema and queries and provide a corpus for tests. Would also be nice for demos.

ejp-rd-vp / DistributedAnalysis

Define specific queries on specific CDEs modelled by Care-SM #28