[Pitch]: Providing triples for bioimaging datasets

Short title

Export and share an RDF representation of the Image Data Resource (IDR)

Pitch

The Image Data Resource (IDR) is home to 13 million multi-dimensional image datasets. Each of these is annotated with (a subset of) Gene, Phenotype, Organism/Cell Line, Antibody, siRNA, and Chemical Compound metadata.

Initial work has been done to export this information as RDF using https://pypi.org/project/omero-rdf. The source is the data management system (OMERO), where the metadata is stored in PostgreSQL tables.

However, exporting the largest single study generates 100M triples (a study is the collection of image datasets associated with a single publication). This study, which contains images of tissue from the Human Protein Atlas, has been exported directly using SQL and parallelized scripts.
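As a rough illustration of what a direct export involves, the sketch below turns key-value annotation rows into N-Triples lines. The namespace `https://example.org/idr/`, the `(image_id, key, value)` row shape, and the sample rows are all assumptions made for this example; they are not the URIs or schema used by omero-rdf or the IDR.

```python
from urllib.parse import quote

# Hypothetical namespace, used only for this sketch.
BASE = "https://example.org/idr/"


def escape_literal(value: str) -> str:
    """Escape the characters that N-Triples requires to be escaped in a literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def row_to_triple(image_id: int, key: str, value: str) -> str:
    """Map one (image_id, key, value) row to a single N-Triples line."""
    subject = f"<{BASE}image/{image_id}>"
    # Percent-encode the key so that it is safe inside an IRI.
    predicate = f"<{BASE}annotation/{quote(key, safe='')}>"
    return f'{subject} {predicate} "{escape_literal(value)}" .'


# Illustrative rows, standing in for the result of a SQL query.
rows = [
    (1, "Gene Symbol", "TP53"),
    (1, "Organism", "Homo sapiens"),
    (2, "Comment", 'contains "quotes"'),
]
triples = [row_to_triple(*row) for row in rows]

if __name__ == "__main__":
    print("\n".join(triples))
```

Because each row maps to exactly one line, the rows can be split into chunks and converted independently, which is one way such an export could be parallelized.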

At this hackathon, we would like to:

Review the RDF structure and URIs to prepare them for production
Identify strategies for subsetting the RDF for various use cases
Draft endpoints (SPARQL, Bioschemas) for consuming the subsets and test their scalability
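One possible subsetting strategy, sketched here only as a starting point for discussion, is to stream an N-Triples export and keep only the triples whose predicate is relevant to a given use case. N-Triples puts one triple on each line, so this works without loading all of the triples into memory. The predicate IRIs and sample lines below are hypothetical placeholders.

```python
from typing import Iterable, Iterator, Set

# Hypothetical predicate IRIs, used only for this sketch.
GENE = "<https://example.org/idr/annotation/Gene%20Symbol>"
ORGANISM = "<https://example.org/idr/annotation/Organism>"


def predicate_of(line: str) -> str:
    """Return the predicate term of an N-Triples line, or "" if it is malformed.

    Subjects and predicates never contain whitespace in N-Triples, so
    splitting on the first two whitespace runs is enough to isolate them.
    """
    parts = line.split(None, 2)
    return parts[1] if len(parts) == 3 else ""


def subset_by_predicate(lines: Iterable[str], wanted: Set[str]) -> Iterator[str]:
    """Yield only the triples whose predicate is in `wanted`, one line at a time."""
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        if predicate_of(stripped) in wanted:
            yield stripped


# Illustrative input; in practice this could be an open file handle.
sample = [
    f'<https://example.org/idr/image/1> {GENE} "TP53" .',
    f'<https://example.org/idr/image/1> {ORGANISM} "Homo sapiens" .',
    "# a comment line",
    f'<https://example.org/idr/image/2> {GENE} "BRCA1" .',
]
gene_only = list(subset_by_predicate(sample, {GENE}))

if __name__ == "__main__":
    print("\n".join(gene_only))
```

A filter like this only handles predicate-level subsets. Subsets defined by graph patterns (for example, all images of a given cell line treated with a given compound) would more likely need SPARQL CONSTRUCT queries against a loaded store.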

Expertise needed

Familiarity with SPARQL, RDF, and ingestion/query optimization is required.

Familiarity with bioimaging data in general, or with the IDR and Human Protein Atlas datasets specifically, would be beneficial.

SWAT4HCLS / Biohackathon-SWAT4HCLS-2023