SWAT4HCLS / Biohackathon-SWAT4HCLS-2023


[Pitch]: Providing triples for bioimaging datasets #2

Open joshmoore opened 1 year ago

joshmoore commented 1 year ago

Short title

Export and share an RDF representation of the Image Data Resource (IDR)

Pitch

The Image Data Resource (IDR) is home to 13 million multi-dimensional image datasets. Each of these is annotated with (a subset of) Gene, Phenotype, Organism/Cell Line, Antibody, siRNA, and Chemical Compound metadata.

Initial work has been performed, using https://pypi.org/project/omero-rdf, to export this information as RDF from the data management system (OMERO), where it is stored in PostgreSQL tables.
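To make the idea concrete, here is a minimal sketch of how key-value annotations on one image could be written out as N-Triples lines. It uses only the Python standard library. The `https://example.org/idr/` namespace, the function names, and the dictionary layout are all assumptions made for illustration; they are not the vocabulary, schema, or API of omero-rdf or the IDR.

```python
from __future__ import annotations

from urllib.parse import quote

# Hypothetical namespace, for illustration only; not the IDR's actual IRIs.
BASE = "https://example.org/idr/"


def escape_literal(value: str) -> str:
    """Escape a string so it can be used inside an N-Triples literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def image_to_ntriples(image_id: int, annotations: dict[str, str]) -> list[str]:
    """Return one N-Triples line per key-value annotation on an image."""
    subject = f"<{BASE}image/{image_id}>"
    lines = []
    for key, value in sorted(annotations.items()):
        # Percent-encode the key so that spaces and slashes yield a valid IRI.
        predicate = f"<{BASE}annotation/{quote(key, safe='')}>"
        lines.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return lines


if __name__ == "__main__":
    example = {"Gene Symbol": "TP53", "Organism": "Homo sapiens"}
    for line in image_to_ntriples(1, example):
        print(line)
```

A real export would more likely map keys such as Gene or Organism onto terms from established ontologies rather than mint a predicate per key; the sketch only shows the shape of the output.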

However, the export of the largest single study (a study being the collection of image datasets associated with a single publication) generates 100M triples. This study, which comprises images of tissue from the Human Protein Atlas, has been exported directly using SQL and parallelized scripts.
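One common way to parallelize an export of this size is to split the image ID range into batches and hand each batch to a worker. The sketch below shows that pattern only; it is not the script used for the Human Protein Atlas export. `export_batch` is a hypothetical placeholder that just counts IDs, and the batch size and worker count are arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor


def id_batches(first_id, last_id, batch_size):
    """Yield inclusive (start, end) ID ranges covering first_id..last_id."""
    for start in range(first_id, last_id + 1, batch_size):
        yield start, min(start + batch_size - 1, last_id)


def export_batch(id_range):
    """Hypothetical worker. A real one would query the database for this
    ID range and write the resulting triples to its own output file."""
    start, end = id_range
    return end - start + 1  # number of images covered by this batch


def export_all(first_id, last_id, batch_size, workers=4):
    """Run export_batch over every ID range and return the total image count."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(export_batch, id_batches(first_id, last_id, batch_size)))


if __name__ == "__main__":
    print(list(id_batches(1, 10, 4)))
    print(export_all(1, 10, 4))
```

Writing one file per batch keeps the workers independent, and because N-Triples is line-based the files can be concatenated or bulk-loaded separately afterwards.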

At this hackathon, we would like to:

Expertise needed

Familiarity with SPARQL, RDF, and ingestion/query optimization is required.

Familiarity with bioimaging data in general, or with the IDR and Human Protein Atlas datasets specifically, would be beneficial.

ArghaSarker commented 1 month ago

I like the details of the pitch. Here are my thoughts: