Export and share an RDF representation of the Image Data Resource (IDR)
Pitch
The Image Data Resource (IDR) is home to 13 million multi-dimensional image datasets. Each of these is annotated with (a subset of) Gene, Phenotype, Organism/Cell Line, Antibody, siRNA, and Chemical Compound metadata.
Initial work has been performed to export this information as RDF using omero-rdf (https://pypi.org/project/omero-rdf) from the data management system (OMERO), where it is stored in PostgreSQL tables.
The export of the largest single study (a study being defined as the collection of image datasets associated with a single publication), however, generates 100M triples. This study, which comprises images of tissue from the Human Protein Atlas, has been exported directly using SQL and parallelized scripts.
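To make the direct export route concrete, here is a minimal sketch of turning annotation rows into N-Triples lines using only the Python standard library. It is not the omero-rdf implementation or the actual export scripts: the `BASE` namespace, the `(image_id, key, value)` row shape, and the function names are all assumptions chosen for illustration.

```python
from typing import Iterable, Iterator, Tuple
from urllib.parse import quote

# Hypothetical namespace; the production URI scheme is still to be reviewed.
BASE = "https://example.org/idr/"


def escape_literal(value: str) -> str:
    """Escape a string so it can be written as an N-Triples literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def rows_to_ntriples(rows: Iterable[Tuple[int, str, str]]) -> Iterator[str]:
    """Yield one N-Triples line per (image_id, key, value) annotation row."""
    for image_id, key, value in rows:
        subject = f"<{BASE}image/{image_id}>"
        # Percent-encode the key so names with spaces or slashes stay valid in an IRI.
        predicate = f"<{BASE}annotation/{quote(key, safe='')}>"
        yield f'{subject} {predicate} "{escape_literal(value)}" .'


# Assumed example rows, as they might come back from a SQL query.
rows = [(101, "gene", "TP53"), (101, "Cell Line", "HeLa")]
lines = list(rows_to_ntriples(rows))
```

Because each row maps to exactly one output line, row ranges can be converted independently and the resulting files concatenated, which is one way a parallelized export could be organized.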
At this hackathon, we would like to:
Review the RDF structure and URIs to prepare them for production
Identify strategies for subsetting the RDF for various use cases
Draft endpoints (SPARQL, bioschemas) for the consumption of the subsets and test their scalability
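One possible subsetting strategy, sketched here as a starting point for discussion rather than a proposal, is to stream an N-Triples export and keep only the triples whose predicate is relevant to a given use case. The predicate IRIs and function names below are hypothetical, and the parser assumes one well-formed triple per line.

```python
from typing import Iterable, Iterator, Set


def predicate_of(line: str) -> str:
    """Return the predicate IRI (without angle brackets) of an N-Triples line."""
    # Subject and predicate never contain whitespace, so two splits are enough.
    _subject, predicate, _rest = line.split(None, 2)
    return predicate.strip("<>")


def subset(lines: Iterable[str], keep: Set[str]) -> Iterator[str]:
    """Yield only the triples whose predicate IRI is in `keep`."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if predicate_of(line) in keep:
            yield line


# Hypothetical predicates; a gene-centric use case keeps only gene annotations.
GENE = "https://example.org/idr/annotation/gene"
sample = [
    '<https://example.org/idr/image/101> <https://example.org/idr/annotation/gene> "TP53" .',
    '<https://example.org/idr/image/101> <https://example.org/idr/annotation/organism> "Homo sapiens" .',
    "# a comment line",
]
gene_triples = list(subset(sample, {GENE}))
```

Streaming line by line keeps memory use flat regardless of export size, which matters for a study on the order of 100M triples; a smaller subset could then be loaded into a SPARQL endpoint on its own.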
Expertise needed
Familiarity with SPARQL, RDF, and ingestion/query optimization is required.
Familiarity with bioimaging data in general, or with the IDR and Human Protein Atlas datasets specifically, would be beneficial.
I like the details of the pitch. Here are my thoughts:
Maybe we can narrow down some of the use cases here.
Prepare a general format for learning the representation of, and the relationships between, the annotations (e.g., creating an RO-Crate format where all the information/properties are present).
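As a rough illustration of the RO-Crate idea, the sketch below builds a minimal `ro-crate-metadata.json` that points at an RDF export file. The study name, description, and file name are placeholders, and a complete crate would need further properties (for example a license and publication date) that are omitted here.

```python
import json

# Placeholder values; none of these come from an actual IDR study.
STUDY_NAME = "Example IDR study"
RDF_FILE = "study.nt"

crate = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {
            # Metadata descriptor: identifies this file and the root dataset.
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {
            # Root dataset: the study itself, listing its parts.
            "@id": "./",
            "@type": "Dataset",
            "name": STUDY_NAME,
            "description": "RDF export of the annotations for one study.",
            "hasPart": [{"@id": RDF_FILE}],
        },
        {
            # The RDF export, described as a file in N-Triples format.
            "@id": RDF_FILE,
            "@type": "File",
            "encodingFormat": "application/n-triples",
        },
    ],
}

metadata_json = json.dumps(crate, indent=2)
```

Writing `metadata_json` to a file named `ro-crate-metadata.json` next to the export would give one self-describing package per study or per subset.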