BANDA-connect / NDA-sprint

repo to discuss NDA sprint
Create a query and data visualization tool over the NDAR database exported on AWS #7

Open mgxd opened 8 years ago

mgxd commented 8 years ago

Goal: To develop a graph-based representation of the NDAR database exported on AWS.

Impact: Allow programmatic queries of the database.

Participants: Satrajit Ghosh (MIT) and Nolan Nichols (SRI)

Deliverables: An Amazon Machine Image (AMI) that exposes a SPARQL endpoint through Blazegraph, and a set of initial queries that users can modify graphically.

Future development and maintenance: It will be used by the parent project for its internal data and will be maintained online in a GitHub repo, also containing public versions of assays.
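As a rough illustration of the "programmatic queries" goal, here is a minimal sketch of talking to a SPARQL endpoint with only the Python standard library. It follows the SPARQL 1.1 Protocol (a GET request with a `query` parameter and JSON results). The endpoint URL and the query are assumptions for illustration only; the real address depends on how the AMI ends up being configured.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint URL; the real address depends on how the AMI is configured.
ENDPOINT = "http://localhost:9999/blazegraph/sparql"

# Illustrative query: list ten arbitrary triples.
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"


def build_request(endpoint, query):
    """Build a SPARQL 1.1 Protocol GET request that asks for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def rows_from_results(payload):
    """Flatten a SPARQL JSON results document into a list of plain dicts."""
    return [
        {name: cell["value"] for name, cell in binding.items()}
        for binding in payload["results"]["bindings"]
    ]


def run_query(endpoint, query):
    """Send the query and return the flattened rows (needs a live endpoint)."""
    with urllib.request.urlopen(build_request(endpoint, query)) as response:
        return rows_from_results(json.load(response))
```

With a running endpoint, `run_query(ENDPOINT, QUERY)` would return one dict per result row, keyed by variable name.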

obenshaindw commented 8 years ago

The NDA team are proposing some participants for each issue along with a list of available resources (internal and external), and possible limitations / needs.

NDA Point Person(s)

David Obenshain @obenshaindw

Available Resources

Comments / Limitations / Needs

satra commented 8 years ago

@obenshaindw - we should definitely look at the neo4j instance. the one reason i like blazegraph is that it supports both sparql and tinkerpop. also, rdf gives us the ontology connection to data dictionaries. i think hacking at graph dbs and creating useful queries would be a nice outcome from this sprint, especially if we can couple it with an intuitive interface.

one reason we proposed this, based on initial conversations with dan, was the absence of a programmable API on the NDA frontend. so if this can be merged with a programmable API, that would be great.

our basic idea was:

  1. expose data through miNDAR
  2a. hack miNDAR with d2rq.org or r2rml
  2b. direct table to graph conversion
  3. store the converted data in a graph db
  4. build query and visualization tools around the graph db
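The "direct table to graph conversion" option could look roughly like the sketch below: each row becomes a subject, each column a predicate, and each non-empty cell a literal, written out as N-Triples that a graph db can load. The namespace, table name, and column names (`guid`, `age_months`, `sex`) are all hypothetical and not taken from miNDAR.

```python
import csv
import io
from urllib.parse import quote

BASE = "http://example.org/ndar/"  # hypothetical namespace, not an NDA URI


def escape_literal(text):
    """Escape the characters that N-Triples requires inside a quoted literal."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def rows_to_ntriples(rows, table, key):
    """Yield one N-Triples line per non-empty cell; `key` names the id column."""
    for row in rows:
        subject = f"<{BASE}{quote(table)}/{quote(row[key])}>"
        for column, value in row.items():
            if column == key or value in ("", None):
                continue  # skip the id column itself and empty cells
            predicate = f"<{BASE}{quote(table)}#{quote(column)}>"
            yield f'{subject} {predicate} "{escape_literal(value)}" .'


# Made-up sample table; column names and values are illustrative only.
SAMPLE = "guid,age_months,sex\nNDAR_X1,120,F\nNDAR_X2,,M\n"
triples = list(rows_to_ntriples(csv.DictReader(io.StringIO(SAMPLE)), "demo", "guid"))
for line in triples:
    print(line)
```

This keeps every value as a plain string literal; a mapping layer such as d2rq or r2rml would be the place to add datatypes and links to data dictionary terms.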

in addition, we should consider two separate data types:

  1. the file based data
  2. the numeric/text data directly in the db

danhall100 commented 8 years ago

While I know the emphasis is on experiment relationships, I have to push our rdf-like definition for clinical categories. We hope that rdoc will drive these types of phenotypic constructs - it's a foundation of rdoc - allowing easy integration of phenotype. Attached is a sample file of 1,000 of the 2,000,000 records we have (the guid was scrambled for posting). We'll likely serve these derived fields - along with imaging-derived fields - once our query-by-guid web service comes online.
concept_by_guid.xlsx

nicholsn commented 8 years ago

@danhall100 cool, thanks for the example. It looks like the hierarchy is encoded into a single field.

\\Personal Traits\\Stereotyped, Restricted, and Repetitive Behavior\\Restricted and Repetitive Behavior\\Restricted and Unusual Interests\\Intense Interests\\Excessive Intense Interests

Is this the format that the data will be returned in via the web service? Or could we suggest another representation?
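For reference, a single-field path like the one above can be unpacked on the client side. A minimal sketch, assuming the separator is one or more backslashes and that no level name contains a backslash itself (neither is confirmed for the web service):

```python
def split_concept_path(path):
    """Split a backslash-delimited concept path into its levels, root first."""
    return [part.strip() for part in path.split("\\") if part.strip()]


def add_path(tree, levels):
    """Insert one list of levels into a nested dict, creating nodes as needed."""
    node = tree
    for level in levels:
        node = node.setdefault(level, {})
    return tree


# The example path from this thread, written as a raw string.
PATH = (
    r"\\Personal Traits\\Stereotyped, Restricted, and Repetitive Behavior"
    r"\\Restricted and Repetitive Behavior\\Restricted and Unusual Interests"
    r"\\Intense Interests\\Excessive Intense Interests"
)

levels = split_concept_path(PATH)
tree = add_path({}, levels)
```

Calling `add_path` repeatedly with the same `tree` merges many records into one nested hierarchy.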

danhall100 commented 8 years ago

This is just a query I wrote to show you that we have it available. The hierarchy is a self-referential table, but we also have it in JSON, which I'll post tomorrow. You can visualize it at http://data-archive.nimh.nih.gov by just clicking on the bubbles until you get to the leaf-level rules, which are in a child table. From that we have a procedure to find everything we know per subject/age. We don't yet make it available without intervention, but we can provide it, so just tell us how you would like to see each component (Hierarchy, Rules, Results). We thought we'd make it available in the authenticated query-by-guid web service we are currently working on. Also, we'd love for someone to extend it!!!! Let's discuss tomorrow.
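To make the "self-referential table" idea concrete, here is a small sketch of turning parent-pointer rows into a nested structure. The `(concept_id, parent_id, name)` layout and the ids are assumptions for illustration, not the actual NDA schema; only the concept names come from the example path earlier in this thread.

```python
import json


def build_tree(rows):
    """Nest (concept_id, parent_id, name) rows; parent_id is None for roots."""
    rows = list(rows)
    nodes = {cid: {"name": name, "children": []} for cid, _, name in rows}
    roots = []
    for cid, parent_id, _ in rows:
        if parent_id is None:
            roots.append(nodes[cid])
        else:
            nodes[parent_id]["children"].append(nodes[cid])
    return roots


# Hypothetical ids; names taken from the example path in this thread.
ROWS = [
    (1, None, "Personal Traits"),
    (2, 1, "Stereotyped, Restricted, and Repetitive Behavior"),
    (3, 2, "Restricted and Repetitive Behavior"),
]

forest = build_tree(ROWS)
print(json.dumps(forest, indent=2))
```

The sketch assumes every `parent_id` refers to a row that is present; a missing parent would raise a `KeyError`.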

danhall100 commented 8 years ago

Attached are the Excel tables we use to generate the concepts and rules. We have also created a JSON representation of this, which we are planning to push into Elasticsearch so you could consume it there. We're trying to get the community to extend this. We have rules against 4K elements, which is a lot, but it only touches 2% of our data elements.

For json: The “elementTree” in this document is a dictionary giving the metadata for each concept. The “categoryTree” in this document is a dictionary showing the hierarchy of concepts and the number of individuals associated with each concept.

concept.xlsx concept_query.xlsx concepts-rules.json.txt concepts-rules_prettier.json.txt

nicholsn commented 8 years ago

@danhall100 great, thank you for the examples!