June DCP followup tracking

david4096 commented 6 years ago

This issue is meant to capture the meeting notes and conversation that will take place 9am PT and 12pm ET, July 12, 2018.

https://meet.google.com/qce-kmjd-ugc

Here is a draft agenda, please add links/notes to this thread as you see fit!

DATS related code concepts/use cases 10m
SPARQL queries against DATS 15m
Suggestions for ETL processes (triple stores?) 10m
Other? dbGaP scraping/export techniques 10m
Elasticsearch mappers/analyzers and Team Calcium case study 5m

cmungall commented 6 years ago

The Helium folks have a f2f meeting so we may not be able to make it (the time is 12pm ET).

cc @scox @balhoff @putmantime @deepakunni3

deepakunni3 commented 6 years ago

@david4096 To follow up on Chris' comment, is there a Zoom/Webex URL for the meeting on July 12?

david4096 commented 6 years ago

@cmungall would an hour earlier work better? There's a doodle here: https://doodle.com/poll/4mtaeps9kp2pnqzh

@deepakunni3 We currently have a Hangouts link: https://meet.google.com/qce-kmjd-ugc

Send me your email if you didn't get an invite! :) davidcs [at] ucsc . edu

cmungall commented 6 years ago

our meeting is all day, but I think we can keep the original time and have those of us involved in KC7 step out for this call

david4096 commented 6 years ago

Is there a list of the existing DATS files that are available for indexing tests?

david4096 commented 6 years ago

Kirk: Team Oxygen is centered around serializing dataset metadata, not individual file metadata.

Adrienne (TOPMed): Inconsistency between datasets makes indexing files difficult.

Alejandra: https://github.com/dcppc/crosscut-metadata/tree/master/dats-json-examples ; showed a DATS querying example and will add it here.

Made new JSON-LD contexts.

Nemanja: Working with Charlotte, where they are doing global dataset indexing (a level 1 way of searching). SevenBridges is focused on level 2: using phenotype (and maybe genotype) information to find files. Started working with DATS recently.

Uses a triple store.
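To make the "level 2" idea concrete (using phenotype information to find files in a triple store), here is a minimal in-memory sketch. It is not SevenBridges' implementation and not a real triple store: the predicate names `hasPhenotype` and `hasFile`, the identifiers, and both function names are hypothetical, chosen only for illustration.

```python
def match(triples, subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


def files_for_phenotype(triples, phenotype):
    """Find files attached to any dataset annotated with the given phenotype."""
    # Step 1: datasets carrying the phenotype (hypothetical predicate name).
    datasets = {s for (s, _, _) in match(triples, predicate="hasPhenotype", obj=phenotype)}
    # Step 2: files linked to those datasets (hypothetical predicate name).
    return sorted(o for (s, _, o) in match(triples, predicate="hasFile") if s in datasets)


# Made-up example data, not taken from any real dataset.
TRIPLES = [
    ("dataset:1", "hasPhenotype", "asthma"),
    ("dataset:1", "hasFile", "file:a.vcf"),
    ("dataset:2", "hasPhenotype", "diabetes"),
    ("dataset:2", "hasFile", "file:b.vcf"),
]
```

In a real triple store the same two-step lookup would typically be a single SPARQL query with two triple patterns joined on the dataset variable.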

Philippe Rocca-Serra (9:30 AM, via chat): Also, from our group (Phosphorus), we'd need to know how much description of the file content would need to be added and how much file introspection would be required. Can we have a sense of which key use cases are currently being considered?

Checksum, checksum algorithm, urls, size
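As a rough illustration of the file-level fields listed above (checksum, checksum algorithm, URLs, size), here is a minimal sketch of how such a record could be computed. The function name `describe_file` and the dictionary keys are hypothetical; they are not taken from the DATS schema or from any team's pipeline.

```python
import hashlib
import os


def describe_file(path, urls, algorithm="sha256", chunk_size=1 << 20):
    """Build a minimal file-level metadata record (hypothetical field names)."""
    digest = hashlib.new(algorithm)
    # Read in chunks so large files do not have to fit in memory.
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return {
        "checksum": digest.hexdigest(),
        "checksum_algorithm": algorithm,
        "urls": list(urls),
        "size": os.path.getsize(path),  # size in bytes
    }
```

Recording the algorithm next to the checksum matters because different sources may publish MD5, SHA-1, or SHA-256 digests for the same file.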

Jared: The data model lives in the database, and any files loaded go into the database. Two releases a year. Uses MongoDB.

Adrienne: SQL database that models the phenotype structure, files go into an EAV table since they're not really harmonized. Wanted relational to retain links to original file.
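For readers unfamiliar with the entity-attribute-value (EAV) pattern mentioned above, here is a minimal sketch of unharmonized file rows stored as EAV while keeping a link back to the original file. It is not Adrienne's actual schema: SQLite stands in for the SQL database, and all table, column, and function names are hypothetical.

```python
import sqlite3


def build_eav_store():
    """Create an in-memory database with a source-file table and an EAV table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE source_file (
            id   INTEGER PRIMARY KEY,
            path TEXT NOT NULL
        );
        CREATE TABLE file_value (
            entity         TEXT NOT NULL,
            attribute      TEXT NOT NULL,
            value          TEXT,
            source_file_id INTEGER NOT NULL REFERENCES source_file(id)
        );
        """
    )
    return conn


def load_rows(conn, path, rows, id_column):
    """Store each non-id column of each row as one (entity, attribute, value) record."""
    cur = conn.execute("INSERT INTO source_file (path) VALUES (?)", (path,))
    file_id = cur.lastrowid
    for row in rows:
        entity = row[id_column]
        for attribute, value in row.items():
            if attribute == id_column:
                continue
            conn.execute(
                "INSERT INTO file_value VALUES (?, ?, ?, ?)",
                (entity, attribute, str(value), file_id),
            )
    conn.commit()
    return file_id


def values_for(conn, entity):
    """Return (attribute, value, original file path) tuples for one entity."""
    query = """
        SELECT fv.attribute, fv.value, sf.path
        FROM file_value AS fv
        JOIN source_file AS sf ON sf.id = fv.source_file_id
        WHERE fv.entity = ?
        ORDER BY fv.attribute
    """
    return conn.execute(query, (entity,)).fetchall()
```

The trade-off is that any column from any file can be loaded without a schema change, at the cost of storing every value as text and needing a join to recover provenance.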

Anup: What is a dataset? How do we properly model?

agbeltran commented 6 years ago

DATS examples: https://github.com/dcppc/crosscut-metadata/tree/master/dats-json

The ETL pipeline is also available in the same repo: https://github.com/dcppc/crosscut-metadata

More DATS examples are at: https://github.com/datatagsuite/examples In particular see: https://github.com/datatagsuite/examples/blob/master/BDbag-AGR-example.json

david4096 commented 6 years ago

Slides from Philippe: https://docs.google.com/presentation/d/1PXTg6cpYuMXh9wAtiEdfeVc5kuvIR-yoXKXvluZdEHs/edit#slide=id.p

agbeltran commented 6 years ago

Documentation about DATS can be found here: https://datatagsuite.github.io/docs/html/dats.html
