GenomicsStandardsConsortium / mixs-rdf

Store as RDF versus TSV #10

ramonawalls closed 3 years ago

ramonawalls commented 4 years ago

We need to decide how the canonical form of MIxS will be stored and edited. The two key contenders are RDF (ttl) and TSVs (which would be converted to ttl).

If ttl is native, we could hand-edit the file or use Protege. Initial conversion from the current Excel sheet could be done with ROBOT.

If stored as TSVs, we could use ROBOT to build the ttl file. Tables might be easier for non-ontologists to edit, but may be more error-prone.
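To make the TSV-to-ttl step concrete, here is a minimal sketch of what such a conversion does. It uses only the Python standard library and is not ROBOT or a ROBOT template; the namespace, the column names (`id`, `label`, `definition`) and the sample row are all hypothetical and chosen purely for illustration.

```python
import csv
import io

# Hypothetical namespace; not the real MIxS base IRI.
BASE = "https://example.org/mixs/"


def escape_literal(text):
    """Escape backslashes, double quotes and newlines for a Turtle string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def tsv_to_turtle(tsv_text):
    """Turn rows with 'id', 'label' and 'definition' columns (assumed layout) into Turtle."""
    lines = [
        f"@prefix mixs: <{BASE}> .",
        "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .",
        "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .",
        "",
    ]
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    for row in reader:
        # One subject per row, with a label and a definition.
        lines.append(f'mixs:{row["id"]} rdfs:label "{escape_literal(row["label"])}" ;')
        lines.append(f'    skos:definition "{escape_literal(row["definition"])}" .')
    return "\n".join(lines) + "\n"


# Made-up example row, not an actual MIxS term.
SAMPLE_TSV = "id\tlabel\tdefinition\nEX_0000001\tdepth\tDepth of the sample.\n"

if __name__ == "__main__":
    print(tsv_to_turtle(SAMPLE_TSV))
```

A real pipeline would more likely rely on ROBOT or a similar tool, which also handles IRIs, datatypes and ontology metadata that this sketch ignores.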

An advantage of RDF is that it would be easier to build in structure for, e.g., terms that have controlled vocabularies (CVs) or external ontologies as expected values.

In addition to the main MIxS vocabulary, we need to store some internally maintained CVs.

Storing as RDF might make validation easier, as we could use SHACL.

jdeck88 commented 4 years ago

My vote is for TSVs: this will be far easier for non-ontologists (i.e., 99.9999% of the population) to interact with. We can also validate the TSVs prior to ttl conversion.
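As a rough illustration of validating a TSV before any ttl conversion, the sketch below checks for required columns, empty or duplicate ids, empty labels and unknown value types. The column names and the allowed value types are assumptions made up for this example, not the actual MIxS table layout.

```python
import csv
import io

# Hypothetical required columns and value types; not the real MIxS layout.
REQUIRED_COLUMNS = ("id", "label", "value_type")
ALLOWED_VALUE_TYPES = {"string", "number", "controlled_vocabulary"}


def validate_tsv(tsv_text):
    """Return a list of problems found; an empty list means the table passed."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        return [f"missing column(s): {', '.join(missing)}"]

    problems = []
    seen_ids = set()
    for row in reader:
        line_no = reader.line_num  # line of the row just read (header is line 1)
        # Short rows are padded with None by DictReader, hence the `or ""`.
        term_id = (row["id"] or "").strip()
        label = (row["label"] or "").strip()
        value_type = (row["value_type"] or "").strip()

        if not term_id:
            problems.append(f"line {line_no}: empty id")
        elif term_id in seen_ids:
            problems.append(f"line {line_no}: duplicate id {term_id}")
        else:
            seen_ids.add(term_id)
        if not label:
            problems.append(f"line {line_no}: empty label")
        if value_type not in ALLOWED_VALUE_TYPES:
            problems.append(f"line {line_no}: unknown value_type {value_type!r}")
    return problems


if __name__ == "__main__":
    bad = "id\tlabel\tvalue_type\nEX_1\tdepth\tnumber\nEX_1\t\tfloat\n"
    for problem in validate_tsv(bad):
        print(problem)
```

A check like this could run in continuous integration on every commit, so that reviewers only ever see tables that already pass the basic rules.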

jjkoehorst commented 4 years ago

Indeed, another vote for TSV. Internally we also use Excel to register samples and metadata, which are then converted to RDF using internal Java-based parsers according to defined ontologies. We cannot expect people to read, write, and learn new software when Excel is often complex enough.

only1chunts commented 4 years ago

@jdeck88 talks sense (at least on this occasion ;-) ). Being able to hand the file to any of the GSC collaborators, committee members, etc. to check/validate prior to committing is a very useful feature of TSV files. I'm sure I could wade through a ttl file and work it out, but for preference a simple TSV makes life easier. In short, my vote is also for TSV.

cmungall commented 4 years ago

I agree that most people would prefer to interact with TSVs or spreadsheets. But we can export to TSV, web pages, forms, etc. And note, as @ramonawalls says, submissions could come in via TSV and be converted.

In GSC, how many people actually do direct editing and manipulation on the profiles? How is this done at the moment? Do people email around Excel files, work on Google Sheets...?

How do people do things such as align fields across the profiles right now?

Perhaps if it is a smaller group doing the editing and a larger group commenting, then managing in RDF/OWL (e.g. editing with Protege) would make sense? There could be a Travis job that compiles into various other formats for browsing/read-only consumption.

The advantage of editing in a semantic format is increased expressivity. I assume the plan is to encode the different constraints and value sets in a more structured, computable way than is currently being done. What is the proposed TSV format for this?

cmungall commented 4 years ago

Of course, ANYTHING is better than Excel files.

pbuttigieg commented 4 years ago

Agree with @cmungall - TSVs can be exported from an RDF artifact, but aren't especially helpful for machine readability / interoperability (that's what RDF is made for, if used well).

pbuttigieg commented 4 years ago

The TSV --> RDF path also means that we may lose or have to re-develop the native validation and ID management that comes with many RDF/OWL editors. Not insurmountable, but a possible issue.

ramonawalls commented 3 years ago

We will most likely use LinkML as our canonical storage.