edamontology / edam-sandbox

Miscellaneous files for test purposes

Issue with format file extensions #19

Open joncison opened 4 years ago

joncison commented 4 years ago

From https://github.com/edamontology/edamontology/issues/421:

joncison commented 4 years ago

@matuskalas - a small detail - do we give e.g. ".txt", "txt", or both? (Probably both?)

joncison commented 4 years ago

@albangaignard, for my first foray into SPARQL I'm tackling this query, which addresses (from above):

But I notice that the pattern for the file_extension property currently allows | (pipe) as a delimiter between multiple values, e.g. yaml|yml.

While this is compact and looks nice, it complicates the semantics and downstream use: file_extension currently means "A string in which one or more commonly used file extensions for a data format are delimited by pipe character(s)." rather than simply "A commonly used file extension for a data format."
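To illustrate the difference between the two meanings, here is a minimal Python sketch that expands a pipe-delimited value into one entry per extension. The `split_extensions` helper and the sample data are hypothetical and not taken from EDAM or the notebook.

```python
def split_extensions(value):
    """Split a pipe-delimited file_extension value into single extensions."""
    # Drop surrounding whitespace and ignore empty pieces such as "yaml||yml".
    return [part.strip() for part in value.split("|") if part.strip()]


# Hypothetical sample: format name -> current pipe-delimited value.
formats = {"YAML": "yaml|yml", "JSON": "json"}

# One (format, extension) pair per extension, as the simpler definition implies.
one_per_value = [
    (name, ext)
    for name, value in formats.items()
    for ext in split_extensions(value)
]
print(one_per_value)
# [('YAML', 'yaml'), ('YAML', 'yml'), ('JSON', 'json')]
```

With one extension per value, downstream code no longer needs to know about the delimiter at all.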

@matuskalas, I think the right course is to refactor EDAM so that each file_extension value holds exactly one extension. In that case the query becomes:

Thoughts please!

cc @hmenager @veitveit

joncison commented 4 years ago

PS. @albangaignard my hunch is that most or all the checks will require some Python programming, so your suggestion to use Jupyter notebooks is a very good one!

joncison commented 4 years ago

UPDATE

I just finished the query, having decided that only lowercase alphanumeric characters are allowed in EDAM Format file extensions. cc @matuskalas @veitveit
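The rule itself can be sketched in a few lines of Python. This is not the code from the notebook: the pattern `[a-z0-9]+`, the name `is_valid_extension`, and the candidate list are assumptions chosen to illustrate "lowercase alphanumeric only".

```python
import re

# Assumed reading of the rule: one or more ASCII lowercase letters or digits.
EXTENSION_PATTERN = re.compile(r"[a-z0-9]+")


def is_valid_extension(ext):
    """Return True if the whole string consists of lowercase alphanumerics."""
    return EXTENSION_PATTERN.fullmatch(ext) is not None


# Hypothetical candidates, including the cases raised earlier in the thread.
candidates = ["yaml", "yml", ".txt", "TXT", "fastq", "yaml|yml", ""]
invalid = [c for c in candidates if not is_valid_extension(c)]
print(invalid)
# ['.txt', 'TXT', 'yaml|yml', '']
```

Under this reading, a leading dot (".txt"), uppercase letters, and pipe-delimited values are all rejected, so the rule also settles the ".txt" versus "txt" question in favour of "txt".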

This is my first foray into Python and SPARQL, so if you have time, @albangaignard, @hmenager or @hansioan, I'd much appreciate some feedback on the quality of the code, which is included here (from this Jupyter notebook).

joncison commented 4 years ago

I just added a check that each format with a file extension also defines a label or exact synonym matching that extension; see this notebook.

cc @albangaignard @hmenager
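A check of this kind might look like the following Python sketch. It is not the notebook's code: "matches" is assumed here to mean case-insensitive equality, and the `has_matching_name` helper and the records are hypothetical.

```python
def has_matching_name(extension, label, exact_synonyms):
    """Return True if the label or any exact synonym equals the extension.

    Assumption: the comparison ignores case and surrounding whitespace.
    """
    names = [label, *exact_synonyms]
    return any(name.strip().lower() == extension.lower() for name in names)


# Hypothetical records standing in for EDAM Format concepts.
records = [
    {"label": "Example format A", "exact_synonyms": ["ABC"], "extensions": ["abc"]},
    {"label": "Example format B", "exact_synonyms": [], "extensions": ["fq"]},
]

# Collect every (label, extension) pair with no matching label or synonym.
missing = [
    (record["label"], ext)
    for record in records
    for ext in record["extensions"]
    if not has_matching_name(ext, record["label"], record["exact_synonyms"])
]
print(missing)
# [('Example format B', 'fq')]
```

If the intended notion of matching is stricter (for example, exact case), only the comparison inside `has_matching_name` needs to change.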