catalystneuro / format-support-table

A summary of format support in the NWB ecosystem
https://catalystneuro.github.io/format-support-table/

Add .json container of table data #3

Closed CodyCBakerPhD closed 1 year ago

CodyCBakerPhD commented 1 year ago

@garrettmflynn Here is the first example of that JSON reduction of the table data for Ecephys

CodyCBakerPhD commented 1 year ago

Python code for parsing the exported temporary (and simplified) .tsv

import json
from pathlib import Path

from pandas import read_table
from pandas.io.json import build_table_schema

# Local path to the simplified .tsv exported from the spreadsheet
table_path = Path("C:/Users/Raven/Downloads/Ecosystem Format Support v3 - Simplified - Ecephys.tsv")

# read_table defaults to tab-separated input
table = read_table(filepath_or_buffer=table_path)

# Transposing makes each row a top-level entry keyed by its index;
# keeping only the values yields a list with one dictionary per row
json_serialization = table.T.to_json()
json_table = list(json.loads(json_serialization).values())

# Write the row records next to the source file
json_path = table_path.parent / (table_path.stem + ".json")
with open(file=json_path, mode="w") as io:
    io.write(json.dumps(json_table, indent=4))

# Table Schema description of the columns, as inferred by pandas
json_schema = build_table_schema(data=table)

json_schema_path = table_path.parent / (table_path.stem + "_schema.json")
with open(file=json_schema_path, mode="w") as io:
    io.write(json.dumps(json_schema, indent=4))

CodyCBakerPhD commented 1 year ago

Minor note: the 'schema' is in the Table Schema format, not the usual JSON form format; it doesn't seem to capture that the version column can be either 'string' or 'null'

CodyCBakerPhD commented 1 year ago

Once we get the ecephys version of this polished to both our liking, then we can quickly handle the other modalities in follow-ups

garrettmflynn commented 1 year ago

If we want to keep this indexed format, I'd suggest changing the encapsulating object into an array—though I imagine this may be incompatible with the table schema format:

[
    {
        "Format": "AlphaOmega",
        "Versions": null,
        "Suffix(es)": ".mpx",
        "Example Data": true,
        "Neo - Raw IO": true,
        "Neo - Tests": true,
        "SpikeInterface - Extractor": true,
        "SpikeInterface - Tests": true,
        "NeuroConv - Interface": true,
        "NeuroConv - Tests": true,
        "NWB GUIDE": false
    }
]

Otherwise for further simplification of the parsing, I was thinking we could further simplify the structure and unnecessary data specification:

{
    "AlphaOmega": {
        "Suffix(es)": ".mpx",
        "Example Data": true,
        "Neo": {
            "RawIO": true,
            "Tests": true
        },
        "SpikeInterface": {
            "Extractor": true,
            "Tests": true
        },
        "NeuroConv": {
            "Interface": true,
            "Tests": true
        }
    }
}

Where null or false values are implicit and final headers can be aggregated based on all the keys that are used in the structure.
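
For illustration, here is a minimal sketch of how a parser could expand that compact form back into explicit rows. It assumes one level of nesting, joins nested keys with " - " to mirror the flat column names, and fills every unused key with null (the compact form cannot distinguish null from false, so that choice is an assumption). The `ExampleFormat` entry and the function names are made up for the example.

```python
import json

# Compact input in the nested form proposed above.
nested = {
    "AlphaOmega": {
        "Suffix(es)": ".mpx",
        "Example Data": True,
        "Neo": {"RawIO": True, "Tests": True},
        "SpikeInterface": {"Extractor": True, "Tests": True},
        "NeuroConv": {"Interface": True, "Tests": True},
    },
    # Made-up second entry, only here to show how missing keys are filled.
    "ExampleFormat": {"Suffix(es)": ".xyz", "NWB GUIDE": True},
}


def flatten_entry(entry):
    """Turn {"Neo": {"Tests": True}} into {"Neo - Tests": True}."""
    flat = {}
    for key, value in entry.items():
        if isinstance(value, dict):
            for sub_key, sub_value in value.items():
                flat[f"{key} - {sub_key}"] = sub_value
        else:
            flat[key] = value
    return flat


def expand(nested_table):
    """Return one explicit record per format, all sharing the same headers."""
    flat_rows = {name: flatten_entry(entry) for name, entry in nested_table.items()}

    # Aggregate the headers from every key used anywhere, in first-seen order.
    headers = []
    for row in flat_rows.values():
        for key in row:
            if key not in headers:
                headers.append(key)

    # Keys a format never mentions become None (null in JSON).
    return [
        {"Format": name, **{header: row.get(header) for header in headers}}
        for name, row in flat_rows.items()
    ]


records = expand(nested)
print(json.dumps(records, indent=4))
```

Note that the nested example spells the key "RawIO", so the expanded header is "Neo - RawIO" rather than the "Neo - Raw IO" used in the flat version.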

You mentioned being maximally explicit. Is this why you're steering clear of a system like this?

CodyCBakerPhD commented 1 year ago

> If we want to keep this indexed format, I'd suggest changing the encapsulating object into an array—though I imagine this may be incompatible with the table schema format:

Looking more into it again: https://www.bluefeathergroup.com/docs/accordion-tables/the-json-table-structure/example-simple-table/

It seems there's a fair amount of freedom in how to JSON-ify the table. What you see here is merely the direct output of pandas convenience functionality; I could coerce it into the more rigorous form with a bit of extra work

> You mentioned being maximally explicit. Is this why you're steering clear of a system like this?

Partly - the real reason is that I'm following PEP 20 principles

https://peps.python.org/pep-0020/

Explicit is the number one goal, and flatness also comes in later - sometimes nesting is better when things need to be validated against or with nearby associated values (mostly thinking pydantic there), but IDK about this case since we're not really doing much complicated validation

CodyCBakerPhD commented 1 year ago

> Otherwise for further simplification of the parsing, I was thinking we could further simplify the structure and unnecessary data specification:

I figured that would only work if the schema specifies that a given column's values are optional (and to fill with null if missing)

Can you try parsing this current .json in a splinter branch and see whether your parsers work under the current form, and how they render the output? (Not sure how you'd generate/share the demo web page, since GitHub Pages probably only renders from main, not dev branches.)

garrettmflynn commented 1 year ago

I'll make a fork to share the updates with you

garrettmflynn commented 1 year ago

> I figured that would only work if the schema specifies that a given column's values are optional (and to fill with null if missing)

This is an entirely separate system from the GUIDE, so I technically don't need the schema at all and can simply make reasonable assumptions. So whether we'd like to keep the schema / actually use it for the generation is up to you.

CodyCBakerPhD commented 1 year ago

@garrettmflynn I'm referring to the generated table schema: https://github.com/catalystneuro/format-support-table/pull/3/files#diff-f8d9ebc30eeb55fde7a40523b2cb34bf94517bbca0a37f381f57e12fccd258c2R1, not anything in the GUIDE

Only if it's useful for your side, which I wasn't sure it would be. Also because I figured there would be some area where I could attach column descriptions, but it doesn't look like it
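
As a side note for anyone revisiting this: as far as I know, the Table Schema specification itself allows optional `description` and `constraints` properties on each field; the pandas helper just doesn't populate them. A rough sketch of adding them by hand after generation follows. The trimmed schema dictionary only mimics the shape of the pandas output, and the description texts, `column_notes`, and `annotate` are hypothetical names invented for the example.

```python
import json

# Trimmed stand-in for the dictionary returned by build_table_schema;
# the field list here is illustrative, not the real generated file.
schema = {
    "fields": [
        {"name": "Format", "type": "string"},
        {"name": "Versions", "type": "string"},
        {"name": "Example Data", "type": "boolean"},
    ],
}

# Hypothetical per-column notes; the wording is made up.
column_notes = {
    "Format": {
        "description": "Name of the data format.",
        "constraints": {"required": True},
    },
    "Versions": {
        "description": "Supported versions, if known.",
        "constraints": {"required": False},  # marks the column as optional
    },
}


def annotate(schema, notes):
    """Return a copy of the schema with extra properties merged into matching fields."""
    annotated_fields = [
        {**field, **notes.get(field["name"], {})} for field in schema["fields"]
    ]
    return {**schema, "fields": annotated_fields}


annotated_schema = annotate(schema, column_notes)
print(json.dumps(annotated_schema, indent=4))
```

Fields without an entry in `column_notes` pass through unchanged, so the notes can be filled in gradually.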

garrettmflynn commented 1 year ago

Sure. Yeah I just meant none of this code actually relies on it—and it looks like it won't really need to.

We can always bring it back later. This is very predictable for the moment.

garrettmflynn commented 1 year ago

Here's that fork serving from my updated add_json_table_ecephys branch: http://garrettflynn.com/format-support-table/

CodyCBakerPhD commented 1 year ago

@garrettmflynn That looks like it's parsing pretty easily then

I updated the file on this branch to use arrays instead of dictionaries as suggested

Also went ahead and removed the schema file

I think for the future URL links it might be easier or make more sense to have a separate table of the same size where every cell has an optional URL. Then you treat that like a mask of the main table and you can figure a way to add hyperlinks around the elements (or w/e other way is easier)
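
A minimal sketch of that mask idea, assuming both tables are arrays of records with identical keys in the same row order. The URL is a placeholder and `overlay_links` is a hypothetical helper, not anything in the repository.

```python
# Main table, in the array-of-records form used on this branch.
table = [
    {"Format": "AlphaOmega", "Example Data": True, "NeuroConv - Interface": True},
]

# Same-shaped table of optional URLs; None means "no link for this cell".
# The URL below is a placeholder.
links = [
    {
        "Format": None,
        "Example Data": "https://example.org/alphaomega-data",
        "NeuroConv - Interface": None,
    },
]


def overlay_links(table, links):
    """Pair every cell with its optional URL, checking that the shapes match."""
    if len(table) != len(links):
        raise ValueError("Link table must have the same number of rows.")

    merged = []
    for row, link_row in zip(table, links):
        if row.keys() != link_row.keys():
            raise ValueError("Link table must have the same columns in every row.")
        merged.append({column: (value, link_row[column]) for column, value in row.items()})
    return merged


cells = overlay_links(table, links)
print(cells[0]["Example Data"])
```

The explicit shape checks are the cost of this design: the two files have to be kept in sync by hand, and a mismatch can only be caught at load time.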

garrettmflynn commented 1 year ago

I'd just say an entry can either be a value OR an object with a "value" key and any other metadata—which we can handle however we like.

Is that fine?
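
To make the proposal concrete, here is a small sketch of how a reader could treat both shapes uniformly. `normalize_entry` and the `"url"` metadata key are hypothetical, chosen only for the example.

```python
def normalize_entry(entry):
    """Split a table cell into (value, metadata).

    A cell is either a bare value or an object carrying a "value" key
    plus any other metadata.
    """
    if isinstance(entry, dict):
        if "value" not in entry:
            raise ValueError('Object entries must contain a "value" key.')
        metadata = {key: item for key, item in entry.items() if key != "value"}
        return entry["value"], metadata
    return entry, {}


# A row mixing both shapes; the URL is a placeholder.
row = {
    "Format": "AlphaOmega",
    "Example Data": {"value": True, "url": "https://example.org/alphaomega-data"},
}

normalized = {column: normalize_entry(entry) for column, entry in row.items()}
print(normalized)
```

Because the metadata travels with the value, there is no second table whose shape has to stay aligned with the first.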

CodyCBakerPhD commented 1 year ago

> I'd just say an entry can either be a value OR an object with a "value" key and any other metadata—which we can handle however we like.

I'll play with that in a follow-up. Yes, attaching the link to the value itself would avoid shape/index mismatches, so that sounds like a good idea

garrettmflynn commented 1 year ago

Sweet, I've updated my fork to read from the array-based JSON.

The only tweak I'd suggest is converting "Suffix(es)" and "Versions" to their singular form, since each is now explicitly associated with a single format and so isn't really a collection.

garrettmflynn commented 1 year ago

Actually, I just realized we are allowing multiple suffixes and versions. There might just be multiple rows per format anyway. Never mind...

CodyCBakerPhD commented 1 year ago

@garrettmflynn Anything else for this PR? If not we can merge and then I'll generate similar .json for the other two tables