ENH: Array data in `.tsv` cells

sappelhoff commented 1 year ago

Potentially to be dealt with:

[x] https://github.com/bids-standard/bids-specification/pull/1605

Discussion originated in this thread:

https://github.com/bids-standard/bids-examples/pull/324#issuecomment-1423276788

Currently, cells in BIDS .tsv files may take on numeric or string values (see tabluar data).

Here I list some examples:

a string n/a, reflecting a missing value or that a value is not applicable for the present cell
a string mystr, that is further specified in the column_name.Levels object in a .json file that accompanies the .tsv file, where column_name refers to the name of the column in the .tsv file under which the string mystr occurs
floating point numbers, reflecting time since acquisition for the onset column in events.tsv
- often associated with a unit, defined in column_name.Units, analogously to the mystr example above

However, for some cases, it would be desirable to define an array of values in a particular .tsv cell. For example when a given event (in a row) in an events.tsv file is associated with multiple channels from a neural recording (e.g., EEG, iEEG, NIRS, ...):

This event at onset X and with duration Y is associated with channels A, B, and C

Currently, there is no formal way to specify such an array in BIDS. We think that it could be helpful for several data modalities (now and in the future) to have such a concept to work with.

In https://github.com/bids-standard/bids-examples/pull/324#issuecomment-1462816253, @effigies suggests that we use JSON arrays for this case (see his comment for advantages over more basic structures, like a comma-separated string).

An array is an ordered collection of values. An array begins with [ (left bracket) and ends with ] (right bracket). Values are separated by , (comma).

grafik

I think if we were to adopt JSON arrays as a valid input for .tsv cells, we would would benefit by (at least initially) restricting the kinds of values that may be listed in a given JSON array. Values are defined as such:

A value can be a string in double quotes, or a number, or true or false or null, or an object or an array. These structures can be nested.

For our present purposes, I think it would suffice to confine ourselves to:

string
number
boolean (true/false)

What are your opinions? Do you see other use cases apart from annotations in ephys data (in events.tsv files)? Please comment and advise.

VisLab commented 1 year ago

I always like to see what I'm in for downstream, so I did a couple of simple tests of the alternatives on Python (pandas DataFrame) and Matlab (Table), which I believe will be the two most common downstream data structures to hold tabular data for users.

After view the results (below and attached files), I would prefer option 3 of just having channels be [x , y, z]. It seems that the tabular input for standard tools sees the [ ] and immediately thinks this is a string.

Here is my code and the results. The three test files are attached. All of the Matlab results have outer { } because they all come back as a string in a cell array of size 1 x 1.

table1 = readtable('test1_events.tsv', 'Delimiter', 'tab', 'FileType', 'text');
table2 = readtable('test2_events.tsv', 'Delimiter', 'tab', 'FileType', 'text');
table3 = readtable('test3_events.tsv', 'Delimiter', 'tab', 'FileType', 'text');

Python:

table1 = pd.read_csv('test1_events.tsv', delimiter='\t')
table2 = pd.read_csv('test2_events.tsv', delimiter='\t')
table3 = pd.read_csv('test3_events.tsv', delimiter='\t')

Table	Row	Input	Python	Matlab
1	0	["P1", "P2", "P3"]	'["P1", "P2", "P3"]'	{'["P1", "P2", "P3"]'}
1	1	n/a	nan	{'n/a'}
1	2	["CP1"]	'["CP1"]'	{'["CP1"]'}
1	3	[1, 2, 3]	'[1, 2, 3]'	{'[1, 2, 3]'}
1	4	[1, "CP1", 2]	'[1, "CP1", 2]'	{'[1, "CP1", 2]'}
2	0	'[P1, P2, P3]'	'\'[P1, P2, P3]\''	{''[P1, P2, P3]''}
2	1	'n/a'	nan	{'n/a'}
2	2	'[CP1]'	'\'[CP1]\''	{''[CP1]''}
2	3	'[1, 2, 3]'	'\'[1, 2, 3]\''	{''[1, 2, 3]''}
2	4	'[1, CP1, 2]'	'\'[1, CP1, 2]\''	{''[1, CP1, 2]''}
3	0	[P1, P2, P3]	'[P1, P2, P3]'	{'[P1, P2, P3]'}
3	1	n/a	nan	{'n/a'}
3	2	[CP1]	'[CP1]'	{'[CP1]'}
3	3	[1, 2, 3]	'[1, 2, 3]'	{'[1, 2, 3]'}
3	4	[1, CP1, 2]	'[1, CP1, 2]'	{'[1, CP1, 2]'}

test_event_files.zip

@christinerogers @arnodelorme

christinerogers commented 1 year ago

Thanks @sappelhoff and @VisLab . My EEGNet colleague Tyler @andesha Collins will weigh below, and I'll re-iterate part of our concern with embedding json in a tsv --:

It adds challenges for researchers trying to BIDS-ify their data, since it's harder to read and format in a .tsv cell, more prone to formatting errors, and can't be validated as easily (afaik) - a recipe for frustration and attrition in data sharing.

I understand the reasons why embedded json can be preferred, and at the same time it would be nice to also aim for reducing barriers in onboarding many new research groups to BIDS (as my group does with large open-source projects like LORIS and data-sharing platforms like EEGNet).

p.s. @sappelhoff per my original question in the BIDS Maintainers meeting March 9 - We can't be the only corner of the BIDS world looking at this, can we? who else should be consulted in from other specs?

Remi-Gau commented 1 year ago

Note to self: check how octave + bids-matlab will deal with this.

effigies commented 1 year ago

Some notes on Kay's comment:

I would suggest that we don't allow heterogeneous arrays, and instead say the type is "array of strings". Encoded in JSON it would be:

onset       duration    trial_type  channels
3.452       n/a         blink       ["P1", "P2", "P3"]
6.112       0           show        n/a
6.82        0.2         response    ["CP1"]
7.345       n/a         start       ["1", "2", "3"]
8.543       5.0         mixed       ["1", "CP1", "2"]
9.112       0           show        []

(I've added another row to show an empty list.)

There are two approaches with Pandas (note that the reprs don't make clear the string versus object; if "P1" shows up, the whole thing is a string; if P1 shows up it is in a list of strings):

Convert on load

```Python >>> import pandas as pd >>> import json >>> import math >>> def loader(val): ... if val == 'n/a': ... return None ... return json.loads(val) ... >>> table = pd.read_csv('test4_events.tsv', delimiter='\t', converters={'channels': loader}) >>> table onset duration trial_type channels 0 3.452 NaN blink [P1, P2, P3] 1 6.112 0.0 show None 2 6.820 0.2 response [CP1] 3 7.345 NaN start [1, 2, 3] 4 8.543 5.0 mixed [1, CP1, 2] 5 9.112 0.0 mixed [] ``` 1, 2 and 3 are strings here.

Load then convert

```Python >>> import pandas as pd >>> import json >>> def from_default(val): ... if isinstance(val, str): ... return json.loads(val) ... return val ... >>> table = pd.read_csv('test4_events.tsv', delimiter='\t') >>> table onset duration trial_type channels 0 3.452 NaN blink ["P1", "P2", "P3"] 1 6.112 0.0 show NaN 2 6.820 0.2 response ["CP1"] 3 7.345 NaN start ["1", "2", "3"] 4 8.543 5.0 mixed ["1", "CP1", "2"] 5 9.112 0.0 mixed [] >>> table.channels = table.channels.apply(from_default) >>> table onset duration trial_type channels 0 3.452 NaN blink [P1, P2, P3] 1 6.112 0.0 show NaN 2 6.820 0.2 response [CP1] 3 7.345 NaN start [1, 2, 3] 4 8.543 5.0 mixed [1, CP1, 2] 5 9.112 0.0 mixed [] ``` 1, 2 and 3 are also strings here.

The other thing we considered was a simple comma-separated string, so it would be

onset       duration    trial_type  channels
3.452       n/a         blink       P1,P2,P3
6.112       0           show        n/a
6.82        0.2         response    CP1
7.345       n/a         start       1,2,3
8.543       5.0         mixed       1,CP1,2
9.112       0           mixed       n/a

This would be easy enough to deal with simply by using ','.split(val) in our converter. Note, that there is no distinction between the integer 1 and the string "1". If channels for some reason had numeric names, pandas would by default convert 1 to an integer, but 1,2 to a string.

When suggesting JSON-encoded strings, I assumed that these would not be hand-edited but generated from spreadsheets or annotation tools. If this will be tooling mediated, the JSON will be better structured and the validator could distinguish numbers from strings and give errors. If people are going to be writing these by hand, then I think the comma-separated strings will be the easiest.

robertoostenveld commented 1 year ago

Just some wild thoughts:

Besides it being JSON, what would be the advantage of using the [ and ] brackets around the list? And why use " quotes around each item by default? The 2nd example in the previous comment is simpler and also looks fine to me (assuming that the horizontal spaces correspond to tabs, which they technically don't in this case). I realize that I am touching here upon human- versus machine-readability.

Possible answers to myself:

The [] brackets facilitate parsing nested lists like [1, 2, 3, [4.1, 4.2]], which would be impossible without them.

The " quotes allow specifying characters that otherwise would be parsed incorrectly, like ["{", "(", "["], or strings containing a tab. This would not be specific to the lists, it would also apply if I now wanted to represent an item with a tab in it, where the tab needs to be escaped somehow to prevent it from being interpreted as column separator.

Why would we use a different separator inside the list (comma) than outside the list (tab); that could also both be a tab. Yup, I realize this makes parsing harder, both for human and machine. But would that mean that if we have a nested list, we would use yet another separator (e.g., ;) at the 2nd level?

I like the idea what the downstream consequences are when people use software to process the data. Some softwares that I suggest to consider (besides native Python, Python+Pandas, MATLAB ) are R, SPSS, Excel, ...

Andesha commented 1 year ago

I hadn't personally considered nested lists as part of my suggestion based on the simple List format...

@robertoostenveld Is that something that we would need to support? :thinking:

jadesjardins commented 1 year ago

The comma separated list inside of brackets seems to be the most simple (and safest) representation that fits the range of use cases that I am aware of.. from what I can tell, things like adding the quotations around items inside the brackets would only allow the users to do things like have commas in channels names for example (which I think I am okay with restricting).

effigies commented 1 year ago

Talking with @sappelhoff, one of the problems with comma-separated channels is that in practice channels might have commas in their names. I can see two approaches that would allow custom delimiters that are only incrementally more complex than comma-separated strings.

Syntax	Example	With channels containing commas
Comma-separation	`P1,P2,P3`	Not allowed
Wrapped delimiters	`,P1,P2,P3,`	`\\|P1,2\\|P2\\|P3\\|`
Prefixed delimiter	`,:P1,P2,P3`	`\\|:P1,2\\|P2\\|P3`

These could be implemented:

def wrapped_delimiters(val):
    if isinstance(val, str):
        delimiter = val[0]
        if not delimiter.isalnum() and val[-1] == delim:
            return val[1:-1].split(delimiter)
    return val

def prefixed_delimiters(val):
    if isinstance(val, str) and len(val) > 2:
        if val[1] == ":":
            delimiter = val[0]
            return val[2:].split(delimiter)
    return val

Unless there's some globally safe delimiter (or we're willing to force people to mass-rename channels in their data), A dynamic delimiter or escape characters are likely the only options. One additional step of complexity is an overridable default:

Syntax	Example	With channels containing commas
Overridable delimiter	`P1,P2,P3`	`\\|P1,2\\|P2\\|P3\\|` or `\\|:P1,2\\|P2\\|P3`

robertoostenveld commented 1 year ago

I hadn't personally considered nested lists as part of my suggestion based on the simple List format...

@robertoostenveld Is that something that we would need to support? 🤔

No, I do not see the use case for it at the moment. But then, I also did not see the use case in the past for lists within one cell of the TSV table.

Nevertheless, "Simple is better than complex", so I think that we should not plan for the future, but do what makes sense now. Personally I like the comma-separated list and if I need it I would simply avoid comma's in channel names (I do that anyway, I only use [A-Za-z0-9], - and _). And the overridable default (e.g., with |) makes that even better.

sappelhoff commented 1 year ago

Thanks for the suggestion, Chris. I like the idea of delimited lists where the delimiter is wrapped and may be freely selected from a list of defined delimiters (like ,|;.+- or so). I am not sure I like the idea of just having comma-separated lists as the default and only optionally allowing for "overridable delimiters via wrapping (or prefixing)", because this seems like an unneeded branching to me: It's not a hassle to do ,P1,P2, instead of P1,P2.

VisLab commented 1 year ago

I'm glad this discussion is progressing towards a resolution. I don't see a use case for nested lists at this time and don't think we should address this until there is a need for it.

Either the wrapped delimiter solution or the overridable delimiter solution is fine.

dorahermes commented 1 year ago

This is great, thank you, either the wrapped delimiter solution or the overridable delimiter solution should work for our use cases.

arnodelorme commented 1 year ago

Personally, I prefer the simple format of separated values with commas. P1,P2,P3 Otherwise, it will be confusing for non-programmers.

BIDS is supposed to be readable by users. Numerical array can be 1,2,3 and if you want string "1","2","3"

effigies commented 1 year ago

Okay, if you need to distinguish numbers and strings, then that makes sense, but you're now two characters away from a JSON string, which I thought we were trying to avoid. Adding in quoted strings makes this a parsing problem and syntax errors will be possible.

Andesha commented 1 year ago

I think instead of distinguishing between an int and a string type within a List, we can leave that up to how we describe the items inside of the List in the sidecar files. This comes later in my mind, but this way people can point the correct (or safe) parser at a List of an expected type.

ericearl commented 1 year ago

I take comfort in @robertoostenveld 's comment (and the majority here of array data CSV cells in TSV files supporters):

Nevertheless, "Simple is better than complex", so I think that we should not plan for the future, but do what makes sense now. Personally I like the comma-separated list and if I need it I would simply avoid comma's in channel names (I do that anyway, I only use [A-Za-z0-9], - and _). And the overridable default (e.g., with |) makes that even better.

I also like a comma-separated list's simplicity for "array data" inside of a TSV cell (which is trying to work around the spirit of a TSV already) and the acknowledgment that anything you might put inside a CSV array inside an otherwise TSV file should be absent of commas. Names can use a delimiter that is not a comma, such as /, |, ;, :, ., +, or -.

@effigies Do you have a sense of how many datasets or folks would be affected by the need "to mass-rename channels in their data"? Are channels curated into these TSV files in such a way that they cannot be renamed easily upon curation?

VisLab commented 1 year ago

In EEG I have only seen [A-Za-z0-9], - and _. I don't know about eye tracking and motion capture.

sappelhoff commented 1 year ago

@effigies Do you have a sense of how many datasets or folks would be affected by the need "to mass-rename channels in their data"? Are channels curated into these TSV files in such a way that they cannot be renamed easily upon curation?

It's not common to have non-alphanumeric characters in channel names (except maybe a space character ... not sure about the statistics here and what's "common"). However it does happen -- I have seen it raised in issues on ephys software, and for example, BrainProducts explicitly mention in their spec, how commas are supposed to be escaped:

; Commas in channel names are coded as "\1".

(taken from the BrainVision data format spec)

And given that the suggestion with wrapped delimiters is in my opinion a robust and easy-to-understand solution, I see no reason why we should force people to rename their channels.

I think instead of distinguishing between an int and a string type within a List, we can leave that up to how we describe the items inside of the List in the sidecar files.

I currently don't see a need for "mixed" arrays (str and int/float), so I agree with the suggestion to define it in the JSON, which data type is supposed to be expected.

However having that said, if the need ever does arise, then a JSON array probably would be the best solution for us.

sappelhoff commented 1 year ago

Another proposal could be to define the delimiter of a .tsv cell in the accompanying .json file.

See this TSV file:

onset	duration	my_channels1	my_channels2
5	1	P1,P\|2,R 23,XX12°	Op,1\|Op,2

and this accompanying JSON file:

{
   "my_channels1":{
      "LongName":"My Channels 1",
      "Description":"The first batch of channels associated with a given event.",
      "Delimiter":","
   },
   "my_channels2":{
      "LongName":"My Channels 2",
      "Description":"The second batch of channels associated with a given event.",
      "Delimiter":"|"
   }
}

We would basically extend the list of "defined JSON keywords to describe TSV columns" (current doc on this: https://bids-specification.readthedocs.io/en/latest/common-principles.html#tabular-files) by an additional keyword: Delimiter.

If Delimiter is defined, the respective column in the TSV file is assumed to be a cell array and must be split according to the delimiter that is defined as the string value to the Delimiter key.

In most cases, we will have {"Delimiter": ","} ... but we have a way to define other delimiters if needed.

EDIT: Or we think in this general direction and make this a key like so:

{
    "CellArray": {
        "Delimiter": ","
        "DataType": "int"
    }
}

and so on ... I propose this just as a new direction for our thoughts, and because I am worried that a simple "comma separated list" will not be a good idea to capture the heterogeneity that we have with data.

effigies commented 1 year ago

@sappelhoff I get where you're coming from with explicit delimiter metadata, but I think putting parsing instructions into JSON sidecars would be extremely error-prone. I can't think of an existing case where the data file cannot be fully parsed without reference to the sidecar, and TSV files in general have not had mandatory sidecars. I suspect you'll end up with a lot of tools just assuming comma-separated values.

Not to cut the conversation short, but does it make sense to put forward some questions for a vote? I have a sense that there's at least a camp that's pretty settled on comma-separated-unquoted-strings, and if that camp is larger than those trying to find an acceptable dynamic delimiter, we may just be wasting time coming up with possibilities.

sappelhoff commented 1 year ago

Thanks for your perspective @effigies I agree. I proposed this because I am just really unhappy with the plain comma-separated list, as I the wrapped delimiter solution is so much more versatile at a minuscule increase in "complexity".

I think a vote is a good idea.

ericearl commented 1 year ago

Just to clarify what is being voted upon, it sounds like it should either be "Comma-separation" or "Wrapped delimiters" that are used to create an array of values inside any TSV cell. A brief summary based on discussion in this issue is below. I think it is important to note that any way you want to use delimiters for an array in a TSV cell, it will always be interpreted as a string which you would have to parse in your software.

Comma-separation

Use commas as the delimiter to separate values in an array.

A common example: P1,P2,P3 Simple, but effective (most of the time).
The obvious edge case: P1.2,P2,P3 Causes a need to change your channel names if any contain commas.

Wrapped delimiters

If the first and last character are non-alphanumeric and identical, then that character is identified as the delimiter and used to split the string. That would permit the user to select whatever they find most readable, given their channel names.

A common example: |P1,2|P2|P3| Pipes like these could be used as the delimiter which would allow characters like commas to still be in there.
Another possible example: ,P1,P2,P3, Commas could still be used, but just so any downstream software using this "array in TSV cell" format knows, you prefix and suffix your comma-separated lists with the comma delimiter.

@sappelhoff @effigies Did I miss anything else important?

dorahermes commented 1 year ago

Would it perhaps be possible to clarify 2 things before voting:

1) If this is a change not just about lists of channels, but a general change to BIDS .tsv files having any kind of list/array data, it seems like the corresponding sidecar 'should/must/?' have some information on the fact that there is a list/array rather than a string in this .tsv column. The 'delimiter' field in the json sidecar would indicate this. Would there be a possibility for a 'Wrapped delimiters' defaulting to comma separated? If the curator wants the text string to be parsable as a list, they 'should/must/?' add the delimiter field in the sidecar.

2) In case we want to go simple now: a comma separated list, but would like to keep options for the future, to me ,P1,P2,P3, is the extremely similar to P1,P2,P3. Which of these would be the most backwards compatible solution?

effigies commented 1 year ago

If this is a change not just about lists of channels [...]

At the moment this is just about lists of channels. I think this would end up as instructions that will show up in a column description table, not something to go in "Common Principles". The most this would do is create a precedent, making it more likely that the next column to have an array type (if there ever is one) will have whatever format we declare here, but it wouldn't be so strong that we couldn't make a different choice.

In case we want to go simple now: a comma separated list, but would like to keep options for the future, to me ,P1,P2,P3, is the extremely similar to P1,P2,P3. Which of these would be the most backwards compatible solution?

I think we're talking 2-3 lines of (Python) code difference if we picked one vs the other and then later wanted to add dynamic delimiting. P1,P2,P3 is less constraining on what a dynamic delimiter string would look like, while ,P1,P2,P3, means you could support dynamic delimiters today because you're pretty sure what they would look like if added.

For what it's worth, here are the questions I wrote up (but didn't post) last time:

1) Dynamic delimiters: Acceptable (grudging acceptance counts) or unacceptable? 1) If accepted: 1) Should the delimiter be declared in the value (e.g., ,A,B,) or in a JSON sidecar? 1) Should there be a default delimiter if a "declaration" is absent? 1) If rejected: 1) Should we declare a transformation for channel names containing commas? 1) If so, do we need to come up with alternatives, or is \1 fine?

Would getting votes on these be sufficiently clarifying?

christinerogers commented 1 year ago

@sappelhoff your original suggestion would be to circulate this to the bids community (off github?) -- Seems like we're ready? cc @effigies you have 7 upvotes so far!

Thanks also @dorahermes above for circling back to visibility and generalization across all BIDS (hence this spin-off thread) -- Given other parts of the BIDSverse will encounter this issue (as I mentioned at BIDS-maintainers last month) there's a big difference between a precedent buried in a spec vs. pointed to visibly from a central place (just as a precedent/option) vs. making it a could/should (e.g. in common principles). Can we take the middle path?

@effigies @sappelhoff let us know if you need help/ideas for increasing visibility or linkages across modalities etc.

sappelhoff commented 1 year ago

your original suggestion would be to circulate this to the bids community (off github?)

I have posted this to the bids-discussion mailing list. Feel free to spread the word further, to parties you think would be interested and/or able to constructively participate in the discussion.

Overall I am interested in making this new "feature" as future- and cross-modality/use-case proof as possible, and not have it as an "ephys-only quirk".

VisLab commented 1 year ago

Dynamic delimiters: Acceptable (grudging acceptance counts) or unacceptable?

If accepted:

Should the delimiter be declared in the value (e.g., ,A,B,) or in a JSON sidecar?

Should there be a default delimiter if a "declaration" is absent?

If rejected:

Should we declare a static transform?

If so, do we need to come up with alternatives, or is \1 fine?

Here is my vote:

I'm good with 1) dynamic delimiters. Would greatly appreciate there being a default. I think that overriding the default should be in the JSON file. The bracketed delimiters are hard to read. 99.999% of our current use cases will be covered by 1) with the default of comma.

Now to show my ignorance... @sappelhoff: on 3) what is a static transform?

effigies commented 1 year ago

what is a static transform?

Sorry, I didn't write that clearly. I meant some way to replace commas to avoid ambiguity, such as with \1 as mentioned in https://github.com/bids-standard/bids-specification/issues/1446#issuecomment-1485270030.

Edit: Updated it to read "Should we declare a transformation for channel names containing commas?"

dorahermes commented 1 year ago

I vote 1) with @VisLab since it covers most of our use cases as well.

Only question I have left is whether the channel column always contains a list or whether this should be specified in the .json sidecar?

ericearl commented 1 year ago

I think I agree basically the same as @VisLab and @dorahermes.

Dynamic delimiters are acceptable.
The "array in a cell" delimiter should be declared in the JSON sidecar.
The default delimiter should be a comma when the delimiter declaration in a JSON sidecar is absent.

Andesha commented 1 year ago

Agree based on comments from @VisLab and @dorahermes

christinerogers commented 1 year ago

:+1: with @vislab and @dorahermes and @Andesha 's votes and @ericearl's helpful clarification

@arnodelorme - for EEGNet this seems to be the most workable, a :+1 from you as well? Does this break any assumptions in your view?

cc @Andesha @jeffersoncasimir @smakeig

CPernet commented 1 year ago

Using something like ,P1,P2,P3, seems the easiest for humans and machines alike, so +1 from me. Option to specify the delimiter in JSON why not -- now of course it means no nesting, but if no one sees/has a case of nesting, let's not worry about it.

arokem commented 1 year ago

Hello! The whole idea of adding arrays to a cell of a tsv file seems a bit unsavory to me, because it would takes what is otherwise a tidy table (in this sense) and make it a bit more messy, as is apparent in this discussion.

Another solution that comes to my mind is to distribute the array data across multiple rows with all other column values keeping exactly the same value. I believe that this comes with no loss of information, and retains the tidy format of the table. And IIUC, requires less obtrusive changes to the spec (if at all?).

For example, the table in this comment would become:

onset       duration    trial_type  channels
3.452       n/a         blink       P1
3.452       n/a         blink       P2
3.452       n/a         blink       P3
6.112       0           show        n/a
6.82        0.2         response    CP1
7.345       n/a         start       1
7.345       n/a         start       2
7.345       n/a         start       3
8.543       5.0         mixed       1
8.543       5.0         mixed       CP1
8.543       5.0         mixed       2
9.112       0           mixed       n/a

For disambiguation, we could add a column that explicitly highlights that these rows together form an array (a Boolean) and even the index within the array (i.e., integer values: 0, 1, 2). Does this preclude any use-cases that this proposal is designed to meet?

Andesha commented 1 year ago

In my opinion the strategy of breaking up the events across multiple lines has the problem of how to group them back together. For example, the table requires post processing to recognise the first three events are the "same" just on different channels. Further, it does not account for the situation of: "this event is an artifact that is present across these three channels" versus "there are three different artifacts separate across three channels at the same time". This makes the previously mentioned grouping challenging.

Lastly, I agree that this breaks the "tidy" assumptions. I feel it is necessary though. Pandas does have functions for manipulating data like this through explode.

smakeig commented 1 year ago

In the HED Working Group, we are considering how to explicitly enter lines in the events.tsv files for non-onset markers into event processes - for example, markers of the onsets of syllables in a spoken word event. HED will refer to these as 'Inset' event markers. Explicit 'Offset' event markers (also occupying a line in the .tsv file) are also being supported. Here, a new (optional) column in the .tsv file, 'Marker_type' can be used - in this case now-standard event onset entries in the .tsv file should be referred to as being an event 'Onset' marker_type. This advance in event annotation raises the same issue of how to group event markers to refer to the same event (process). This requires another (optional) column in the .tsv file titled 'Event_index'. Currently, we are thinking to allow the user to use their own event indexing scheme (including the obvious possibility, monotonically increasing event number). I have no opinion about the proposal to allow interpretation of vector entries. For small (short) vectors, thsi doesn't seem a problem - for long vectors, however, ...

Scott Makeig

On Thu, Apr 13, 2023 at 10:22 AM Tyler Collins @.***> wrote:

In my opinion the strategy of breaking up the events across multiple lines has the problem of how to group them back together. For example, the table requires post processing to recognise the first three events are the "same" just on different channels. Further, it does not account for the situation of: "this event is an artifact that is present across these three channels" versus "there are three different artifacts separate across three channels at the same time". This makes the previously mentioned grouping challenging.

Lastly, I agree that this breaks the "tidy" assumptions. I feel it is necessary though. Pandas does have functions for manipulating data like this through explode https://urldefense.com/v3/__https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.explode.html__;!!Mih3wA!F0_1OIYog4xEqCW4lRtXmU6IhPp8s5eLK5fhi7KAbj-UOCQlrfa1aRuEm-6GNvBgdAqF1cJps1zKaXuU9FJqgOae$ .

— Reply to this email directly, view it on GitHub https://urldefense.com/v3/__https://github.com/bids-standard/bids-specification/issues/1446*issuecomment-1507061592__;Iw!!Mih3wA!F0_1OIYog4xEqCW4lRtXmU6IhPp8s5eLK5fhi7KAbj-UOCQlrfa1aRuEm-6GNvBgdAqF1cJps1zKaXuU9JDtJUZc$, or unsubscribe https://urldefense.com/v3/__https://github.com/notifications/unsubscribe-auth/AKN2SFTPPM37ERMQWIXV2DLXBAD3XANCNFSM6AAAAAAV5D2DSA__;!!Mih3wA!F0_1OIYog4xEqCW4lRtXmU6IhPp8s5eLK5fhi7KAbj-UOCQlrfa1aRuEm-6GNvBgdAqF1cJps1zKaXuU9DEGcUK1$ . You are receiving this because you were mentioned.Message ID: @.***>

-- Scott Makeig, Research Scientist and Director, Swartz Center for Computational Neuroscience, Institute for Neural Computation, University of California San Diego, La Jolla CA 92093-0559, http://sccn.ucsd.edu/~scott

arokem commented 1 year ago

Thanks for the comments @Andesha. Regarding this one:

Pandas does have functions for manipulating data like this through explode.

I think that this could work both ways. On the one hand, it means that a vector cell could be interpreted provided software that knows how to interpret it (but the complexity of that is apparent in the above discussion). On the other hand, it also suggests that (almost?) anything that can be represented using vectors could be represented in a tidy table, provided informative columns. For example, event identifier columns, as suggested by @smakeig in his comment.

smakeig commented 1 year ago

Yes, Kay Robbins has demonstrated using events.tsv columns to reveal the statistical design of the experiment. For example in a simple 2X2 design ({loud, soft}X{beep,boop}) specifying the nature of the event (onset) using two columns (Loudness, Pitch), abetted by HED definitions of the (2x2) levels of these two variables in an events.json file (typically at the top level of the BIDS file hierarchy) allows HED tools to compile a full HED annotation for the events in the dataset automatically. I suppose it might be possible to allow vector column entries along with (already supported) multicolumn entries? (the key there being that the definitions of the columns and their values are given in the .json file). -Scott

On Thu, Apr 13, 2023 at 7:22 PM Ariel Rokem @.***> wrote:

Thanks for the comments @Andesha https://urldefense.com/v3/__https://github.com/Andesha__;!!Mih3wA!EqHIYCHvmOSoL6IEwSLEgt0DGEx8Ka3bAXamy51zHOBMWI8C67cf75l69XbvnBDpkNdaj0pChhEcB5ReEUStLqMy$. Regarding this one:

Pandas does have functions for manipulating data like this through explode https://urldefense.com/v3/__https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.explode.html__;!!Mih3wA!EqHIYCHvmOSoL6IEwSLEgt0DGEx8Ka3bAXamy51zHOBMWI8C67cf75l69XbvnBDpkNdaj0pChhEcB5ReEfzLhmoN$ .

I think that this could work both ways. On the one hand, it means that a vector cell could be interpreted provided software that knows how to interpret it (but the complexity of that is apparent in the above discussion). On the other hand, it also suggests that (almost?) anything that can be represented using vectors could be represented in a tidy table, provided informative columns. For example, event identifier columns, as suggested by @smakeig https://urldefense.com/v3/__https://github.com/smakeig__;!!Mih3wA!EqHIYCHvmOSoL6IEwSLEgt0DGEx8Ka3bAXamy51zHOBMWI8C67cf75l69XbvnBDpkNdaj0pChhEcB5ReEZ3qR8V8$ in his comment.

— Reply to this email directly, view it on GitHub https://urldefense.com/v3/__https://github.com/bids-standard/bids-specification/issues/1446*issuecomment-1507723742__;Iw!!Mih3wA!EqHIYCHvmOSoL6IEwSLEgt0DGEx8Ka3bAXamy51zHOBMWI8C67cf75l69XbvnBDpkNdaj0pChhEcB5ReERO9A-0W$, or unsubscribe https://urldefense.com/v3/__https://github.com/notifications/unsubscribe-auth/AKN2SFQQMI5PGKJG3YYRJNDXBCDD5ANCNFSM6AAAAAAV5D2DSA__;!!Mih3wA!EqHIYCHvmOSoL6IEwSLEgt0DGEx8Ka3bAXamy51zHOBMWI8C67cf75l69XbvnBDpkNdaj0pChhEcB5ReEW8_z4bp$ . You are receiving this because you were mentioned.Message ID: @.***>

-- Scott Makeig, Research Scientist and Director, Swartz Center for Computational Neuroscience, Institute for Neural Computation, University of California San Diego, La Jolla CA 92093-0559, http://sccn.ucsd.edu/~scott

robertoostenveld commented 1 year ago

Another solution that comes to my mind is to distribute the array data across multiple rows with all other column values keeping exactly the same value

For me that would work.

christinerogers commented 1 year ago

per @andesha's response to @arokem - This is going to get a little messy no matter which way we slice it.

For data platforms (like LORIS and EEGNet.org), distributing an event over multiple rows is problematic (normalization etc) and also leads to more user typos, and zeroing out repeated values is tricky too.
@robertoostenveld for this reason it seems less workable from our perspective.

-- @effigies @sappelhoff Is there any precedent in BIDS for either following tidyverse or nesting a list within a column ?

or is it time for a wider recap/survey?

p.s. @smakeig agreed this might grow further, and to support HED tagging at a basic level, I think we'd shoot for the simplest approach needed currently here.

effigies commented 1 year ago

Is there any precedent in BIDS for either following tidyverse or nesting a list within a column ?

I'm not aware of any precedent.

sappelhoff commented 1 year ago

Is there any precedent in BIDS for either following tidyverse or nesting a list within a column ?

I'm not aware of any precedent in BIDS either.

The tidy suggestion (maybe together with an index column that "binds rows together" (is this what you mean by "event identifier columns" above?)) sounds like a reasonable alternative to the wrapped delimiter suggestion to me.

(A combination of both would even easily allow us a one-level nesting 😉, but I'm not advocating for that)

I'm not entirely sure (yet) how this constitutes a problem for loris, eegnet, or HED. And would love to see an example of the potential problems.

robertoostenveld commented 1 year ago

Is there any precedent in BIDS for either following tidyverse or nesting a list within a column ?

As HED was brought up, is that itself not a precedent? See https://github.com/hed-standard/hed-examples/blob/main/datasets/eeg_ds003645s_hed_column/sub-002/eeg/sub-002_task-FacePerception_run-1_events.tsv where the HED column contains a comma-separated list.

VisLab commented 1 year ago

Yes, there is the precedent of the HED column in the event file. Each entry in the HED column is treated as a string as far as BIDS is concerned. The hed-javascript validator verifies the actual contents in detail as part of bids-validation.

From an annotator's perspective, putting each channel on a separate line in the event file is a nightmare, especially when you want to annotate different features that start at the same time, many channels are included, and the feature needs to be identified with markers a starting time and an ending time (which will be the predominant use case for which this question arose) and you don't know what the ending point is when you identify the starting point.

One possibility (not necessarily ideal) is to pass this off to the HED Working Group to deal with. From the BIDS perspective, the list column entry would be treated as a string. Users would need to provide HED annotation in an events.json file (or as part of the HED column in the events.tsv file) using the Collection tag. The string would need to conform to the syntax HWG imposes on the values for the Collection tag (which haven't been specified yet). @Scott Makeig @.***> @christinerogers @dorahermes.

On Sat, Apr 15, 2023 at 3:52 AM Robert Oostenveld @.***> wrote:

Is there any precedent in BIDS for either following tidyverse or nesting a list within a column ?

As HED https://bids-specification.readthedocs.io/en/stable/appendices/hed.html was brought up, is that itself not a precedent? See https://github.com/hed-standard/hed-examples/blob/main/datasets/eeg_ds003645s_hed_column/sub-002/eeg/sub-002_task-FacePerception_run-1_events.tsv where the HED column contains a comma-separated list.

— Reply to this email directly, view it on GitHub https://github.com/bids-standard/bids-specification/issues/1446#issuecomment-1509652009, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAJCJOU3KBOTSBJ2X4JYY6LXBJOWDANCNFSM6AAAAAAV5D2DSA . You are receiving this because you were mentioned.Message ID: @.***>

christinerogers commented 1 year ago

Agreed @robertoostenveld @VisLab - consistency with what has already been accepted by BIDS makes sense vs adding tidyverse compliance into the mix.

@arokem thoughts? Cc @Andesha for eegnet

dorahermes commented 1 year ago

While it would be nice to keep the .tsv files tidy, I am not sure whether the one line version is always the easiest to human read. Putting each channel on a separate line in the event file is extremely cumbersome as in the iEEG case seizures can start on a few channels and quickly spread to many channels, which would add many lines for 1 event in the _events.tsv. This is the particular use-case that we have in mind now.

HED columns containing a comma-separated list may be a precedent?

christinerogers commented 1 year ago

Thanks @dorahermes -

@effigies @sappelhoff there was discussion in the BIDS maintainers meeting that we had enough consensus to move forward and note this elsewhere in the BIDS ecosystem. Any thoughts?

CPernet commented 1 year ago

note that 'we' (the steering group) have nothing against the proposed solution - just when we discussed it, @arokem came up with repeated line solution, and we just wanted to see if that would work - avoiding introducing new things. Since there is a precedent, and @dorahermes makes a compelling argument that it will be cumbersome -- I vote (again) for the simplest option as in e.g. ,P1,P2,P3,

arokem commented 1 year ago

Agreed. Based on the responses here, I think that the original line of thought seems to be prevailing. I would much rather have folks who are actually using this kind of data weigh in on what is practical for them.

christinerogers commented 1 year ago

Thanks @CPernet and @arokem - good to know this was discussed at steering even

I vote (again) for the simplest option as in e.g. ,P1,P2,P3,

ok i.e. HED's P1, P2, P3 sans leading delimiter

bids-standard / bids-specification

ENH: Array data in `.tsv` cells #1446

Comma-separation

Wrapped delimiters