bids-standard / bids-specification

Brain Imaging Data Structure (BIDS) Specification
https://bids-specification.readthedocs.io/
Creative Commons Attribution 4.0 International

Within-stimuli conditions #153

Open adelavega opened 5 years ago

adelavega commented 5 years ago

stim_file columns in event files allow users to specify which stimuli files are associated with an event onset:

> stim_file | OPTIONAL. Represents the location of the stimulus file (image, video, sound etc.) presented at the given onset time. ...

However, what this does not allow for is the specification of sub-conditions that occur during a long-running stimulus.

For example, in ds001545 a video file is presented which spans the entirety of the run. However, within each run/video there are 6 distinct conditions.

For example:

```
onset  duration  trial_type         stim_file
6      90        Intact A           cond1_run-01.mp4
105    90        Scramble Fix C     cond1_run-01.mp4
204    90        Scramble Rnd B V1  cond1_run-01.mp4
303    90        Scramble Fix C     cond1_run-01.mp4
402    90        Intact A           cond1_run-01.mp4
501    90        Scramble Rnd B V2  cond1_run-01.mp4
```

IMO, the above example is invalid, as the stim_file only has a single onset. The following is an events file which has all the necessary information (note that I'm having to guess the onset of the stim_file; it could actually be 0).

```
onset  duration  trial_type         stim_file
6      540       n/a                cond1_run-01.mp4
6      90        Intact A           n/a
105    90        Scramble Fix C     n/a
204    90        Scramble Rnd B V1  n/a
303    90        Scramble Fix C     n/a
402    90        Intact A           n/a
501    90        Scramble Rnd B V2  n/a
```

However, this is ambiguous, as the conditions are only implied to occur during stimulus presentation by the duration of the first row.

@tyarkoni suggests adding optional but strongly encouraged stim_onset and stim_offset columns. These would denote onsets within a stimulus.

yarikoptic commented 5 years ago

I would have made it

```
onset  duration  trial_type         stim_file
6      540       Movie starts       cond1_run-01.mp4
6      90        Intact A           cond1_run-01.mp4
105    90        Scramble Fix C     cond1_run-01.mp4
204    90        Scramble Rnd B V1  cond1_run-01.mp4
303    90        Scramble Fix C     cond1_run-01.mp4
402    90        Intact A           cond1_run-01.mp4
501    90        Scramble Rnd B V2  cond1_run-01.mp4
```

stim_onset/stim_offset - I guess they could be added, but they would carry redundant information, which could be computed (and validated to not go beyond the stimulus duration) from the "Movie starts" row for that stimulus and the corresponding onset and duration. And we all know what happens when there is redundancy ;)
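That computation can be sketched concretely. Below is a minimal Python sketch (not part of the spec; the "Movie starts" marker and column layout are taken from the example above): subtract the super-event's onset from each event's onset to recover within-stimulus timing.

```python
# Rows of (onset, duration, trial_type, stim_file) from the example events.tsv above.
events = [
    (6.0, 540.0, "Movie starts", "cond1_run-01.mp4"),
    (6.0, 90.0, "Intact A", "cond1_run-01.mp4"),
    (105.0, 90.0, "Scramble Fix C", "cond1_run-01.mp4"),
    (501.0, 90.0, "Scramble Rnd B V2", "cond1_run-01.mp4"),
]

def within_stimulus_onsets(events, stim_file, marker="Movie starts"):
    """Compute each event's onset relative to the stimulus start,
    using the 'super' event row that marks when the stimulus begins."""
    stim_onset = next(o for o, d, t, f in events
                      if f == stim_file and t == marker)
    return [(t, o - stim_onset) for o, d, t, f in events
            if f == stim_file and t != marker]

print(within_stimulus_onsets(events, "cond1_run-01.mp4"))
```

A validator could additionally check that each relative onset plus duration stays within the stimulus file's own length.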

As for the hierarchical description of events -- isn't there https://bids-specification.readthedocs.io/en/latest/99-appendices/03-hed.html ? (never used it myself though)

tyarkoni commented 5 years ago

I'm not crazy about either of the solutions proposed above because, while both are compliant with the current spec, neither one eliminates the fundamental ambiguity here, which is that you don't know which part of the clip is being presented. It's also somewhat problematic from a BIDS-StatsModel standpoint, because it will force almost all users to drop a Filter transformation into their model just to weed out the first row, since nobody is going to want that in their model.

The benefit of having optional stim_onset and stim_offset columns is that they would eliminate the ambiguity in question without making most model specifications more complex. What I don't like about this proposal is that the extra columns are essentially metadata: there's virtually no situation under which they would be treated like other non-mandatory columns (i.e., as containing design-relevant information).

The more I think about this, the more I lean towards maybe keeping the current approach and not codifying this at all in the _events.tsv files. Maybe the solution is to require a supplementary metadata file for the stimulus files that contains the onsets. I.e., cond1_run-01.mp4 would have to have a cond1_run-01.json file that has fields PresentationOnset and PresentationOffset. But even that isn't sufficient, because presentation onset/offset can vary not just by stimulus, but also by event...
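For illustration, such a sidecar might look like the following. PresentationOnset and PresentationOffset are only field names proposed in this comment, not part of the current spec, and the values are made up:

```json
{
  "PresentationOnset": 240.0,
  "PresentationOffset": 480.0
}
```

This would say the presented segment started 240 s into the clip and ended at 480 s, independent of the event timing in the *_events.tsv.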

Should we just say this is in the 20% (really more like 1%) and not worry about it?

yarikoptic commented 5 years ago

BTW, ... would they actually need to filter them out? Why wouldn't you want them to model that entire "super" condition as well? If there are different movie cuts, you might want them explicitly in the model, even if only to absorb the transition (if it is visible) between different stimuli. If there is only one big one for the entire run - well, it will largely be your constant. If there is design imbalance and the stimulus files have subtle unique features (differently trimmed, color scheme, audio volume level), having them modeled might save us from one other possible retraction.

The only problem I see is if all the trials follow each other in such a way that the model becomes degenerate when the whole stim-file condition is present too. So, overall, it might be specific to the design.

The only con is that maybe those stimulus onsets and durations are actually of interest to other tools, not just the linear model, so they would need to recompute them as well. But it shouldn't be too hard.

As for extra unused metadata - I would say the more the merrier. My main concern is the fear of it being redundant, thus requiring "manual" recomputation if I find that, e.g., I need to fix an onset. Then I will forget, and the stimulus onset value will no longer be valid.

adelavega commented 5 years ago

I would agree this probably falls into the 1%, as the majority of experiments don't have sub-conditions within a stimulus. In the vast majority of cases, the mention of a stimulus indicates a complete presentation, so this is such a rare situation that it's probably not worth putting in the spec itself.

I still think it might be worth clarifying that including a stimulus in stim_file does not necessarily indicate that the stimulus is played from the beginning (which is what I assumed on first read).

satra commented 5 years ago

i'm not sure this is 1%. in many standard experiments there are sub-conditions. for example, in experiments that involve showing faces/objects there are often sub-categories: emotions, types of objects, types of faces (human faces/animal faces). in fact the modified hariri task is a perfect example of this, and it gets used by emotion/mood researchers a lot.

i don't think we should reinvent ontologies of stimuli (e.g., paradigms - https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3682219/, audio - https://research.google.com/audioset/ontology/index.html, images - https://bioportal.bioontology.org/ontologies/BIM), but we should provide a way for stimulus properties to be encoded appropriately.

> The more I think about this, the more I lean towards maybe keeping the current approach and not codifying this at all in the _events.tsv files. Maybe the solution is to require a supplementary metadata file for the stimulus files that contains the onsets. I.e., cond1_run-01.mp4 would have to have a cond1_run-01.json file that has fields PresentationOnset and PresentationOffset.

i like the idea of a json going alongside a stimulus file, but this json should be able to reflect timed objects inside it.

satra commented 5 years ago

just to follow up:

  1. in the case of the hariri task, trial_type can represent the most dominant trial type (faces or objects, for example, or neutral/angry/etc. for mood researchers). then stimulus properties could somehow represent not only details like cropping/full frame, colorspace, etc., but also ontological objects like "this is an image/video of a face".

  2. in the case of a movie, events.tsv could simply say "i showed this clip for 240s". the stimulus file should have a json file that can encode many different types of extracted events within the clip.

  3. another option is to allow multiple events files, and any model has to refer to a specific events file (maybe we allow composition of events files).

tyarkoni commented 5 years ago

@satra by "sub-conditions" here we're not talking about hierarchical organization, we're talking about a temporal subset of a single file. Codifying hierarchical structures is IMO not in scope, but in any case presents no particular challenge from an events.tsv perspective, because you can just put the filename for each event in the stim_file column, and the analyst is welcome to do whatever they want with that. The case we're talking about is where you have, say, an 8-minute movie file identified as the stim_file, but the presentation starts halfway through that clip. In such cases the analyst needs to have some way to know that the onset of the presentation isn't synced with the onset of the event. But this seems like an edge case (indeed, I'm pretty sure this is the first BIDS dataset we've run into where it's an issue), so the proposal is to just let it be.

tyarkoni commented 5 years ago
> 1. then stimulus properties could somehow represent not only details like cropping/full frame, colorspace, etc.

I think this is analogous to the movie example, but I still think it's an edge case. Situations where researchers dynamically crop images are likely to be pretty rare; in most cases, the cropping will have been done in advance, and what's in the stimuli/ folder will be what was presented to the subject.

I think a reasonable way to update the spec is to strongly encourage users to provide files in stimuli/ that are as close as possible to the ones participants actually experienced. That means temporally or spatially cropping movies and images if needed. But I agree with @adelavega that we should also explicitly say that there is no actual guarantee that the contents of stimuli map perfectly onto what participants experienced.

satra commented 5 years ago

@tyarkoni - sorry i misunderstood the within stimuli conditions, so please ignore the ontological variations (although see last paragraph below).

for movie, i'm thinking of things like commercial clips that are shown, and i'm sure that certain clips cannot be shared.

for movies as an example: are you saying i can extract faces, then specific emotions on those faces, and then encode both face and face+emotion in the events file - kind of a redundant stimulus list? all possible events in trial_type, and then the analyst figures out which trials are of interest? for many of our tasks, that would work pretty well.

tyarkoni commented 5 years ago

> for movie, i'm thinking of things like commercial clips that are shown, and i'm sure that certain clips cannot be shared.

I don't know that we can do anything about this, short of asking people to provide a description of where/how to obtain stimuli that can't be publicly shared. I don't think it's worth trying to codify this—there's too much variability in what that procurement process could look like.

> for movies as an example are you saying i can extract faces, then specific emotions on those faces, and then encode both face and face+emotion in the events file, kind of a redundant stimulus list.

Sure, you can create arbitrary columns in events.tsv that code anything you like. Aside from stim_file, you could add columns for face_id, face_gender, face_age, face_emotion_rater1, face_emotion_rater2, face_emotion_avg, and anything else you like. The expectation is that you then put descriptions of columns in the data dictionary in the JSON sidecar, though I believe this is non-mandatory right now.
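A data dictionary for such extra columns could be sketched in the JSON sidecar using the spec's existing Description/Levels fields. The column names below are the hypothetical ones from the comment above, and the level labels are made up for illustration:

```json
{
  "face_id": {
    "Description": "Identifier of the face shown during this event"
  },
  "face_emotion_rater1": {
    "Description": "Emotion label assigned by rater 1",
    "Levels": {
      "happy": "Face judged as happy",
      "angry": "Face judged as angry"
    }
  },
  "face_emotion_avg": {
    "Description": "Mean emotion intensity rating across raters"
  }
}
```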

Remi-Gau commented 11 months ago

This is an old one.

I wonder if HED tags can help with such an issue. @VisLab do you have some opinion on this?

VisLab commented 11 months ago

As it turns out, the HED Working Group has been discussing this very issue, and some of our members will weigh in shortly with a concrete proposal. @neuromechanist @dorahermes @tpatpa @dungscout96 @monique2208 @makeig

dorahermes commented 11 months ago

Yes, I agree that HED tags can be useful here and probably tackle this issue. Working through an example, it seems this may be a relatively large contribution, with some added machine-readable files in the /stimuli/ folder. While starting to work through a visual image and movie example with @neuromechanist, it became clear there would be a need for community review and for other examples as well, e.g., auditory, motor, electrical stimulation, etc. This perhaps rises to the scope of a potential BEP. Should we open a separate GitHub issue to discuss whether to open a BEP, or continue here?

@neuromechanist could share a preliminary google doc (not BEP yet, just the examples we were working through) if that would help give an idea?

Tagging some people who previously contributed to this discussion for input: @adelavega @tyarkoni @yarikoptic @satra @Remi-Gau

Remi-Gau commented 11 months ago

if we are talking about a BEP to help organize stimuli then there is overlap with : https://github.com/bids-standard/bids-specification/issues/751

neuromechanist commented 11 months ago

Reading here and in #751 resonates closely with the challenges we are exploring for including image and movie annotations in a couple of massive datasets we are working on. @dorahermes and @tpatpa are working on the annotation of the Natural Scene Dataset, and @smakeig, @dungscout96, and I are working toward the Healthy Brain Network's movie annotation.

In both projects, we see the need for top-level annotation files that would be used in the downstream *_events.tsv.

In this Google Doc, we are exploring the possibility of a file such as stimuli/stimuli.tsv to hold a list of the stimulus files and possible annotations (stimuli/stimuli.tsv is very similar to stims.tsv discussed in #751).

A sample stimuli.tsv file would look like this:

```
stim_file     type         NSD_id  COCO_id  first_COCO_description                                     HED
nsd02951.png  still_image  2951    262145   "an open market full of people and piles of vegetables."  ((Item-count, High), Ingestible-object), (Background-view, ((Human, Body, Agent-trait/Adult), Outdoors, Furnishing, Natural-feature/Sky, Urban, Man-made-object))
```

If the stimulus file has a time-varying context (such as a movie), a separate *_stimulus.tsv will hold the annotations. The structure of *_stimulus.tsv would be very similar to *_events.tsv, with onset, duration fields, etc. In any case, including the stimulus file name in the *_events.tsv stim_file column would link the task events (*_events.tsv) to the stimulus annotations (stimuli.tsv and *_stimulus.tsv).

We believe this method will make the annotation of stimulus files more reusable; researchers can reuse the stimulus files and select the stimuli.tsv rows (and *_stimulus.tsv files) of their choice for their new studies. Also, reusing the dataset with alternate annotations for the same stimulus files would be as straightforward as adding a column to *_stimulus.tsv or replacing the whole file with a new one.
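As a sketch of the reuse this enables, a tool could join *_events.tsv rows to stimuli.tsv annotations on the stim_file column. Only nsd02951.png appears in the sample above; the second stimulus row, its description, and the event timings below are invented for illustration:

```python
import csv
import io

# Minimal in-memory stand-ins for *_events.tsv and stimuli/stimuli.tsv.
events_tsv = """onset\tduration\tstim_file
1.0\t2.0\tnsd02951.png
4.0\t2.0\tnsd01234.png
"""

stimuli_tsv = """stim_file\ttype\tfirst_COCO_description
nsd02951.png\tstill_image\tan open market full of people
nsd01234.png\tstill_image\ta dog on a couch
"""

def read_tsv(text):
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))

# Index annotations by stim_file, then enrich each event row.
annotations = {row["stim_file"]: row for row in read_tsv(stimuli_tsv)}

enriched = []
for ev in read_tsv(events_tsv):
    ann = annotations.get(ev["stim_file"], {})
    ev["type"] = ann.get("type", "n/a")
    ev["description"] = ann.get("first_COCO_description", "n/a")
    enriched.append(ev)
```

Swapping in an alternate stimuli.tsv (e.g., a different rater's annotations) changes the enriched events without touching the *_events.tsv itself, which is the reusability argument made above.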

We appreciate your thoughts and comments on the Google Doc, as well as here. Our use cases are limited to a couple of visual and audiovisual stimuli; many other stimulation types may require other arrangements. We would also appreciate examples of other stimulus types, if possible.

dorahermes commented 11 months ago

@bids-standard/maintainers would be great to hear your thoughts on whether this is worthy of a small BEP, thank you!

Remi-Gau commented 11 months ago

Maybe not a BEP but several small orthogonal pull requests?

I can try to bring it up at the next maintainers meeting.

neuromechanist commented 8 months ago

Following https://github.com/hed-standard/hed-python/issues/810, it seems that expanding the _events.tsv files with what was called sub-conditions in the first post of this issue is a remodeler task. Nevertheless, the remodeler would require rules and guidelines for remodeling the _events.tsv with the contents of the stimuli/ directory.

As described in the HED issue above and also in the GDoc we are drafting for this issue, there could be two variations of this issue:

  1. Column-only extension for still stimuli, so that only specific columns (and annotations) would be added to the _events.tsv.
  2. Row extension with the possibility of column extension, in which the contents of a specific stimulus file will be merged with the contents of the _events.tsv.

A working example for the second case, which is the main focus of this issue, is the following scenario: In the CMI Healthy Brain Network project, subjects watch the Present movie during fMRI and EEG sessions, among other tasks (see a sample of the EEG-BIDS dataset).

The events for the Present movie are limited to the start and stop of the video:

```
onset    duration  sample  value        event_code
0.000    0.002     0       9999         9999
2.034    0.002     1017    video_start  84
205.098  0.002     102549  video_stop   104
```

However, it is clear that a movie contains far more events, and researchers would want to provide their own annotations based on their application. As a straightforward example, we identified the shot-transition events and quantified the Log Luminance Ratio of each shot transition. The file is included in the dataset as stimuli/the_present_stimulus-LogLumRatio.tsv:

```
onset    duration  shot_number  LLR
0        n/a       video_start  video_start
0        7.25      1            n/a
7.25     3.542     2            -1.557820733
10.792   5.208     3            0.3358234903
16       5         4            -0.03306866929
21       4.208     5            -0.2070276568
...      ...       ...          ...
165.25   6.667     55           -0.2270603551
171.917  31.292    56           0.1188704433
203.208  n/a       video_stop   video_stop
```

To merge the _stimulus.tsv into the _events.tsv after the initial import process (i.e., remodeling the events table) into EEGLAB, I have made a function that:

  1. gets the EEG structure, the _stimulus.tsv, and the names of the columns for extension,
  2. finds the common event names (here, video_start and video_stop) between the value column and the mentioned columns for extension,
  3. compares/corrects the timelines of the common events,
  4. merges the events of the _stimulus.tsv
  5. recreates EEG.event structure

This implementation is far from perfect, but it could serve as a working example of the implications of this mechanism for large and very large datasets. The Healthy Brain Network project spans over 7000 subjects with EEG and fMRI, and this mechanism will help dynamically use event annotations based on the research use case.
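Steps 2-4 of the function described above can be sketched in Python (the actual function targets EEGLAB/MATLAB). The anchor onsets are taken from the sample tables above; the shot labels are hypothetical:

```python
# Recording-side events: (onset in recording time, value).
recording_events = [
    (2.034, "video_start"),
    (205.098, "video_stop"),
]

# Annotation-side events from a *_stimulus.tsv: (onset in stimulus time, label).
stimulus_events = [
    (0.0, "video_start"),
    (7.25, "shot-2"),
    (10.792, "shot-3"),
]

def merge_stimulus_events(recording_events, stimulus_events,
                          anchor="video_start"):
    """Align the stimulus annotation timeline to the recording via a
    shared anchor event, then merge the shifted annotation rows."""
    # Steps 2-3: find the common anchor event and compute the timeline offset.
    rec_anchor = next(t for t, v in recording_events if v == anchor)
    stim_anchor = next(t for t, v in stimulus_events if v == anchor)
    offset = rec_anchor - stim_anchor
    # Step 4: shift annotation onsets into recording time and merge.
    merged = recording_events + [
        (t + offset, v) for t, v in stimulus_events if v != anchor
    ]
    return sorted(merged)
```

A fuller implementation would also compare the video_stop times on both sides to detect clock drift or trimmed playback before trusting a single-anchor shift.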

adelavega commented 8 months ago

I haven't had time to look at the entire proposal in detail, but overall the concept of annotating stimuli separately from the _events.tsv file seems reasonable, as it allows for the inclusion of detailed stimulus annotations without fundamentally changing the way _events.tsv works.

neuromechanist commented 4 months ago

Following the 4/12 conversations with @Remi-Gau, @adelavega, @yarikoptic, @arnodelorme and @dungscout96, there is quite a lot of enthusiasm for providing structure for the stimuli/ directory.

@yarikoptic and I jotted on the Google Doc to modify the suggestions to a (directory-less) BIDS naming structure, which also follows the ideas in #751.

Based on the Google Doc example, here is a draft suggestion:

```
stim-present_???.mp4|mkv|jpg|png
stim-present_???.json
[stim-present_annot-loglum_events.tsv]
[stim-present_annot-loglum_events.json]
…
stimuli.tsv
stimuli.json
```

TODO:

CC @VisLab, @dorahermes, and @monique2208 for comment.

adelavega commented 4 months ago

Looks good, but I'm concerned that mandating that stimuli have a specific name would make this backwards-incompatible with existing datasets (which name stimulus files whatever they want and just refer to them in the _events.tsv files).

It's a minor concern, but it just seems slightly out of scope to mandate a new way to name stimulus files. Would this be required overall, even if you do not have annotations?

adelavega commented 4 months ago

Seems like there was discussion regarding the top level stim- prefix here: https://github.com/bids-standard/bids-specification/issues/751

VisLab commented 4 months ago

> Looks good, but I'm concerned that mandating stimuli have a specific name would make this backwards incompatible w/ existing datasets (which name stimuli files whatever they want, and just refer to them in the _events.tsv files)

Not sure the proposal has to be backwards incompatible:

Now: an events.tsv with stim_file column value xxx/yyy.zzz implies a file at ./stimuli/xxx/yyy.zzz.

Potential proposal: the above stays the same... but...

In the ./stimuli/stimuli.tsv file, the row for this file has the first column value ./stimuli/xxx/yyy.zzz, and other columns can appear as defined in the ./stimuli/stimuli.json file.

Suppose that the stimulus file is a movie with annotations; then in the ./stimuli/xxx directory there can be a yyy_arbitrarystuff_annot.tsv and yyy_arbitrarystuff_annot.json that are interpreted as annotations for yyy.zzz. (Multiple raters may be available.)

The directory structure within the ./stimuli folder can be arbitrary as it is now.

neuromechanist commented 4 months ago

Current contenders for the stimuli modality suffix include:

  1. _stimulus (example: stim-the-present_stimulus.mp4)
  2. _media (example: stim-the-present_media.mp4)
  3. _stream (example: stim-the-present_stream.mp4)

Feel free to let me know if you have any other suggestions and which one you prefer, so I can update the list.

adelavega commented 4 months ago

_stimulus seems oddly redundant with the stim- prefix; otherwise, I slightly prefer _media but have no strong opinions.

yarikoptic commented 4 months ago

In the spirit of the future BIDS 2.0 with e.g.

neuromechanist commented 3 months ago

OK, sounds great. It seems the proposed stim and annot entities have good support. I'll make a pull request for them.

The suffix may need more consideration. Currently, _media seems to have more appeal.

Just a note that there is already a _stim suffix for individual stimulus files, defined under the physio data type. But I believe these two use cases have little relation to each other.

neuromechanist commented 3 months ago

Also, should we convert this issue to a BEP? Converting to a BEP would hopefully make the enhancements more visible and maintainable (although it will also require more work).

Talking to @yarikoptic and @dorahermes, they both seem to support a BEP for this issue.

neuromechanist commented 3 months ago

Added PR #1814 to add stimulus and annotation entities and the stim_id column.

The next steps would require inputs for:

monique2208 commented 1 month ago

It would be great to have this formalized! We have a large number of datasets where we present the same short movie as a localizer. Having one general annotation file which could apply to all of these datasets would really help with the analysis; it would remove a lot of redundancy in the event files, and I think it would provide something interesting to share on its own.

yarikoptic commented 1 week ago

> [ ] suffix: (_media)

Our suffixes so far can correspond to a number of things, but most are typically quite specific to a "data modality", so here we might want to be more specific too, e.g., have _audio, _video, _audiovideo (or _audio+video in some future BIDS where + would be allowed there), even though the modality could most often be discerned from the extension (but not necessarily).

effigies commented 1 week ago

+1 for _audio, _video and _audiovideo. It would make it easy to set permitted extensions for audio and video separately, and then just take the intersection for audiovideo.
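A sketch of that validation idea; the extension sets below are assumptions for illustration, not the spec's actual lists:

```python
# Hypothetical permitted extensions per suffix; the audiovideo set is
# simply the intersection of the audio and video sets, as suggested above.
AUDIO_EXTENSIONS = {".wav", ".mp3", ".ogg", ".mp4"}
VIDEO_EXTENSIONS = {".mp4", ".mkv", ".webm", ".ogg"}

PERMITTED = {
    "audio": AUDIO_EXTENSIONS,
    "video": VIDEO_EXTENSIONS,
    "audiovideo": AUDIO_EXTENSIONS & VIDEO_EXTENSIONS,
}

def extension_allowed(suffix, extension):
    """Check whether a file extension is permitted for a given suffix."""
    return extension in PERMITTED.get(suffix, set())
```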

neuromechanist commented 1 week ago

@yarikoptic, @dorahermes and I will meet on Tuesday 8/13 at 10 am PT to discuss the progress and the next steps. Please reach out to me if you want to join the conversation and I'll share the meeting details.

neuromechanist commented 1 week ago

> [ ] suffix: (_media)
>
> Our suffixes so far can correspond to a number of things, but most typically quite specific to "data modality", so here we might want to be more specific too, e.g. have _audio, _video, _audiovideo.

Probably we should include _image too. Agreed that with separate entities, checking the file extensions is much easier.

neuromechanist commented 3 days ago

@yarikoptic, @dorahermes, @TheChymera, and I joined the meeting. We agreed that the broad scope of the changes (including adding a prefix, a couple of entities, and suffixes) and their usability across several fields (EEG, fMRI, ...) justifies requesting a BEP.

@bids-maintenance, could you help raise this issue and elevate it to a BEP?

A couple of other discussion points during the meeting were:

  1. adopting _part for multi-part stimulus files,
  2. the stimulus type and how it should be documented in stimuli.tsv,
  3. file suffixes,
  4. whether to allow JSON-only files when the stimulus files are not present (for example, to describe the device and conditions under which the stimuli were presented), and
  5. resolving the concluded comments and reviews in the Google Doc.

The main discussion is on this Google Doc. The next meeting will be on August 27th at 10 a.m. PT.