bids-standard / bep021

Organizing and coordinating the BIDS Extension Proposal 21: Common Electrophysiology Derivatives
https://bids.neuroimaging.io/bep021

Discuss BEP021 derivatives for electrophys #5

Open robertoostenveld opened 3 years ago

robertoostenveld commented 3 years ago

This continues the issue started here https://github.com/bids-standard/bids-specification/issues/733. This BEP021 is a better place to continue the discussion, as we can also use projects and share files here which would not fit under bids-specification directly.

robertoostenveld commented 3 years ago

On 12 February 2021 we had a Zoom call to discuss the progress on BEP021, for which the draft is on Google Docs at http://bids.neuroimaging.io/bep021. Some of the people that attended were @arnodelorme, @dorahermes, @guiomar, @jasmainak, @agramfort, but I do not know everyone's GitHub handle. Please help by mentioning the other attendees with a GitHub presence here.

robertoostenveld commented 3 years ago

@jasmainak mentioned

The example dataset:

bids-standard/bids-examples#171 bids-standard/bids-examples#161

I haven't looked too closely but I see that it was merged by Robert :) Perhaps a dataset to use for discussion?

robertoostenveld commented 3 years ago

The pull requests bids-standard/bids-examples#171 and bids-standard/bids-examples#161 refer to “derivatives”, but looking at the dataset at https://github.com/bids-standard/bids-examples/tree/master/eeg_face13 there is nothing that sets it apart from a regular raw BIDS EEG dataset.

An example dataset that is a proper derivative is https://github.com/bids-standard/bids-examples/tree/master/ds000001-fmriprep. Note, however, that there are quite a few files in that derived dataset that have to be ignored, as they are not standardized (yet).

Just some general pointers that relate to some topics we discussed:

On https://bids-specification.readthedocs.io/en/stable/03-modality-agnostic-files.html#derived-dataset-and-pipeline-description it is specified that a derived dataset is also a BIDS dataset. The derived dataset has some required and some recommended extra elements in the dataset_description. When you prepare an example/draft derived ieeg/eeg/meg dataset, please keep these in mind.
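For reference, a minimal sketch of what the dataset_description.json of such a derived dataset could look like (field values here are illustrative, not taken from an actual dataset; per the common-derivatives spec, GeneratedBy is required and SourceDatasets is recommended):

```json
{
  "Name": "Example derived EEG dataset",
  "BIDSVersion": "1.6.0",
  "DatasetType": "derivative",
  "GeneratedBy": [
    {
      "Name": "filtering",
      "Version": "0.1.0",
      "CodeURL": "https://example.org/code/filtering"
    }
  ],
  "SourceDatasets": [
    {
      "DOI": "10.18112/openneuro.ds003645.v1.0.0"
    }
  ]
}
```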

Also, on https://bids-specification.readthedocs.io/en/stable/05-derivatives/01-introduction.html it states "Derivatives are outputs of common processing pipelines, capturing data and meta-data sufficient for a researcher to understand and (critically) reuse those outputs in subsequent processing.” That lines up with what Scott and I were saying that a (derived) dataset should be (re)usable in its own right.

robertoostenveld commented 3 years ago

ping @adam2392 @hoechenberger

robertoostenveld commented 3 years ago

@guiomar wrote:

Hi! I copy here the email too :)

Thanks everyone for the meeting today!

We thought that for the next meeting some of us could present some practical examples so we have a starting point to discuss. Preferably we could share the datasets in advance, so everyone who has time can go over them and focus the discussion on the more problematic points.

Here's the link to the common derivatives specs merged in 1.4.0: https://bids-specification.readthedocs.io/en/stable/05-derivatives/01-introduction.html There, "derivative" is defined as:

Derivatives are outputs of common processing pipelines, capturing data and meta-data sufficient for a researcher to understand and (critically) reuse those outputs in subsequent processing.

In line with what we have discussed.

Also note that there are some metadata fields to point to source data: https://bids-specification.readthedocs.io/en/stable/05-derivatives/02-common-data-types.html

It's important to distinguish between two types of derivatives:

• pre-processed: in essence similar to the source data (the datatype remains unchanged)
• processed: substantially different from the source data (the datatype changed)

So I think pre-processed derivatives can be easily addressed at this point (e.g. detrending, filtering, downsampling, re-referencing).

In this logic I don't know where annotations may exactly fall. But they are also an important step (and I think that can be also interesting for other modalities and extensions).

Talk very soon!

robertoostenveld commented 3 years ago

and @sappelhoff wrote that

As far as I heard, @smakeig (Scott Makeig) and @nucleuscub (Ramon Martinez-Cancino) also attended

arnodelorme commented 3 years ago

Thank you, Robert, for this summary. We will generate test datasets as discussed and then we should reconvene.

guiomar commented 2 years ago

Hi @sappelhoff @robertoostenveld @arnodelorme, @dorahermes, @jasmainak, @agramfort @hoechenberger @smakeig @nucleuscub @CPernet !

I want to retake the effort on ephys derivatives.

Having a new look at the document, I see there are 3 main blocks: 1) Annotations 2) Preprocessing (derivatives that don't change the datatype) 3) Processing (new datatype)

I am inclined to divide the work into independent lines, to avoid getting stuck due to the large amount of work ahead. And I would start by dealing with 2) pre-processing: we have almost all the details needed to tackle this one. I'm preparing some examples.

Would you like to meet and move this part forward? https://www.when2meet.com/?13797922-eLdhm

hoechenberger commented 2 years ago

And I would start by dealing with: 2) pre-processing.

Preprocessing sounds good to me; I'm not sure about the point in the parentheses, though ("derivatives that don't change the datatype"). To me, preprocessing also includes epoching the data, which creates new data types too. But I'm also happy to just limit the next discussion to continuous data (Maxwell filtering, frequency filtering, …)

I will share some datasets we process using our BIDS pipeline shortly so you all can see how we're currently approaching things.

Cheers, Richard

cc @agramfort

guiomar commented 2 years ago

Thanks @hoechenberger !!

I think this definition comes from the generic derivatives specification, but I'm not able to find it anymore. Still, it makes sense to differentiate between the two in this first approach, until we agree on the data format to be used when the datatype of the raw source is changed, to avoid getting blocked by that decision. The annotations part is also more prone to debate. So let's focus on the more objective and straightforward part!

This sounds awesome Richard! Thanks!

adam2392 commented 2 years ago

I have read over the BEP021 derivatives and also added my availability, although I'll be in California at the time, so it could be hard to overlap.

I also utilized the "annotations" derivatives framework to create a dataset of event markings found in the iEEG via an automated algorithm. Specifically, these are "high-frequency oscillation" (HFO) onsets/durations/channels stored as a TSV file. It works well for my use case, but the TSV file does "explode" in length. The dataset is at the Dropbox link below.

https://www.dropbox.com/sh/5ih5ay9fvo3q12s/AADBY5eDc_SmszHGyC3Mn6QJa?dl=0
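For illustration, a sketch of what such an annotation TSV might look like, based on the onset/duration/channels columns described above (channel names and values here are hypothetical, not taken from the actual dataset):

```
onset	duration	channels
12.35	0.080	LAH1
12.41	0.065	LAH2
87.02	0.092	LPH4
```

One row per detected HFO event; with automated detectors producing hundreds of events per channel, this is exactly where the file length "explodes".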

jasmainak commented 2 years ago

I'm hammered that week. Unlikely to be able to join :(

I agree about the "explosion" issue that @adam2392 pointed out; I think I've seen it before. It might help to limit the vocabulary of annotations so it is machine-readable, not just machine-writeable.

guiomar commented 2 years ago

Hi @adam2392! Awesome! We can review the annotations with your example as well :)

guiomar commented 2 years ago

It seems the most preferred day to meet is Wed 15 Dec from 5 to 6pm CET. I'll send you calendar invites with a hangout link :)

hoechenberger commented 2 years ago

Thank you, @guiomar!

guiomar commented 2 years ago

For those who didn't receive the invitation and still want to join, this is the link to hangouts:

bep021 - ephys derivatives
Wednesday, 15 December · 17:00 – 18:00
Google Meet joining info
Video call link: https://meet.google.com/ccc-uzuc-nyg
Or dial: (US) +1 716-249-4224 PIN: 807 934 263#

guiomar commented 2 years ago

Thank you all who joined yesterday! @hoechenberger @robertoostenveld @adam2392 @smakeig @tpatpa

I would like to summarize here some of the main points we discussed in the meeting:

1) Annotations:

2) Preprocessing steps:

I have dedicated some time to reorganize the BEP021 documentation accordingly, since it was becoming very messy: https://docs.google.com/document/d/1PmcVs7vg7Th-cGC-UrX8rAhKUHIzOI-uIOh69_mvdlw/

Please add any other point I may have forgotten and you consider important :)

We planned to do another meeting in January to show some examples and continue further discussing the remaining issues. If you are interested, you can mark your availabilities here: https://www.when2meet.com/?13923648-HOHfD

Talk soon!

smakeig commented 2 years ago

Attached FYI is the paper in press at NeuroImage on HED event annotation in BIDS EEG (or MEG) data - a how-to and guidelines exposition.

Scott


-- Scott Makeig, Research Scientist and Director, Swartz Center for Computational Neuroscience, Institute for Neural Computation, University of California San Diego, La Jolla CA 92093-0559, http://sccn.ucsd.edu/~scott

guiomar commented 2 years ago

Thanks @smakeig! I can't see any attachment; maybe it's easier if you share the links?

smakeig commented 2 years ago

The paper in press is available at

https://drive.google.com/file/d/1c8WaTSBgHgOmARX3g8e3LpD7izfDlG9l/view?usp=sharing

Scott


guiomar commented 2 years ago

Hello @sappelhoff @robertoostenveld @arnodelorme, @dorahermes, @jasmainak, @agramfort @hoechenberger @smakeig @nucleuscub @CPernet @adam2392 @tpatpa!

We planned to do another meeting these days, if you are interested, you can mark your availabilities here: https://www.when2meet.com/?13923648-HOHfD

guiomar commented 2 years ago

Thanks a lot! Let's make it Tuesday 25 January at 1:00pm CET. I'll send an invitation shortly.

guiomar commented 2 years ago

Details for joining:

bep021 - ephys derivatives
Tuesday, 25 January · 13:00 – 15:00
Google Meet joining info
Video call link: https://meet.google.com/ccc-uzuc-nyg
Or dial: (US) +1 716-249-4224 PIN: 807 934 263#

robertoostenveld commented 2 years ago

Let me repost here the content of an email that I already sent to the invitees of the most recent BEP021 meeting, reformatted slightly. Having the discussion here rather than via email keeps it out in the open for everyone to follow and contribute.

Following the discussion of 25 Jan, in which we updated the BEP021 google doc to indicate what is in and out of scope, I have worked on some example pipelines and derivatives corresponding to sections 6.2, 6.3, 6.4, 6.5 and 6.6 in the google doc.

I started with doi:10.18112/openneuro.ds003645.v1.0.0, downloaded it (partially) and wrote a script to make a selection. I do not consider this part of the pipeline yet (although it could have been), so my starting point is “ds003645_selection” (selection code included).

Starting from “ds003645_selection", I ran the following pipelines

This results in 6 derivatives (also shown below). The code for each is in the respective "code" directory. Note that more lines of code are needed for data handling than for the actual pipeline (which I implemented with FieldTrip).

The resulting directory tree (only showing directories, not the files) looks like this (see below) and can be browsed on google drive (no login needed, view only).

ds003645_selection
├── code
├── derivatives
│   ├── downsampling
│   │   ├── code
│   │   ├── sub-002
│   │   │   └── eeg
│   │   ├── sub-003
│   │   │   └── eeg
│   │   └── sub-004
│   │       └── eeg
│   ├── filter_and_downsample_and_rereference
│   │   ├── code
│   │   ├── sub-002
│   │   │   └── eeg
│   │   ├── sub-003
│   │   │   └── eeg
│   │   └── sub-004
│   │       └── eeg
│   ├── filtering
│   │   ├── code
│   │   ├── derivatives
│   │   │   └── downsampling
│   │   │       ├── code
│   │   │       ├── derivatives
│   │   │       │   └── rereference
│   │   │       │       ├── code
│   │   │       │       ├── sub-002
│   │   │       │       │   └── eeg
│   │   │       │       ├── sub-003
│   │   │       │       │   └── eeg
│   │   │       │       └── sub-004
│   │   │       │           └── eeg
│   │   │       ├── sub-002
│   │   │       │   └── eeg
│   │   │       ├── sub-003
│   │   │       │   └── eeg
│   │   │       └── sub-004
│   │   │           └── eeg
│   │   ├── sub-002
│   │   │   └── eeg
│   │   ├── sub-003
│   │   │   └── eeg
│   │   └── sub-004
│   │       └── eeg
│   └── rereference
│       ├── code
│       ├── sub-002
│       │   └── eeg
│       ├── sub-003
│       │   └── eeg
│       └── sub-004
│           └── eeg
├── sub-002
│   └── eeg
├── sub-003
│   └── eeg
└── sub-004
    └── eeg 

As discussed yesterday, your review of these examples can be used to get concrete ideas about what needs to be done to extend the specification. Note that, AFAIK, I have now created derivative datasets that are compliant with BIDS version 1.6.0. I did not run the validator (as it does not work on derivatives yet).

Some known issues at the moment of creating the derivatives:

robertoostenveld commented 2 years ago

@arnodelorme wrote in a reply to my email

I feel that for reproducibility purposes we should indicate the version of MATLAB (and FieldTrip) in the script. Also, the script could be given a standard name (like derivative.m) to avoid confusion if there are several scripts in the folder.

For the script:

  • Add the DOI of the parent BIDS dataset (as a comment). There can be several snapshots of a dataset, so the name alone is not enough. NOTE: THIS MIGHT BE A MAJOR FLAW OF THE CURRENT DERIVATIVE FORMAT, AS THE "SourceDatasets" field of "dataset_description.json" does not allow one to uniquely identify the dataset.
  • Add the version of MATLAB as a comment
  • Add the version of FieldTrip as a comment

This could be an optional recommendation for people, so that we maximize reproducibility. Or we could imagine a specific JSON file for that in the code folder (derivative.json).
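A sketch of what such a derivative.json could contain (all field names here are hypothetical; nothing of this is standardized):

```json
{
  "SourceDatasetDOI": "doi:10.18112/openneuro.ds003645.v1.0.0",
  "MatlabVersion": "9.11 (R2021b)",
  "FieldTripVersion": "20220104",
  "Script": "derivative.m"
}
```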

I would not cascade BIDS derivative repositories if there is no intention of assigning a DOI to each one. In that case (downsampling plus filtering), create a single dataset that is both downsampled and filtered. This would avoid the proliferation of datasets and confusion.

Alternatively, we could assign a DOI to a cascaded BIDS dataset (so parent folder plus derivative folder plus the derivative of the derivative folder). However, adding a derivative folder means that we have to create a new DOI for the whole cascaded BIDS dataset, which does not seem like a good idea.

Also, we should consider in the derivative specification folders named derivative, derivative2, derivative3, etc.

Note that the proposed architecture is opposite to the behavior of the cfg variable in FieldTrip, where parent provenance is stored in the substructure cfg.cfg (unless I am mistaken). In BIDS, parent datasets are stored in the parent folder. I personally prefer it when parents are stored in a subfolder (more intuitive). Also, this would avoid the problem of having multiple "derivative" folders (a dataset usually has a single parent, but a parent can have multiple children). That's a broader discussion, of course.

robertoostenveld commented 2 years ago

Looking at the nested derivatives data structure that I created, I realize two things. These are more fundamental than the discussion on which specific metadata fields are needed (like the MATLAB and FieldTrip version, to which I agree).

  1. Do we want to store provenance (filter settings etc.) along with each individual derivatives data file (i.e. duplicated), or do we want to store it along with the information about the pipeline? Note that if we were to store it with each datafile in the eeg.json, we could make use of inheritance and deduplicate it at the top level.
  2. Do we want to store the provenance of the previous steps along with every derivative? This touches upon the data.cfg.previous.previous.previous strategy used in FieldTrip.

Regarding 1: it makes sense (and is needed) if each file were processed differently. But along the processing pipeline we usually make raw data, which might be inconsistent at the individual-participant level, more and more consistent. If the same pipeline is used on all data files, then documenting the pipeline metadata once should suffice. Note that documenting it along with the data is like data.cfg.previous.previous.previous, whereas documenting it with (or in) the code is more similar to having the script available (e.g. on GitHub, in the code directory, or as documented in the GeneratedBy metadata field).
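As an illustration of deduplicating via inheritance: a single top-level sidecar could document the settings once for all files below it. SoftwareFilters and EEGReference are existing EEG sidecar fields; the placement and values here are illustrative:

```json
{
  "SoftwareFilters": {
    "HighPass": {"CutoffFrequency": 0.1, "FilterType": "butterworth"},
    "LowPass": {"CutoffFrequency": 40}
  },
  "EEGReference": "average"
}
```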

Regarding 2: Assuming that we would only store provenance of the last step, then in my example the metadata of ds003645_selection/derivatives/rereference could not be distinguished from the metadata of ds003645_selection/derivatives/filtering/derivatives/downsampling/derivatives/rereference, as both have the same step as the last. Although @arnodelorme mentions that my sequential cascade of pipelines is not optimal w.r.t. data handling (and I agree), the principle is that pipelines can be cascaded. My cascade should therefore be considered as an example that should be supported if a researcher wants to do so.

My item 2 also relates to what Arno mentions w.r.t. pointing from the derivative to its source. He discusses it in relation to DOIs, and hence published/persistent datasets, whereas when creating the examples I was not thinking about publishing, but rather about referencing the local (on my hard disk or my lab's network disk) relation between datasets. This also relates to PRs https://github.com/bids-standard/bids-specification/pull/820 and https://github.com/bids-standard/bids-specification/pull/821.

I think we all have some implicit expectation about how people (including ourselves) work and when to start a new derivative, or when to "weave" additional files into an existing derivatives. In general my directory organization looks like this

projectId/raw       # might be a symbolic link to data on another volume
projectId/code      # ideally also on github for version control and collaboration 
projectId/results

and as long as I keep on working on the pipelines in the code directory, the intermediate and final files continue to go into the results directory (which has some internal directory structure similar to BIDS). Once the analysis is finalized, I would clean up the code and results (prune dead branches, rename files for improved consistency, possibly rerun it all once more to ensure consistency and reproducibility) and consider publishing the results+code directories together. Those would then comprise the BIDS derivative. An example is this with raw and derivatives data, plus a copy of the code on github (a static version of the code is also included in the derivative).

The example that I prepared is however at a larger collaborative (and more FAIR) scale, where Daniel and Rik prepared the initial multimodal data ds000117, which was then handled by Dung and Arno, resulting in ds003645, which was handled by me, resulting in ds003645_selection, which was then handled by the "filtering guy", resulting in derivatives/filtering, which was then handled by the "downsampling guy", etc. That leads me to the question: which information do we want to be present at which level (e.g. is the EEG amplifier brand still relevant after 5 stages of processing?), and which information do we expect re-users of a dataset to look up in the ancestor dataset (which then needs to be accessible)?

agramfort commented 2 years ago

I have not looked in detail, but browsing the Google Drive I see that the entity desc is used, while I would have imagined proc to be more natural. It's not a strong feeling, though. The derivatives > derivatives > nested structure is coherent but maybe hard to navigate. If you have 6 steps of processing (maxfilter, temporal filtering, artifact rejection, epochs, averaging, ...) it's a lot, and I would have considered putting all the produced files in the same derivatives folder for a given subject. With the proposal here, is what I describe a valid option? Thx @robertoostenveld for the coordination

robertoostenveld commented 2 years ago

From the spec (emphasis mine): "The proc label is analogous to rec for MR and denotes a variant of a file that was a result of particular processing performed on the device. This is useful for files produced in particular by Elekta’s MaxFilter (for example, sss, tsss, trans, quat or mc), which some installations impose to be run on raw data because of active shielding software corrections before the MEG data can actually be exploited."

The existing entities res/den/label/desc are specifically applicable to derivative data, and desc is used "[w]hen necessary to distinguish two files that do not otherwise have a distinguishing entity...", which is why I used it with the short description of the pipeline.

My example pipelines only produced a single result per raw input file; the sequential application served to have us think about what happens if we pass derivatives around between each other or on openneuro. If you have a pipeline that produces multiple results (which also makes sense), then those can be placed next to each other. I imagine that could result in

sub-002_task-FacePerception_run-1_desc-pipelinenameresult1_eeg.eeg + vhdr + vmrk
sub-002_task-FacePerception_run-1_desc-pipelinenameresult2_eeg.eeg + vhdr + vmrk
sub-002_task-FacePerception_run-1_desc-pipelinenameresult3_eeg.eeg + vhdr + vmrk
sub-002_task-FacePerception_run-1_desc-pipelinenameresult4_eeg.eeg + vhdr + vmrk

where the description (hence desc) pipelinenameresultN combines both the pipeline name and the specific result. However, you could also imagine

sub-002_task-FacePerception_run-1_desc-pipelinename_xxx-result1_eeg.eeg + vhdr + vmrk
sub-002_task-FacePerception_run-1_desc-pipelinename_xxx-result2_eeg.eeg + vhdr + vmrk
sub-002_task-FacePerception_run-1_desc-pipelinename_xxx-result3_eeg.eeg + vhdr + vmrk
sub-002_task-FacePerception_run-1_desc-pipelinename_xxx-result4_eeg.eeg + vhdr + vmrk

where xxx codes another to-be-defined entity to split the pipeline from the result.

I don't think that we will benefit from long file names with specific entities for specific pipelines (such as filt-lp and filt-hp for different types of filters, or ref-avg and ref-common for different rereferencing schemes). The entity list is already very long, and if we want a unique entity for each step, combined with prescribed values for the label, the specification would have to become very elaborate. I rather think that we would benefit from a simple desc-pipelinenameresultN, where we would specify a very clear/explicit mapping of pipelinenameresultN to a metadata field inside the corresponding JSON. A bit similar to how we have the required JSON metadata field TaskName for functional data (which maps onto task-<TaskName> in the filename), combined with the recommended TaskDescription metadata field. It could then be DescName and DescDescription (which sounds silly) or so. However, I think the existing GeneratedBy also serves the purpose just fine.

An important difference between GeneratedBy and extending the filename of individual files (and/or the JSON metadata of individual files) is whether we document the pipeline once in dataset_description.json, or many times for each resulting file.
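For comparison, a minimal sketch of what that once-per-dataset documentation could look like in dataset_description.json. The GeneratedBy structure is taken from the current derivatives spec; the pipeline name, version, description and URL shown here are made up for illustration:

```json
{
  "Name": "pipelinename derivatives",
  "BIDSVersion": "1.6.0",
  "DatasetType": "derivative",
  "GeneratedBy": [
    {
      "Name": "pipelinename",
      "Version": "0.1.0",
      "Description": "Filtering, downsampling and re-referencing of the raw EEG",
      "CodeURL": "https://example.org/pipelinename"
    }
  ]
}
```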

smakeig commented 2 years ago

Robert - Thanks for pointing to possible differences between processing data on your local hard drive and on an open resource (what I am proposing to call a DATCOR, for 'data, tools and compute resource') such as we are attempting in NEMAR - although you are thinking of the distinction between your (active) local disk store and a passive archive (data library model). In EEGLAB, we expected users to make successive copies of the data as processing progressed -- but we didn't (yet!) take the next logical step you are proposing, to formalize such dependent data structures (heterarchies) -- here, under BIDS. In building NEMAR, we are facing the problem in part from the start, as we are contracted to compute and provide users visualizations of a variety of data measures, obtained by running each NEMAR dataset through a standard pipeline. It makes little sense to go through the process and then delete the preprocessed dataset, requiring anyone wanting to follow up in the directions leading to the visualizations we offer to recompute it. On the other hand, it could be wasteful and overly complex to store and make available a whole tree of intermediate processed data objects as in your example ... This is a decision we will have to make in part in view of constraints particular to the NEMAR project.

However, it seems to me that the whole of the BIDS metadata for a data 'scan' (there should be a less parochial BIDS term for the data object) is typically much smaller than the data itself (even with full HED annotation included), so it is much preferable to carry it along as a part of the derived data objects. This allows those objects to be e.g. sent somewhere for further computation without loss of crucial information -- though I suppose a BIDS tool could be written to gather all the metadata associated with a data object from its tree to send with it (making it a metadata-complete object). Does such a tool already exist?

Scott

robertoostenveld commented 2 years ago

Hi Scott,

I am not aware of a tool that gathers all metadata from a data object in BIDS to send it along elsewhere. Upon ingestion of sourcedata in a BIDS dataset you could say that dcm2niix is such a tool, or FieldTrip's data2bids, as both try to get as much information as possible from the sourcedata. Upon reading a data object from a BIDS dataset my ft_read_header will read the associated metadata from the JSONs (see https://github.com/fieldtrip/fieldtrip/blob/fe125f4d39a33a2ec7c619403e1f83aaa1e007e7/fileio/ft_read_header.m#L2865), but right now it only reads the minimal and not the complete metadata. I think that can be improved, so I just opened this issue for FT https://github.com/fieldtrip/fieldtrip/issues/1967.

With FieldTrip we have “dataout.cfg.previous.previous.etc" to keep provenance (see here https://youtu.be/7B4rDZYwQLM?t=3469 for a 1-minute explanation). We have experienced that over time (i.e. after multiple processing steps) the data tends to get smaller and smaller (e.g. averaged ERPs, or group stats), whereas the provenance only gets larger and larger. So after some time the hierarchical provenance does get much larger than the actual processed data.

I remembered another place where we have an example for potential guidance: in the MEG pipelines for HCP (which was pre-BIDS) we also kept provenance. I’ll dig that up and share here as well.

best Robert


jesscall commented 2 years ago

Hi BEP021 community,

I've recently joined (thanks to @CPernet ) as part of a new initiative called EEGNet headed from the Montreal Neurological Institute with @christinerogers and built on LORIS.

1- We are working with BIDS-EEG derivative data from @jadesjardins which includes continuous time annotations - originally generated and stored in EEGLab VisEd Marks. Could sharing an example file from our dataset help resolve the discussion on continuous annotation in the spec? We'd be glad to contribute samples and help move this section forward.

2 - In a BIDS file structure with raw and derivative data, how should the data inheritance be practically implemented for derivatives? In implementing our data platform, we are exploring how data from the raw folder - e.g. recording parameters, events, electrodes, channels - should be accessed for derivatives, since they're frequently needed for analysis/QC of derivatives but inefficient to extract from the file structure.

Can anyone advise?

I hope this is the right place for my questions. We're also looking forward to joining a future BEP021 call.

Thanks, Jessica Callegaro @jesscall Montreal Neurological Institute | EEGNet.org

robertoostenveld commented 2 years ago

Hi @jesscall and welcome!

Regarding annotations: Yes, I think that sharing and reviewing representative data (which can be small subsets) will help.

Regarding raw and derivative data: each dataset (be it raw or derived) should be interpretable by itself. Metadata that is present in the raw dataset but also needed for proper interpretation of the derivative therefore must be replicated in the derivative. Extracting the (rather limited and mostly technical) metadata from the EEG files can for example be done using the "decorate" option in data2bids but I think that the more relevant user-supplied metadata is already represented in the raw JSON files; those can be read and copied over from the raw dataset to the derivatives dataset.
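The copy-over step mentioned above can be sketched in a few lines of Python. The `copy_sidecars` helper and the `_eeg.json` filename pattern are my own illustration, not anything prescribed by BIDS or implemented in data2bids:

```python
# Sketch: replicate the user-supplied JSON sidecars from a raw BIDS dataset
# into a derivative dataset, so the derivative stays interpretable on its own.
# Paths and the filename suffix are illustrative, not prescribed by BIDS.
import shutil
from pathlib import Path

def copy_sidecars(raw_root, deriv_root, suffix="_eeg.json"):
    """Copy every matching JSON sidecar from raw_root into deriv_root,
    preserving the subject/session directory layout."""
    raw_root, deriv_root = Path(raw_root), Path(deriv_root)
    copied = []
    for src in raw_root.rglob(f"*{suffix}"):
        dst = deriv_root / src.relative_to(raw_root)
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)
        copied.append(dst)
    return copied
```

In practice a pipeline would then update the copied JSONs (e.g. new SamplingFrequency after downsampling) rather than leaving them verbatim.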

smakeig commented 2 years ago

Robert -

Can you say how, when "the provenance only gets larger and larger" -- do you mean it grows in a super-linear fashion? Even interpretation of a single-channel ERP (constituting, say, some terminal derived dataset) might well require metadata whose type might be hard to decide in advance ...

Scott


CPernet commented 2 years ago

Hi @jesscall

as we discussed, and as now updated by Robert in the spec (https://docs.google.com/document/d/1PmcVs7vg7Th-cGC-UrX8rAhKUHIzOI-uIOh69_mvdlw/edit), these continuous annotations should be classified as 'raw' and stored via TSV files as described here: https://bids-specification.readthedocs.io/en/stable/04-modality-specific-files/06-physiological-and-other-continuous-recordings.html

but an example would be really good - if @jadesjardins is OK with that, put 0 kB data files and annotated TSVs of a subject or two in a fork of https://github.com/bids-standard/bids-examples/tree/master

robertoostenveld commented 2 years ago

@smakeig wrote

Can you say how, when "the provenance only gets larger and larger." -- do you mean it does in a super-linear fashion??

If you take a raw dataset, make a derivative, and from that another derivative, and from that yet another derivative, AND if you want to store all provenance (going back to the raw data) in the very last derivative dataset, the chain of provenance gets larger, whereas the (processed and interpreted) data tends to get smaller (like a single condition averaged ERP or ERSP).

Considering only the provenance size, it grows linearly. Considering the relative size of the provenance (which might get successively larger) versus the size of the data, it grows super-linearly (as the numerator gets larger and the denominator smaller). Considering the complexity to produce the provenance, but also for a human to parse it, that does not scale linearly anyway: having more than 7 things in working memory increases the difficulty exponentially, and quickly to an unmanageable level (unless you have chunking strategies).

robertoostenveld commented 2 years ago

_TL;DR - I have compared my previous BIDS example to the HCP. For MEG data in HCP we stored provenance per file. We could do the same in BIDS with the desc-<pipeline>_eeg.json files, and reduce redundancy by using the inheritance principle._

In https://github.com/bids-standard/bep021/issues/5#issuecomment-1028710359 I promised to look up and share how we handled provenance for the MEG component of the HCP.

I have just added data from a single subject to my shared google drive folder. All data files (MEG, mat, nii, gii) have been truncated to zero bytes. If you click around, you will find provenance directories throughout, which for each result file have a corresponding xml file with its provenance. Important to mention that this is part of the 900 subjects release and includes all raw data plus results of a whole slew of MEG processing pipelines.

The raw and processed HCP MEG data for a single subject amounts to 64GB. After truncating the data files to zero bytes, 31 MB remains (that includes some txt results and excel sheets). As we have a guaranteed provenance file for each individual result thanks to hcp_write_matlab.m and its companions for the other output formats, I can count the number of results with

mac036-lan> find . -name \*.xml | wc -l
    2252

The provenance for each of the 2252 results (aka output files) looks very similar; here is an example

<?xml version="1.0" encoding="utf-8"?>
<megconnectome xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="megconnectome.xsd">
   <version>
      <matlab>
         <Name>MATLAB</Name>
         <Version>8.0</Version>
         <Release>(R2012b)</Release>
         <Date>20-Jul-2012</Date>
      </matlab>
      <megconnectome>
         <Name>megconnectome</Name>
         <Version>3.0</Version>
         <Release>www.humanconnectome.org</Release>
         <Date>03-Jul-2015</Date>
      </megconnectome>
      <fieldtrip>
         <Name>FieldTrip</Name>
         <Version>r10442</Version>
         <Release>fieldtriptoolbox.org</Release>
         <Date>09-Jun-2015</Date>
      </fieldtrip>
   </version>
   <compiled>true</compiled>
   <username>f.dipompeo</username>
   <hostname>node018</hostname>
   <architecture>glnxa64</architecture>
   <buildtimestamp>03-Jul-2015 11:34:31</buildtimestamp>
   <pwd>/HCP/scratch/meg/intradb/archive1/HCP_Phase2/arc001/177746_MEG/RESOURCES/rmeg</pwd>
   <matlabstack>
In hcp_write_provenance at 39
In hcp_write_figure at 77
In megconnectome at 129
</matlabstack>
   <script>
      <filename>/HCP/scratch/meg/release/software/megconnectome-3.0/pipeline_scripts/hcp_icamne.m</filename>
      <md5sum>8b6151012ea73b2349126db556782045</md5sum>
   </script>
   <filename>177746_MEG_3-Restin_icamne_1.png</filename>
   <md5sum>3ca65ffd3fbaa2a7cb2469c6a6ce7e33</md5sum>
</megconnectome>

It contains computer details, MATLAB details, FieldTrip details, the megconnectome package details (we carefully managed the releases for that purpose) and the details on the specific analysis pipeline script that resulted in the specific output file.

Furthermore, to ensure consistent code over all participants, consistent execution, and data sharing, the output contains the md5sum of the analysis pipeline script and of the resulting data file.
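Such a checksum can be computed in a few lines; this `md5sum` helper is a generic illustration of the idea, not the actual hcp_write_provenance code:

```python
# Sketch: compute the hex md5 digest of a file (e.g. the pipeline script or
# the output data file), as recorded in the HCP-style provenance XML above.
import hashlib

def md5sum(path, chunk_size=8192):
    """Return the hex md5 digest of a file, read in chunks so that
    arbitrarily large data files can be handled."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```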

Comparing this to the existing BIDS specification, the computer, matlab and fieldtrip details could all be represented in the dataset_description.json. If you compare this to my previously outlined example (also on the google drive): that has the code in the code directory.

The main difference that I see between the HCP provenance and the one in my ds003645_selection/derivatives/filtering/derivatives/downsampling/derivatives/rereference/code example (besides the file format and where the provenance is stored) is whether provenance is kept for each file (HCP), or whether a single provenance is kept at the dataset level (my BIDS example). Keeping multiple provenances makes more sense if you have multiple outputs of the pipeline. In line with BIDS, each output data file could then get its own json file with provenance details (except for the ones that can be represented at the top level in dataset_description). To prevent duplication, the BIDS inheritance principle could be applied, resulting in a single desc-<pipeline>_eeg.json file at the top level of the derivative dataset.

smakeig commented 2 years ago

"having more than 7 things in working memory increases the difficulty exponentially, and quickly to an unmanageable level (unless you have chunking strategies)."

That is why provenance is useful - as a (less fallible) external memory aid ... Your suggestion of (as I understand it) accumulating the dataset provenance in a top level file seems good.

Scott


jesscall commented 2 years ago

@CPernet @robertoostenveld

Thanks for the responses. I'll look into the linked documents and update my comment here with further questions.


Update:

I will be taking this offline and following up with @SaraStephenson and @jadesjardins in order to get an example dataset ready and uploaded.

robertoostenveld commented 2 years ago

Yesterday we had a BIDS steering group meeting (@guiomar also attended) and it was mentioned there that BEP028 is making good progress. I will study that; you might want to look at it as well.

Furthermore, we also briefly touched upon the two complementary motivations for BIDS: making the results of analysis replicable (e.g. being able to recompute them) and making raw or derived data reusable (for follow-up analyses). The first requires extensive details, the second can also be achieved with minimal metadata.

Also (as I was reminded in the meeting yesterday), the overarching BIDS strategy is to keep things as simple and small as possible, considering the 80/20 Pareto principle.

SaraStephenson commented 2 years ago

Hello All,

With the help of @CPernet and @jesscall, @jadesjardins and I have prepared an example of the Face13 dataset with annotations stored in .tsv files and described in .json files. This current example is for discussion surrounding how to store continuous time annotations.

These files are located within the bids-examples/eeg_face13/derivatives/BIDS-Lossless-EEG/sub-*/eeg folders.

The annotations in this example were produced by the EEG-IP-L pipeline. There are several different types of annotations from this pipeline, including channel annotations, component annotations, binary time annotations and non-binary (continuous) time annotations.

The EEG-IP-L pipeline currently produces an annotations.json, annotations.tsv, and annotations.mat file. The .json describes all of the pipeline annotations. The .tsv contains the channel, component and binary time annotations. The .mat file contains the continuous time annotations. Since the .mat file is not a part of the BIDS specification, this current example has added a ‘recording-marks_annotation.tsv.gz’ and an accompanying 'recording-marks_annotation.json' for continuous time annotations. The 'recording-marks_annotation.tsv.gz' and the .json file were created based on the BIDS spec for storing physiological and other continuous recordings.
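As a sketch of what writing such a tsv.gz + json pair could look like, assuming the conventions from the continuous-recordings spec (a headerless, gzip-compressed TSV, with SamplingFrequency, StartTime and Columns in the JSON sidecar); the function name and column labels are illustrative, not part of the EEG-IP-L pipeline:

```python
# Sketch: write continuous (non-binary) time annotations following the BIDS
# "physiological and other continuous recordings" convention: a headerless
# gzipped TSV with one row per sample, plus a JSON sidecar naming the columns.
import gzip
import json

def write_continuous_annotations(stem, samples, columns, sfreq, start_time=0.0):
    """Write <stem>.tsv.gz (one tab-separated row per sample) and
    <stem>.json (SamplingFrequency, StartTime, Columns metadata)."""
    with gzip.open(f"{stem}.tsv.gz", "wt") as f:
        for row in samples:
            f.write("\t".join(str(v) for v in row) + "\n")
    meta = {
        "SamplingFrequency": sfreq,
        "StartTime": start_time,
        "Columns": columns,
    }
    with open(f"{stem}.json", "w") as f:
        json.dump(meta, f, indent=2)
```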

If we are to store continuous time annotations in a tsv file, one concern we have is the need for two annotations tsv files because the non-binary time annotations are stored differently than the binary time annotations, component annotations, and channel annotations. As all of these annotation types are important for the EEG-IP-L pipeline, we are looking forward to some suggestions around how they can best be stored in BIDS.

Thanks, Sara Stephenson

robertoostenveld commented 2 years ago

Thanks @SaraStephenson!

To help others that want to look at it on their own computer: I just did this to get the changes (which are on a branch that is 160 commits behind and 1 commit ahead of HEAD)

cd bids-examples
git checkout master
git pull origin master # where origin points to git@github.com:bids-standard/bids-examples.git
git checkout -b BUCANL-bep021_ephys_derivatives
git checkout 2105c6d23eddd657e2305fbb5181242c1b1b1545 # go back in time, to avoid conflicts elsewhere
git pull --ff-only git@github.com:BUCANL/bids-examples.git bep021_ephys_derivatives
CPernet commented 2 years ago

YES Robert! thx -- I'll have a look as well. Now of course the question is raw or derivatives: even if pushed in BEP021, we went for raw here

robertoostenveld commented 2 years ago

@SaraStephenson Let me comment on what I encounter while going through the data.

first in derivatives/BIDS-Lossless-EEG

dataset_description.json

README and README.md are duplicates.

The LICENSE file applies to the code, but seems inappropriate for the data, i.e. it is not a data use agreement. The source eeg_face13 is ODbL, which also applies to derivatives (since it is share-alike). I recommend adding the license not only as a file, but also to dataset_description.json. Perhaps you want to move the LICENSE file to the code directory.

Rather than linking to https://jov.arvojournals.org/article.aspx?articleid=2121634 I recommend to link to https://doi.org/10.1167/13.5.22

The file eeg_face13/task-faceFO_events.json can be an empty object {} but not an empty list []. Better would be for it to explain a bit about the events.tsv files.

The electrode files are identical for all subjects; that suggests that they were not measured but are a template. It is not recommended to add template data to the individual subjects. If you want to apply a single template to all subjects, better to put it at the top level (i.e. following the inheritance principle).

The IntendedFor path is inconsistent between sub-001_task-faceFO_annotations.json and sub-001_task-faceFO_desc-qc_annotations.json (one has ./ in front, the other not).

It is not clear to me what the difference is between sub-001_task-faceFO_annotations.tsv and sub-001_task-faceFO_desc-qc_annotations.tsv.

I don't think that the SamplingFrequency field in sub-002_task-faceFO_annotations.json is needed. The corresponding TSV file is not expressed in samples, but in seconds.

I don't think that EDF is the optimal binary format for processed EEG data. EDF is limited to 16 bits, whereas the data was recorded at 24 bits (since it is a BioSemi system) and subsequently processed as single or even double precision. I recommend writing to the BrainVision format, which allows single-precision floats to be represented, or to EEGLAB .set.

The two things you coded in the annotations.tsv files appear to me nearly orthogonal and not related to each other: the first few rows (with chan and comp) don't relate to onset and duration, and the later rows don't relate to channels. Each row has a label, but the chan_xxx and comp_xxx labels appear very different from all the others. Would it not be better to have those in two TSV files? Or possibly even three: a desc-chan_annotations.tsv, desc-comp_annotations.tsv and desc-task_annotations.tsv file?
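The split into three files could be sketched with a short script. This is only an illustration of the suggestion, not an agreed convention; it assumes the combined file has onset, duration, and label columns, and that channel and component rows are recognizable by a chan_/comp_ label prefix (as in the Face13 example):

```python
import csv
import os

def split_annotations(combined_tsv, out_dir):
    """Split a combined annotations.tsv into desc-chan, desc-comp and
    desc-task files, keyed on the label prefix (hypothetical convention)."""
    rows = {"chan": [], "comp": [], "task": []}
    with open(combined_tsv, newline="") as f:
        reader = csv.DictReader(f, delimiter="\t")
        header = reader.fieldnames
        for row in reader:
            if row["label"].startswith("chan_"):
                rows["chan"].append(row)
            elif row["label"].startswith("comp_"):
                rows["comp"].append(row)
            else:
                rows["task"].append(row)
    for desc, subset in rows.items():
        # file naming here is illustrative only
        out = os.path.join(out_dir, f"sub-001_task-faceFO_desc-{desc}_annotations.tsv")
        with open(out, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=header, delimiter="\t")
            writer.writeheader()
            writer.writerows(subset)
```

Each output file then contains only rows of one kind, so onset/duration columns could even be dropped from the chan and comp variants.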

I am not sure (cannot check, since zero bytes) what is in the mat files.

There is a sub-001_task-faceFO_recording-marks_annotation.json file with a data dictionary, but no corresponding data. I would expect that to come with a TSV file (even when it would be empty, i.e. only with the first header row).

Not related to the derivative, but I noticed a typo in eeg_face13/sub-003/eeg/sub-003_task-faceFO_eeg.json: McMaster Univertisy rather than University.

Moving on to BIDS-Seg-Face13-EEGLAB:

Again duplicate README files.

Again PipelineDescription.Version being unfindable on github. Also here the dataset_description could contain more info. There is no license (should be ODbL, since derivatives from the original data should be share-alike).

I cannot review anything else at this level any more (since only binary files), which is not a problem per se.

arnodelorme commented 2 years ago

I think the discussion is fruitful. There is the issue of annotations, and then there is the issue of derivatives data structures.

Let me address the issue of derivatives. I think it is fine to generate the full derivative tree as long as it can be cleaned up. The alternative to having the full hierarchy is to have pipelines as described in Robert’s email of Feb 2, 2022, so I think this strategy covers both approaches.

Three comments

  1. Hierarchy. It is a detail, but I would prefer a hierarchy where the name of the derivative is appended to the derivative folder. For example derivative-filtering, then the subfolder derivative-downsampling. I think it is closer to the current BIDS implementation (we simply need to allow a wildcard after “derivative”). It is also simpler for user browsing (half the number of sub-folders to dig into). So instead of ds003645_selection/derivative/filtering/derivative/downsampling we would have ds003645_selection/derivative-filtering/derivative-downsampling. I am expecting Robert might have resistance to that (he always has a very good reason to do things the way he does :-). Maybe we can vote?

  2. Reproducibility. For the final derivative tree (to be published)

    • each branch should have a DOI and can reference the DOI of the parent instead of being embedded in it (so you can share the derivative folder directly without losing tracking)
    • We need tools that can regenerate the tree from the raw data and the code in the “code” folders and subfolders, for quality control
    • Maybe in the code folder a JSON file with fields: software (i.e. FieldTrip, EEGLAB, MNE), language (Python, MATLAB), and dependencies, which would contain a list as well (with name and version, for example for EEGLAB plugins or other dependencies), and then a field “script” that contains the name of a script in the same folder to execute on the parent BIDS dataset to obtain the current derivative

{
  "software": { "name": "EEGLAB", "version": "2022.0", "url": "xxxx" },
  "language": { "name": "MATLAB", "version": "2021b" },
  "dependencies": [
    { "name": "bids-matlab-tools", "version": "6.1" },
    { "name": "Fieldtrip", "version": "2022_03_10" }
  ],
  "script": { "name": "my_pipeline.m" }
}
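A tool could consume such a provenance file straightforwardly; here is a minimal sketch, with field names taken from the example above (which is itself only a proposal, so everything here is an assumption):

```python
import json

def load_provenance(path):
    """Read the proposed provenance JSON and return the name of the script
    that regenerates this derivative from its parent BIDS dataset."""
    with open(path) as f:
        prov = json.load(f)
    deps = ", ".join(f"{d['name']} {d['version']}"
                     for d in prov.get("dependencies", []))
    # summarize the environment needed to rerun the pipeline
    print(f"software: {prov['software']['name']} {prov['software']['version']}")
    print(f"language: {prov['language']['name']} {prov['language']['version']}")
    print(f"dependencies: {deps}")
    return prov["script"]["name"]
```

A validator or re-execution tool could then look up the returned script name in the same code folder.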

  3. New data file types. We need to define new EEG data files which can be reused (for group analysis etc.) in addition to the processed EEG. For example, leadfield matrix, ERP/ERSP results, ICA, custom results, etc.

Arno


SaraStephenson commented 2 years ago

Thank you for all your comments about the Face13 example @robertoostenveld, I will look into making the appropriate corrections. I want to provide some clarifying information about annotations in the Face13 example dataset so that the discussion about how to best store the different types of annotations (component, channel, binary time and continuous time annotations) in BIDS can continue.

The sub-001_task-faceFO_annotations.tsv file contains the annotations that were produced by the EEG-IP-L pipeline. The sub-001_task-faceFO_desc-qc_annotations.tsv file contains annotations after the manual quality control (QC) procedure has been completed. During the QC procedure, the reviewer can modify some time and component annotations (particularly the ‘manual’ mark) based on visual inspection of the data.

The formatting of our current annotations.tsv files (which contain component, channel, and binary time annotations in one file) is based on a combination of Examples 2 and 3 in Section 5.1: Sidecar TSV Document in the BEP021 google doc.

I have a few concerns about storing the chan, comp, and time annotations in separate files. One concern is that this will result in a large number of annotation files, considering there would also be multiple versions of each file (one for the EEG-IP-L pipeline output and at least one for the QC’ed (desc-qc) data). Another concern is the naming of these annotation files. Currently we use desc-qc to indicate whether the annotations are associated with QC’ed data, but would this complicate naming the different types of annotation files with desc-chan, desc-comp and desc-task?
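The file-count concern can be made concrete with a small enumeration. The entity values below are purely hypothetical (in particular, combining the QC stage and the annotation type into one desc label is a made-up convention, not anything in the BIDS specification):

```python
from itertools import product

# Hypothetical entity values: three annotation types times two pipeline
# stages (raw pipeline output vs. post-QC).
ann_types = ["chan", "comp", "task"]
stages = ["", "qc"]

names = []
for ann, stage in product(ann_types, stages):
    # one (made-up) way to fold both descriptions into a single desc entity
    desc = ann if not stage else f"{stage}{ann.capitalize()}"
    names.append(f"sub-001_task-faceFO_desc-{desc}_annotations.tsv")

for n in names:
    print(n)
# six annotation files per subject and task, before any sessions or runs
```

Whether such combined desc labels are acceptable, or whether a separate entity for the QC stage would be cleaner, is exactly the naming question raised here.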

The .mat file contains the continuous time annotations (such as the AMICA log likelihood). Since the .mat file is not part of the BIDS specification, this current example adds a recording-marks_annotation.tsv.gz and an accompanying recording-marks_annotation.json for continuous time annotations. The recording-marks_annotation.tsv.gz and the .json file were created based on the BIDS spec for storing physiological and other continuous recordings. The recording-marks_annotation.tsv.gz in the Face13 example contains 100 rows for each of the annotations listed in the recording-marks_annotation.json. These new files were created because the continuous time annotations cannot be stored in the same way the component, channel, and binary time annotations are currently stored.

Hopefully this example can help move the discussion on how to store annotations (particularly continuous time annotations) in BIDS forward.

Thanks, Sara

smakeig commented 2 years ago

Sara -

I wonder if it would be productive to call what you refer to as 'continuous time annotations', rather, 'continuous time data measures'. You give the example of AMICA model likelihoods; other measures could include RMS amplitude, "theta/beta ratio", etc. (any of which might be used in some data quality, cleaning, or evaluation pipeline). In other words, I'd suggest treating the AMICA likelihood index as a derived data channel time-synced with the original data channels, reserving the term 'annotation' for text or numeric markers of facts pertaining either to the whole run (as with basic metadata) or to some portion of it (as with event annotations).

Scott

On Tue, Mar 22, 2022 at 6:52 PM Sara Stephenson @.***> wrote:

Thank you for all your comments about the Face13 example @robertoostenveld https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_robertoostenveld&d=DwMFaQ&c=-35OiAkTchMrZOngvJPOeA&r=KEnFjcsfiKF_BPOsgvPP912y1yQ0q05CJ14uAvMNdNQ&m=cC2TsLzL9KCPRV5UYYh9_xtdEiXYHV3qmLSg8Pf3kgwQciIi1h1TgiIXQz5W612D&s=RwFiAbQuvIn9iB_IeUl19NBp7OtEj8oUwlSAyaawuOs&e=, I will look into making the appropriate corrections. I want to provide some clarifying information about annotations in the Face13 example dataset so that the discussion about how to best store the different types of annotations (component, channel, binary time and continuous time annotations) in BIDS can continue.

The sub-001_task-faceFO_annotations.tsv file contains the annotations that were produced by the EEG-IP-L pipeline https://urldefense.proofpoint.com/v2/url?u=https-3A__www.sciencedirect.com_science_article_pii_S0165027020303848&d=DwMFaQ&c=-35OiAkTchMrZOngvJPOeA&r=KEnFjcsfiKF_BPOsgvPP912y1yQ0q05CJ14uAvMNdNQ&m=cC2TsLzL9KCPRV5UYYh9_xtdEiXYHV3qmLSg8Pf3kgwQciIi1h1TgiIXQz5W612D&s=Z8Ogw_P1AhwacLF3_0aijaH89R_Kak1Rm3KGWU-QhVc&e=. The sub-001_task-faceFO_desc-qc_annotations.tsv file contains annotations after the manual quality control (QC) procedure has been completed. During the QC procedure, the reviewer can modify some time and component annotations (particularly the ‘manual’ mark) based on visual inspection of the data.

The formatting of our current annotations.tsv files (that contains component, channel, and binary time annotations in one file) are based on a combination of Examples 2 and 3 in Section 5.1: Sidecar TSV Document in the BEP 021 google doc https://urldefense.proofpoint.com/v2/url?u=https-3A__docs.google.com_document_d_1PmcVs7vg7Th-2DcGC-2DUrX8rAhKUHIzOI-2DuIOh69-5Fmvdlw_edit-23heading-3Dh.begtazq5lz86&d=DwMFaQ&c=-35OiAkTchMrZOngvJPOeA&r=KEnFjcsfiKF_BPOsgvPP912y1yQ0q05CJ14uAvMNdNQ&m=cC2TsLzL9KCPRV5UYYh9_xtdEiXYHV3qmLSg8Pf3kgwQciIi1h1TgiIXQz5W612D&s=3KYY2QUgq1Zd0hGMFGsUe3avKxWr4-ryJLpjKToA6_Q&e= .

I have a few concerns about storing the chan, comp, and time annotations in separate files. One concern is that this will result in a large number of annotation files considering there would also be multiple versions of each file (one for the EEG-IP-L pipeline output and at least one for the QC’ed (desc-qc) data). Another concern is the naming of these annotation files. Currently we use desc-qc to indicate if the annotations are associated with QC’ed data, but would this complicate naming the different types of annotation files with desc-chan, desc-comp and desc-task?

The .mat file contains the continuous time annotations (such as the AMICA log likelihood). Since the .mat file is not a part of the BIDS specification, this current example has added a recording-marks_annotation.tsv.gz and an accompanying recording-marks_annotation.json for continuous time annotations. The recording-marks_annotation.tsv.gz and the .json file were created based on the BIDS spec for storing physiological and other continuous recordings https://urldefense.proofpoint.com/v2/url?u=https-3A__bids-2Dspecification.readthedocs.io_en_stable_04-2Dmodality-2Dspecific-2Dfiles_06-2Dphysiological-2Dand-2Dother-2Dcontinuous-2Drecordings.html&d=DwMFaQ&c=-35OiAkTchMrZOngvJPOeA&r=KEnFjcsfiKF_BPOsgvPP912y1yQ0q05CJ14uAvMNdNQ&m=cC2TsLzL9KCPRV5UYYh9_xtdEiXYHV3qmLSg8Pf3kgwQciIi1h1TgiIXQz5W612D&s=DW3z-6NI2tAjm_R3O1YbI8UCbzBZvoqJf1t53ixqp-I&e=. The recording-marks_annotation.tsv.gz in the Face13 example contains 100 rows for each of the annotations listed in the recording-marks_annotation.json. These new files were created because the continuous time annotations can not be stored in the same way the component, channel, and binary time annotations are currently stored.

Hopefully this example can help move the discussion on how to store annotations (particularly continuous time annotations) in BIDS forward.

Thanks, Sara

— Reply to this email directly, view it on GitHub https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_bids-2Dstandard_bep021_issues_5-23issuecomment-2D1075728495&d=DwMFaQ&c=-35OiAkTchMrZOngvJPOeA&r=KEnFjcsfiKF_BPOsgvPP912y1yQ0q05CJ14uAvMNdNQ&m=cC2TsLzL9KCPRV5UYYh9_xtdEiXYHV3qmLSg8Pf3kgwQciIi1h1TgiIXQz5W612D&s=9wZxec_T6vXmANINfG5W1rPzwqMGqVARU9GbI5rt0MU&e=, or unsubscribe https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_notifications_unsubscribe-2Dauth_AKN2SFWI5H7Z7SSSWGOM5MDVBJFKHANCNFSM4XUNFFSA&d=DwMFaQ&c=-35OiAkTchMrZOngvJPOeA&r=KEnFjcsfiKF_BPOsgvPP912y1yQ0q05CJ14uAvMNdNQ&m=cC2TsLzL9KCPRV5UYYh9_xtdEiXYHV3qmLSg8Pf3kgwQciIi1h1TgiIXQz5W612D&s=dYo2eHv8Ui2wQ-n7YSY-F7g2cg0ms3z_m85ItIbam1M&e= . You are receiving this because you were mentioned.Message ID: @.***>

-- Scott Makeig, Research Scientist and Director, Swartz Center for Computational Neuroscience, Institute for Neural Computation, University of California San Diego, La Jolla CA 92093-0559, http://sccn.ucsd.edu/~scott

robertoostenveld commented 2 years ago

I'd suggest treating the AMICA likelihood index as a derived data channel time sync'ed with the original data channels

Continuous data that is time-synced with other data is already part of the BIDS specification and documented here: https://bids-specification.readthedocs.io/en/stable/04-modality-specific-files/06-physiological-and-other-continuous-recordings.html. A very similar approach (again with TSV files) is used for PET blood recording data.
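Following that part of the specification, a derived continuous measure would be written as a header-less, gzipped TSV plus a JSON sidecar carrying SamplingFrequency, StartTime and Columns. A minimal sketch for something like the AMICA log likelihood (the recording-amica label and the column name are assumptions for illustration, not agreed BEP021 names):

```python
import gzip
import json

def write_continuous_measure(basename, values, sfreq, start_time=0.0,
                             column="amica_loglik"):
    """Write a derived continuous measure following the BIDS layout for
    continuous recordings: a header-less gzipped TSV plus a JSON sidecar
    with SamplingFrequency, StartTime and Columns."""
    with gzip.open(f"{basename}.tsv.gz", "wt") as f:
        for v in values:
            f.write(f"{v}\n")
    sidecar = {
        "SamplingFrequency": sfreq,
        "StartTime": start_time,
        "Columns": [column],
    }
    with open(f"{basename}.json", "w") as f:
        json.dump(sidecar, f, indent=2)
```

Usage would then be something like write_continuous_measure("sub-001_task-faceFO_recording-amica_physio", loglik, sfreq=512), keeping the measure time-locked to the EEG via StartTime and SamplingFrequency rather than via per-row timestamps.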

arnodelorme commented 2 years ago

Thanks Robert,

Do you know of an EEG dataset (joint EEG and eye-tracking, or accelerometer, etc.) that uses this synchronization scheme?

Cheers,

Arno


jesscall commented 2 years ago

Hi BEP021 community, I'm looking to move forward on a few points from @SaraStephenson 's thread above:

The formatting of our current annotations.tsv files (that contains component, channel, and binary time annotations in one file) are based on a combination of Examples 2 and 3 in Section 5.1: Sidecar TSV Document in the BEP 021 google doc.

[...] To store them in separate files, a few questions/concerns pop up: a. This creates a large number of annotation files, multiplied by pipeline (EEG-IP-L) and QC outputs (desc-qc). b. Naming all these files: currently we find it very useful to use desc-qc to indicate the annotations are associated with QC'ed data, but would this be possible with the convention of desc-chan, desc-comp and desc-task?

@robertoostenveld What are your thoughts on this? Should this be revisited?

We'd like to avoid unnecessary complexities in file naming, and Sara's example follows examples 2 and 3 of BEP021 5.1: Sidecar TSV.

...

Second,

I wonder if it would be productive to call what you refer to as 'continuous time annotations' as, rather, 'continuous time data measures' -

@smakeig thank you -- calling them "measures" rather than annotations addresses this nicely, and works with @robertoostenveld and @CPernet's prior comments on storing them in TSV files as continuous recordings; see the spec linked in the previous comments.

@SaraStephenson, I think we'll try moving away from "annotations" - perhaps recording-marks_amica.tsv.gz could work?

dorahermes commented 2 years ago

I agree with differentiating continuous time measurements from event-based annotations with an onset and duration.

@tpatpa and I are making some updates to the examples for event-based annotations with an onset and duration that use HED/SCORE tags.