bids-standard / pybids

Python tools for querying and manipulating BIDS datasets.
https://bids-standard.github.io/pybids/
MIT License

Possible module for generating data acquisition report #99

Closed: tsalo closed this issue 6 years ago

tsalo commented 6 years ago

I brought this up in the pybids channel on the Brainhack Slack team, and it seemed like something that might be worth incorporating into pybids (e.g., @tyarkoni recommended adding a top-level reporting module). The idea is to use the information in the BIDS dataset’s json and nifti files to write up the data acquisition portion of a methods section. I wrote something to do this a while ago (here are the functions and a notebook with an example), but it isn’t very flexible and I’ve only tested it on one dataset. Anyway, I’ve been working on this in my fork, but wanted to discuss the idea (and possible implementation) here in more detail.
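At its core, the idea amounts to pulling acquisition parameters out of a scan's JSON sidecar and substituting them into template sentences. A toy sketch of that mechanic (not the actual module code; the sidecar contents are invented, though the field names follow the BIDS spec):

```python
# Toy sketch of the reporting idea: pull acquisition parameters from a
# BIDS JSON sidecar and fill a template sentence. The sidecar contents
# here are invented; field names follow the BIDS specification.
sidecar = {
    "RepetitionTime": 2.0,   # seconds
    "EchoTime": 0.03,        # seconds
    "FlipAngle": 90,         # degrees
}

template = ("Functional images were collected with TR = {RepetitionTime} s, "
            "TE = {te} ms, and a flip angle of {FlipAngle} degrees.")
# Convert TE to milliseconds, as methods sections usually report it.
sentence = template.format(te=round(sidecar["EchoTime"] * 1000), **sidecar)
print(sentence)
```

The real module would of course need per-sequence templates and graceful handling of missing fields, but the JSON-to-prose substitution above is the basic mechanism.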

chrisgorgo commented 6 years ago

This looks very cool! You should definitely make a pull request.

tyarkoni commented 6 years ago

Agreed, this is great! My main suggestion is to maybe add a class that wraps all of the functionality you currently have in your for-loop and makes intelligent guesses about what the user wants (potentially with the help of some boolean arguments). Something like the following would be nice, from a UI standpoint:

>>> layout = BIDSLayout('/path/to/project')
>>> reporter = BIDSReport(layout)
>>> reporter.generate()
"MR data were acquired using a 3-Tesla Siemens Prisma MRI scanner. ..."

It would also be nice to capture **kwargs in the generating function and use them to subset all found reportable objects. I.e., by default, you'd get back a report for every valid file detected in the BIDSLayout, but you could do things like modality='func|anat', in which case only the 'func' and 'anat' blocks would get executed, and other modalities like 'dwi' or 'fmap' would be ignored.
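To illustrate that subsetting idea (all names hypothetical, with a minimal stand-in in place of `BIDSLayout`, and exact-match filters rather than the `'func|anat'` regex-style matching suggested above):

```python
class FakeLayout:
    """Stand-in for BIDSLayout: holds (entities, description) pairs."""
    def __init__(self, files):
        self.files = files

    def get(self, **filters):
        # Return descriptions whose entities match every filter.
        return [desc for entities, desc in self.files
                if all(entities.get(k) == v for k, v in filters.items())]


class BIDSReport:
    def __init__(self, layout):
        self.layout = layout

    def generate(self, **kwargs):
        # Forward any entity filters to layout.get(), so e.g.
        # generate(modality='func') reports only functional scans.
        return " ".join(self.layout.get(**kwargs))


layout = FakeLayout([
    ({"modality": "func"}, "Functional MR data were acquired..."),
    ({"modality": "anat"}, "A T1-weighted image was acquired..."),
])
report = BIDSReport(layout).generate(modality="func")
```

With no keyword arguments, `generate()` would describe every file in the layout; passing filters narrows the report to the matching blocks.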

Anyway, just some suggestions; I think it would be fine to merge an initial version pretty much as-is. But we should figure out what to call the module--personally I'd probably go with bids.reports. On the assumption that at some point down the line we might want to add reporting tools for methods other than (f)MRI, it might make sense to import all of the domain-specific stuff out of bids.reports.mri though.

BenInglis commented 6 years ago

Hi Taylor, I was sent this way by Chris. Can I ask questions & offer feedback on the example notebook you posted? I only have a Trio but I have looked at a lot of Prisma protocols already, and tinkered on a couple of systems, so I have a pretty good idea what's involved on the scanner end. I would suggest starting with a relatively standard protocol, such as the Prisma protocol from the Human Connectome Project. It is already the basis of ADNI-3, for example, and others (like UK Biobank) are leaning heavily on it. Can we start there?

drmowinckels commented 6 years ago

I love this idea so much! Just a question, would this assume that everything within a dataset has the exact same acquisition parameters?

This is just a thought, and comes from the fact that we are treating what initially were multiple datasets acquired over years and different scanners (longitudinally, and we have upgraded scanners and parameters). Would this feature detect that there are multiple acquisition parameters and then report an error, or are you thinking it could manage to report several different acquisition set-ups?

This is likely not a large issue, and may be very specific to our data.

Remi-Gau commented 6 years ago

Love the idea. With that kind of idea in mind, I recently suggested on the BIDS mailing list making BIDS more COBIDAS-compatible, so that methods reporting is up to the standard the field should aim for. I have started (very, very early) creating a doc to keep track of what exists in BIDS that corresponds to the COBIDAS requirements and what is still missing: https://docs.google.com/spreadsheets/d/1-4mNtC1NsnV3NRpc24kRdgYyIs4a0OtRwZN_0W5VSJQ/edit#gid=0

tsalo commented 6 years ago

@tyarkoni I like that idea. I will open a PR (thanks @chrisfilo for the push) once I've tested out the BIDSReport approach.

@BenInglis Any feedback or questions would be greatly appreciated. As for starting with a well-known, standard protocol: I like that idea, but are the data readily available in BIDS format? At least for right now, I was planning to download a couple of OpenfMRI datasets and test on them. It would be nice to be able to test on something with a lot of documentation, but at least most of the OpenfMRI datasets are associated with papers that I can look at.

@Athanasiamo I didn't want to assume that all of the subjects/sessions had the same parameters, so I am trying to make it return all of the different protocols within a dataset. At the moment, my approach is to use a Counter on the full descriptions, so you get each unique write-up along with the number of subjects with that protocol. Of course, that approach can't distinguish between different protocols and missing data, but I figure it's better than nothing. I'd be more than happy to try doing it a different way if anyone has any other ideas.
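Concretely, that Counter strategy amounts to something like the following (the subject descriptions are invented for illustration):

```python
from collections import Counter

# Each subject's full acquisition write-up is rendered to one string;
# a Counter then tallies identical strings, so each unique protocol
# description is reported with the number of subjects it applies to.
descriptions = {
    "sub-01": "MR data were acquired with protocol A.",
    "sub-02": "MR data were acquired with protocol A.",
    "sub-03": "MR data were acquired with protocol B.",
}

counts = Counter(descriptions.values())
for text, n_subjects in counts.most_common():
    print(f"{text} ({n_subjects} subjects)")
```

As noted above, this groups by the rendered text, so a subject with missing data and a subject on a genuinely different protocol both just show up as extra, lower-count strings.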

@Remi-Gau I am not on the mailing list, so I missed that, but that's awesome! I tried to use the COBIDAS paper and the BIDS specification when I was writing the reporting code, but it will be really helpful to have a summary document to reference. I'm also less than confident about how I handle some things (e.g., slice orders for functional data, reporting sequences and sequence variants), and maybe your document can help with that.

BenInglis commented 6 years ago

@tsalo I'm not a BIDS person (or even an fMRI person, since I run a facility and help folks get their acquisition correct) so I suggest we first reach consensus on the most appropriate protocols to work on as our test case(s). Are you able to identify & grab OpenfMRI data sets that used either the HCP or an HCP-like protocol, run on a Siemens Prisma with VE11C software and a 64ch receive coil? @chrisfilo may be able to help locate.

I suggest Prisma and 64ch coil to ensure we deal with all the current sequences and parameter options. Whatever we do for this combo will have utility for a couple of years at least. Filling holes for older Siemens scanners & software, other vendors and missing scan types should then be a tractable process.

FWIW, I have actually been building a massive comparison protocol that includes (near) HCP, ADNI-3, UKBiobank and Western Uni (Ravi Menon's) TBI protocol. This is for use on my aging Trio running VB17A, but the point is that I am now becoming familiar with the parameter choices and some of the trickier options to include, e.g. elliptical k-space for the T2-FLAIR in the UK Biobank protocol.

@Remi-Gau Thanks. Always easier & quicker to edit something than start with a blank page! As a non-BIDS person, how do we try to reconcile @tsalo's draft module with your spreadsheet? If @tsalo is able to find a suitable HCP data set and run his module on it, I can then go through the notebook it produces and make sure it makes sense from a physics perspective. (I already have questions on the meaning of segmented k-space in Taylor's first example. It may not fit the usual definition we use.) Then what?

chrisgorgo commented 6 years ago

Hi everyone, Great to see more people involved! I believe it would be best if we try to organize this work a little bit. I personally feel that there are two independent projects here:

1) A tool to generate paper snippets given a dataset described using the current version of the BIDS specification. This is the work @tsalo started, and this is, I believe, the best forum to discuss it, since the plan is to integrate this code into pybids. For clarity I would restrict this work to the current version of BIDS. Using OpenfMRI examples seems like a great idea (the newer ones should have more metadata). Feedback on the readability and accuracy of the text created by the tool would be very important (thanks @Athanasiamo!).

2) Comparing the current BIDS specification with COBIDAS and COBIDASAcq to identify missing fields, as well as fields that should be RECOMMENDED or REQUIRED (the latter would have to wait for BIDS 2.0). This work, started by @Remi-Gau, seems better suited for the mailing list, since that way it will reach more people. Feedback from @BenInglis and other physicists would be essential.

To answer the specific question - I don't know of any OpenfMRI dataset that used the exact same protocol as HCP, but Midnight Scan Club might be worth looking at instead: https://openfmri.org/dataset/ds000224/

If you run into issues with the datasets don't hesitate to report them at openfmri.org

I hope this helps!

BenInglis commented 6 years ago

Thanks @chrisfilo. Now I understand. For now I'd like to focus on the first item: helping @tsalo get his module working. There seem to be two parts to this, now that I look at it more closely. There is the fMRI methods reporting, e.g. the task details, and then there is the acquisition reporting. I am an amateur on the former, so someone else should definitely join this effort! I can, however, handle the latter for Siemens scanners. I think the Midnight Scan Club data sets might be the perfect way to test the fMRI task reporting, based on sheer variety. We need to find other protocols to get the acquisition reporting correct though.

A thought: should we continue choosing ad hoc the protocols we use to set up the reporting? We can certainly beta test with wild-type data sets, but does it not make more sense to use predefined data in the first instance? It's not hard for me to generate a ton of fMRI and anatomical scan data to be used for initial setup and testing. (They could go into OpenfMRI in a test/reference data section, perhaps?) Then we have these as a reference in case something is found to be incorrect later on, and others can easily start from the same point. I can also make documentation from scratch, rather than reverse engineering published data.

@Remi-Gau Let me ponder your spreadsheet for a while. I need to get up to speed on BIDS specs. I'd prefer to do this organically, as I help out with the BIDS-compatible reporting. I want to wait in part because I am currently trying to digest new-to-me nomenclature on some PCASL sequences, and I am already aware that we need to quickly converge on standardized terms or things will get very messy indeed. (It's already bad enough that we have MB and SMS used for the same thing. Wait until you see what's in the PCASL treasure chest!)

tsalo commented 6 years ago

@BenInglis I haven’t been able to find any public datasets that work perfectly so far. Even on OpenfMRI, there’s a lot of heterogeneity between datasets. I haven’t found any that used the HCP protocol, though I might have missed it. I like the idea of using a manufactured dataset when developing and testing the code.

I think it’s possible to describe the tasks from the events files and metadata, but I haven’t tried to figure that out yet.

I’ve opened a pull request (#100) to make it easier for everyone to look through the code in its current state. It’s a work in progress, but it does work (to some extent) with a number of OpenfMRI datasets (224, 229, 233, 245, and 253), though not on others (237 and 241). Unfortunately, the only one with field maps was 224 (the Midnight Scan Club dataset), and I don’t think any of them had DWI data.

BenInglis commented 6 years ago

@tsalo I'm not a BIDS, OpenfMRI or any sort of coding or fMRI person, so can I rely on you to direct me? My expertise ends abruptly when the data leave the scanner. What would therefore work best for me would be this work flow:

  1. I devise three protocols, to include as many pulse sequences & variants as can be covered reasonably. Likely based on: (a) HCP, (b) a synthesis of scans I see often in other common protocols, and (c) a composite of "advanced" or "rare" scans. In each case I'll try to include as many features as might appear on GE & Philips scanners, not just Siemens, e.g. use of SENSE rather than GRAPPA.

  2. Distribute the three protocols for feedback from the BIDS community. Iterate as needed.

  3. I acquire all three protocols on my Siemens Trio and upload all data to OpenfMRI. (May need your help for that.) If there's interest, I assist others in translating any of the protocols onto other scanner platforms & software versions, so they can acquire & upload further standard test data sets to OpenfMRI.

  4. You run your magic on my test data sets & I review your reports to compare against my detailed knowledge of the full protocols.

  5. Once we are happy with how your scripts report my test data (and maybe others' test data in a second wave of tests), we move on to select wild-type data from OpenfMRI. By then I should have sufficient knowledge of both OpenfMRI and BIDS not to be total dead weight on these fronts.

tsalo commented 6 years ago

@BenInglis I really like that idea. Thanks! There's not much input I could provide until step 2, though maybe @chrisfilo might have some insights regarding the types of scans to include?

chrisgorgo commented 6 years ago

As much as I would love more data to be submitted to OpenfMRI I am not convinced that this project would require new data being acquired. At least not in the initial stage.

Wouldn't it be quicker/better to start with some historical data acquired in your facility (so you know all of the parameters)? Later, when more precise needs arise, new data could be acquired. WDYT?

BenInglis commented 6 years ago

Hi @chrisfilo, I only have control over my own test data, most of which has strange parameter settings relevant to the particular test. Rarely do I run anything that looks like a routine scan. So it's actually far quicker for me to obtain dedicated new data.

The sticking point for me, whether it's new or old data, is getting it into BIDS in a format you guys are used to using. My role ends with DICOM data ported off the scanner. The only time I deal with NIfTI is when someone sends me problem data sets to assess. (For that I use MRIcron.) So there's a slightly circular problem right now: do I start with existing BIDS/OpenfMRI data and take on the task of ensuring that I can check all parameters, possibly in the absence of a PDF printout of the scanner protocol, or do I obtain new data for which I have complete knowledge of the acquisition parameters, but which sit in the wrong format in the wrong place?

For me, it would be best to ship DICOM data to someone, let you guys do whatever you want to it (JSON, NIfTI), and then run the scripts to interpret the headers. I can then "score" the interpreted reports against my gold-standard info. Any concerns whatsoever, I can easily go back to the scanner and acquire new data. (This issue came up today on Twitter, with the question about table repositioning. It's super easy to acquire new test data and push it through whatever test pipeline we define.) So, the question I have is: given a ball of data, where do I send it, and how?

chrisgorgo commented 6 years ago

Ok I see. I can BIDSify a small dataset if needs be (although this month things are crazy with travel so expect some delay).

BTW, NKI Enhanced provides DICOMs and MR protocol reports: http://fcon_1000.projects.nitrc.org/indi/enhanced/mri_protocol.html

Best, Chris


BenInglis commented 6 years ago

Hi @chrisfilo, these would be excellent beta test data sets, but I would prefer to start with data acquired with the most recent version of CMRR's MB sequences. There are several new parameters in the most recent version (R016a) compared to whatever was used by NKI. (I could probably guess which version they used based on the missing parameters and cross-referencing to the MB release notes, but the actual sequence revision isn't included in the protocol or the PDF.)

One other thing, for you and @tsalo, and anyone else who might be interested. I can acquire these test data using our custom printed head restraint. Including pulse oximetry and chest motion is easy. (I also have expired CO2 working but not yet sure how robust that is. Also working on real time blood pressure, but no ETA on that.) So it would be trivial to make these data do a few things simultaneously. We do plan on uploading to OpenfMRI a whole set of restrained head data, plus physio, but that is going to take a few months. We could consider these first tests as a down payment on a more extensive data set to come.

tsalo commented 6 years ago

@BenInglis @chrisfilo Sorry for the delay in responding. If either of you can provide a BIDS dataset with known parameters and a range of scan types, that would be great. However, even without those kinds of complete datasets, I can continue to work on the module.

Currently, I am trying to figure out a better way to handle missing data or heterogeneous datasets. At the moment, my strategy is to generate a large formatted string for each subject that covers all sessions. Then, the code uses a Counter to group subject-specific writeups into equivalent strings. The logic there was that, in the case of missing data, the most common string would probably reflect complete data for a subject, and the other strings would correspond to random missing data and could be ignored. That approach doesn't work so well with heterogeneous datasets where different subjects get different protocols (which I know was of interest to @Athanasiamo). I'm sure that users can work with a Counter if necessary, but I'm curious to know if anyone has any ideas for a better way to do it.

chrisgorgo commented 6 years ago

On top of the examples I already mentioned in this thread, I would also look at recent datasets in OpenfMRI (but only focus on those that have papers linked to them), as well as the MyConnectome dataset.

Best, Chris


tsalo commented 6 years ago

I downloaded a number of datasets and figured I should share the results. What looks okay to me might not look good to others. I wasn't sure if I should add the files to a branch or not, so I made a Drive folder with a notebook and the output files: https://drive.google.com/drive/folders/13Cd4UimmpXpT0BMX4qPMH3oBpGbO0KvF?usp=sharing

I will check each of the datasets for papers and start looking at the original methods sections as well, but at least now anyone can take a look at the outputs without having to download datasets and clone my fork of pybids.

I'll make sure to download and test on MyConnectome as well.

chrisgorgo commented 6 years ago

We should not let the quest for perfection stop us from doing something useful. Your tool will be one of a kind even if it produces only a partial data acquisition description. We can modify the validator in the future to encourage users to record more metadata, making those descriptions better.

Best, Chris


tyarkoni commented 6 years ago

Merged in #100.