I guess that really depends on what you mean by reproducible and reliable.
If you mean reproducible strictly in the sense that we will get the same results from one call to the next, then we're already there. Switching over to the v2.5 selection criteria with #119 will not impact that.
If by reliable you mean that the results are reasonable and have validity, then we have the same problems that the v2.5 selection criteria did. We can (and should!) do more testing here, but the question of person hours will probably break down along a few lines:
This is assuming that we want to improve on the previously published MEICA denoising! If we're ok to just re-implement it, then we have much less work to do.
The point I want to get to is an understandable algorithm where each step makes sense and it can process the data in a way that's similar to v2.5: essentially, the point where we can comfortably suggest that people consider using this code over the original v2.5 code (and where new contributors don't need to worry about accidentally playing around with a brittle section of the algorithm). I'm not including improved modularization, further changes to I/O, or provenance. My estimate is that it would take around 100 person hours of work to cut & paste v2.5 back in, comment the code, and make sure different sections aren't interacting in detrimental ways. That amount of time is not overwhelming, but it's not trivial either.
If we're just talking about re-incorporating v2.5, then I think we're a lot closer than 100 hours. We need to address a PR I've opened onto @emdupre's branch, emdupre/tedana#17, but after that there will only be a couple of things we'll need to deal with before merging #119.
In terms of running the pipeline, these changes will "work", but we'll need to check the results to make sure that they're what we expect. I don't know how many person hours that will require, or what steps we'll want to take for that.
@handwerkerd Do you have the task timing file for the 5-echo run you shared with us? I think that a good sanity check for us would be to define a native-space V1 ROI (it seems like that's the region with the most block-y time series, based on the ICA component weight for the component that looks most like a task) and to plot the mean time series at each stage of denoising, sort of like I do in the processing pipeline details page of the docs.
I know it's not the optimal solution, which is why we have that tedana-comparison repository for more comprehensive comparisons and checks, but it'll serve as a quick check of the denoising steps to make sure that neuronal signal is not being lost, which I think might serve as a good holistic marker of denoising reliability.
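For reference, a minimal sketch of that ROI check with nilearn; the file names here are hypothetical placeholders for the single-echo, optimally combined, and denoised outputs plus a native-space V1 mask.

```python
import matplotlib.pyplot as plt
from nilearn.masking import apply_mask

# Hypothetical file names; substitute the actual tedana outputs and ROI mask.
stages = {
    "TE3": "echo-3_bold.nii.gz",
    "OC": "ts_OC.nii.gz",
    "DN": "dn_ts_OC.nii.gz",
}
v1_mask = "v1_native_mask.nii.gz"

fig, ax = plt.subplots(figsize=(10, 4))
for label, fname in stages.items():
    data = apply_mask(fname, v1_mask)  # returns a time x voxels array
    ax.plot(data.mean(axis=1), label=label)  # mean V1 time series per stage
ax.set_xlabel("Volume")
ax.set_ylabel("Mean V1 signal (a.u.)")
ax.legend()
plt.show()
```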
The task has a flashing checkerboard that is 20 s on and 40 s off, repeated 5 times. TR = 2 s, and there are 160 volumes in the time series (320 seconds). The first stimulus should appear on volume 6 (the first 5 volumes are pre-task). If the first volume has a timing of 0 seconds, then the stimuli appear at 10, 70, 130, 190, and 250 seconds. For what it's worth, I wouldn't be surprised if the main effect in V1 is sometimes removed as midK. This was an issue with the original v2.5 algorithm, which was corrected by getting rid of a criterion that midK-ed components primarily because they had high variance explained.
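For anyone scripting this, a quick sketch of that design as a TR-sampled boxcar (the timing values come straight from the description above):

```python
import numpy as np

tr, n_vols = 2.0, 160
onsets = np.array([10, 70, 130, 190, 250])  # seconds
duration = 20.0                             # seconds on per block

times = np.arange(n_vols) * tr
boxcar = np.zeros(n_vols)
for onset in onsets:
    boxcar[(times >= onset) & (times < onset + duration)] = 1.0
```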
Also, using the data in this way is one of the reasons why this data set was acquired. (See: https://fim.nimh.nih.gov/presentations/effects-multi-echo-based-denoising-reliability-massively-repeated-block-design-task ). Those were all 3-echo runs, but I collected a few 5-echo runs for each of the volunteers.
Thanks! What do you think of something like this:
This was produced from a basic run (all defaults) on the 5-echo data with the current version of tedana. I think that the trend we're seeing in the OC time series might actually be introduced by the GSR that's run by default. When I run it again using the same settings, but without GSR, we get this:
Not that that's the point of this issue. Just thought it was interesting.
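In case it helps anyone reproduce that comparison, here's roughly how the two runs could be scripted. The `gscontrol` argument name, file names, and echo times are assumptions, not confirmed parts of the current interface; check the signature of the installed tedana version before relying on this.

```python
from tedana.workflows import tedana_workflow

# Hypothetical inputs: five echoes and their echo times in ms.
echo_files = ["echo-1.nii.gz", "echo-2.nii.gz", "echo-3.nii.gz",
              "echo-4.nii.gz", "echo-5.nii.gz"]
echo_times = [12.0, 28.0, 44.0, 60.0, 76.0]

# Default run (GSR on, per the discussion above).
tedana_workflow(echo_files, echo_times, out_dir="tedana_gsr")

# Re-run with global signal regression disabled (assumed option name).
tedana_workflow(echo_files, echo_times, out_dir="tedana_no_gsr",
                gscontrol=None)
```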
I'm not quite sure what I'm supposed to get from the shading in the plots. The range of values across voxels isn't that meaningful. It's good to see the mean plots, but a fit & residual for the task design would give a measure of whether or not fit quality changed much.
I was hoping that the confidence bands would shrink from TE3 to OC to DN, but that didn't pan out.
I'll take a crack at running the model and getting the residuals.
I might be a bit confused. You have a single run and multiple voxels in that run. Are the confidence bands the variation across voxels in this case? If so, I don't see any reason to predict more similarity across brain locations with better denoising. Much of that variation would be from spatial neural or neurovascular variation, not scanner noise.
I think that's probably true, although I think the extreme similarity of the error bands is in part due to how I normalized the time series. When I do it by voxel, and use the standard deviation for the error bands, some differences come out. At points where the mean is pretty much the same for all of the derivatives, the DN time series generally has a smaller error band.
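For concreteness, one way to implement that per-voxel normalization (just a sketch, assuming a time x voxels array like the one `apply_mask` returns in the earlier snippet):

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_mean_with_band(data, label, ax):
    """data: 2D array, time x voxels. Plots the across-voxel mean of the
    z-scored time series with a +/- 1 SD band across voxels."""
    zscored = (data - data.mean(axis=0)) / data.std(axis=0)
    mean_ts = zscored.mean(axis=1)
    sd_ts = zscored.std(axis=1)
    t = np.arange(data.shape[0])
    ax.plot(t, mean_ts, label=label)
    ax.fill_between(t, mean_ts - sd_ts, mean_ts + sd_ts, alpha=0.2)
```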
Regardless, I'll focus on the model fit approach. It shouldn't be too hard to do with nistats.
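A rough outline of that fit-and-residual check, kept self-contained with a plain least-squares fit of a convolved boxcar (a canonical HRF and `FirstLevelModel` from nistats would be the more complete route):

```python
import numpy as np
from scipy import stats

tr, n_vols = 2.0, 160
times = np.arange(n_vols) * tr

# Boxcar from the timing described above (see the earlier sketch).
boxcar = np.zeros(n_vols)
for onset in [10, 70, 130, 190, 250]:
    boxcar[(times >= onset) & (times < onset + 20)] = 1.0

# Crude single-gamma HRF peaking around 5-6 s; a canonical HRF from
# nistats/nilearn would be preferable in practice.
hrf_t = np.arange(0, 30, tr)
hrf = stats.gamma.pdf(hrf_t, a=6)
regressor = np.convolve(boxcar, hrf)[:n_vols]

def fit_task(ts, regressor):
    """OLS fit of a mean ROI time series; returns betas and residuals."""
    design = np.column_stack([regressor, np.ones_like(regressor)])
    betas, *_ = np.linalg.lstsq(design, ts, rcond=None)
    residuals = ts - design @ betas
    return betas, residuals

# Example usage, where ts is a mean ROI time series like the ones plotted above:
# betas, resid = fit_task(ts, regressor)
# print(betas[0], resid.var())
```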
Sorry for being behind here! @tsalo and @handwerkerd: do you think that this analysis will give us a good index of whether denoising is working reliably?
I know that the data was originally collected with something like that in mind, but I'm not sure if this is what we want when we say "reliable" and "reproducible" (since it looks like we never finished having that conversation!). I imagine this will give us a good index of whether we think denoising is working "well", but that might be distinct from our reliability and reproducibility goals.
It'd be great to have all of the above of course, but I just want to make sure we're all on the same page!
I guess you're right in that it's not a measure of reliability, and we should probably focus on how "well" tedana works in the tedana-comparison project/publication.
I suppose a good test of reliability across random seeds might be to come up with a simple analysis we want to run, like this beta comparison or a simple first-level model, and then to run the pipeline a bunch of times (maybe ~100?) and to look at the variability of the findings across the random seeds. Or, an even simpler approach might be to run the pipeline with the different seeds and then to just correlate the denoised maps to produce a correlation coefficient distribution. Are either of these more in line with what you were thinking?
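Something like the following sketch is what I have in mind for the seed-stability version. The `fixed_seed` argument, the output file name, and the echo inputs are assumptions about tedana's interface; adjust to whatever the installed version exposes.

```python
import itertools
import numpy as np
import nibabel as nib
from tedana.workflows import tedana_workflow

echo_files = ["echo-1.nii.gz", "echo-2.nii.gz", "echo-3.nii.gz"]  # hypothetical
echo_times = [14.5, 38.5, 62.5]                                   # hypothetical, ms

denoised = []
for seed in range(1, 11):  # scale up toward ~100 once runtime is known
    out_dir = f"tedana_seed-{seed:03d}"
    tedana_workflow(echo_files, echo_times, out_dir=out_dir, fixed_seed=seed)
    img = nib.load(f"{out_dir}/dn_ts_OC.nii.gz")  # assumed output name
    denoised.append(img.get_fdata().ravel())

# Pairwise correlations of the flattened denoised data across seeds.
corrs = [np.corrcoef(a, b)[0, 1]
         for a, b in itertools.combinations(denoised, 2)]
print(np.mean(corrs), np.std(corrs))
```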
@cjl2007 has done some incredible work on evaluating component selection in #181 that I would like to draw more attention to. I really think of this in two flavors. The first is comparing the time series of each component to the motion parameters. I think this is great, and I know we had discussed it in the past. This could be extended to compare the time series to polynomials as well (I see low-frequency components get marked as good every once in a while), or to a task model.
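As a sketch of what that first comparison could look like (file names are hypothetical, and the mixing matrix name may differ across tedana versions):

```python
import numpy as np

mixing = np.loadtxt("meica_mix.1D")      # time x components (name may differ)
motion = np.loadtxt("motion_params.1D")  # time x 6 motion regressors

n_vols = mixing.shape[0]
x = np.linspace(-1, 1, n_vols)
# Legendre polynomials of order 1-3 as simple drift regressors.
polys = np.column_stack([np.polynomial.legendre.Legendre.basis(d)(x)
                         for d in range(1, 4)])

for comp in range(mixing.shape[1]):
    ts = mixing[:, comp]
    max_motion_r = max(abs(np.corrcoef(ts, motion[:, j])[0, 1])
                       for j in range(motion.shape[1]))
    max_drift_r = max(abs(np.corrcoef(ts, polys[:, j])[0, 1])
                      for j in range(polys.shape[1]))
    print(comp, round(max_motion_r, 3), round(max_drift_r, 3))
```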
The second awesome bit is comparing standard maps of networks, CSF, WM, GM, and artifacts (not sure about that) to the spatial maps corresponding to the time series. There also exist reasonable maps of vasculature in MNI space that could be used here. I think there is also an argument to be made to include measures like mirror symmetry.
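And a sketch of the second, spatial comparison, assuming hypothetical file names for the component maps and binary tissue masks already on the same grid:

```python
import numpy as np
import nibabel as nib

comp_maps = nib.load("betas_OC.nii.gz").get_fdata()   # x, y, z, components
masks = {name: nib.load(f"{name}_mask.nii.gz").get_fdata() > 0
         for name in ("csf", "wm", "gm")}

n_comps = comp_maps.shape[-1]
for comp in range(n_comps):
    # Spatial correlation between the (absolute) component map and each mask.
    flat = np.abs(comp_maps[..., comp]).ravel()
    scores = {name: np.corrcoef(flat, mask.ravel().astype(float))[0, 1]
              for name, mask in masks.items()}
    print(comp, {k: round(v, 3) for k, v in scores.items()})
```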
Now I know the whole idea is to avoid these sorts of approaches for finding good and bad components, but it is clear that the ICA approach hasn't solved the signal vs. noise problem. I think something like this would help us get a handle on what is good, bad, and ugly before cutting the decision tree down, and would also help evaluate denoising success, in addition to SNR, model fits, seed-based connectivity, etc.
Ideally, we could add a utility (tedXtra? tedana++?) that would take the output folder and a variable number of arguments (--motion, --csf, --veins, etc) and then a) add these to the comp table and b) produce figures that attempt to be as informative as those @cjl2007 made in #181 .
I have just been taking notes on this, and got excited when I saw that someone had done all the hard work and figure creation that (I think) justifies it.
@dowdlelt I totally agree that @cjl2007's work will be hugely beneficial to tedana in multiple ways. We can supplement our current metrics with the ones supplied by @cjl2007. Some are quite similar to AROMA, which is awesome. I spent some time trying to figure out how best to incorporate AROMA metrics into tedana in #217 and on my aroma branch, but ran into some difficulties. Given that these new metrics seem to work well, perhaps we should go straight to them instead.
However, even before we try incorporating new metrics into tedana, I think it would be a great idea to use them to evaluate our current decision tree. To do that, we'll need to run these metrics after tedana so we can see which steps are consistently problematic. What if we translated the MATLAB code as a separate wrapper in a separate repository for now? That way we can run the two workflows sequentially, the way @cjl2007 is doing now.
One difficult thing is that I think we need to use/estimate native space versions of everything, since tedana should be run on native space data. AROMA works by automatically grabbing transform files in the FEAT directory to convert the native space data to standard space. We can't do the same thing within tedana, but we can do it in conjunction with afni_proc and fMRIPrep for the reliability and/or validation analyses.
@tsalo, 100% agreement! Perhaps I wasn't clear in my excitement, but I am all about first using them to examine the current decision tree.
I'm still unclear on why tedana requires native space data, as long as every echo is treated identically (as with motion correction being estimated from a single echo) - are there still spatial metrics that assume minimal transformations? In any case, it seems that it would be much easier, for this testing, to warp the output maps to native space, rather than all the input echoes.
hi @tsalo, @dowdlelt - Thank you for the positive feedback. I would be happy to share the code that I have written so far. I just need a few days, I think, to clean up the Matlab code a bit, so that it will be easier for others to read/use. I will plan to send it (along with some example data it can run on) to one of you via a secure file transfer. Just need the best email address to send it to!
Re: the issue of needing to have all the data in native space, to clarify in case it wasn't clear in my earlier messages... all the spatial templates I am currently using are in native image space. For example, the network templates were originally in FreeSurfer fs_LR space, which can be brought into native functional space via a series of inverse transforms (i.e., fs_LR --> anatomical --> native functional). The artifact templates are created using the subject's anatomical images. Of course, atlas space artifact templates (like AROMA uses) warped into native image space would probably work well too.
@cjl2007 have you had a chance to clean up the MATLAB code? Also, would you be willing to post it to a GitHub repository or would you prefer to send the files directly?
@tsalo - Yes, and uploading to GitHub sounds like a good idea.
@tsalo - I think I was able to send you a link to the code I uploaded. Let me know if you have any questions ... I did not have time to test the .sh file, which includes some steps for generating the different masks/templates the .m code uses... so I would not be surprised if there is a bug or two (or three). The .m file though will give you a general sense of the steps I was taking to tweak the component classifications, which is probably of most interest.
I'm a little bit out of the loop on how things compare between now and when these original comments were made, can @handwerkerd @emdupre or @tsalo comment on that? Just really briefly, like a sentence, so I know how this issue fits in with the others.
It's been a long time for me, but I believe that this thread moved into a discussion of post-tedana component classification tools (namely the one by @cjl2007) that could be used to evaluate tedana's classification performance in the validation analysis. The discussion this issue turned into is similar in nature to #217, @smoia's interest in post-tedana workflows, and the proposed validation analysis (https://github.com/ME-ICA/tedana-comparison).
However, I think both the resulting thread and the original issue have been addressed fairly well, and we can close this. We have a repository for the validation and reliability analyses, although neither one has been worked on much. We also have worked to make tedana's outputs as BIDS Derivatives-compatible as possible. This should allow users to apply additional component classification algorithms (e.g., AROMA or post-tedana workflows) to the same decomposition as tedana very easily.
Should we close this?
Probably; most of the discussion here seems superseded by later issues or discussions.
I've just had lunch at SFN with @handwerkerd (👋 hi from San Diego 🌞).
One of the goals we would both like to meet (along with, I suspect, all tedana developers) is a denoising algorithm that we can trust. In our developer call last week we decided to focus on the MEICA v2 selection criteria, which is simpler than the v3 criteria.
One of the questions that neither of us had an answer to is: how long will it take to get to a point that tedana runs denoising in a reproducible and reliable way? The context for this question is thinking about the long term planning of the project ⏰
@emdupre, @tsalo & @rmarkello - you know the implementation of the algorithm in tedana best - what's your estimation of the person hours required to get to this point? Is there anything else that we'd need to achieve the goal?