LeonieBorne opened this issue 4 years ago
I will also start looking at some data. One idea I had is the ABIDE dataset. It is an autism dataset but comes with some behavioural and questionnaire data (as far as I know; I will check).
The ABIDE dataset would be great! But we have to check that it is open access so that there is no ethics problem with using it here.
Before finding the perfect dataset, I think we can use a dummy one to start writing the tutorials, like the one they are using here:
import numpy as np

n = 500
# 2 latent variables shared by both datasets:
l1 = np.random.normal(size=n)
l2 = np.random.normal(size=n)
latents = np.array([l1, l1, l2, l2]).T
# Each observed dataset is the latent structure plus independent noise:
X = latents + np.random.normal(size=4 * n).reshape((n, 4))
Y = latents + np.random.normal(size=4 * n).reshape((n, 4))
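To make the dummy example end-to-end, here is a minimal sketch of fitting a CCA on these simulated matrices with scikit-learn (assuming sklearn is available; choosing two components simply matches the two latent variables above):

from sklearn.cross_decomposition import CCA

# Fit a two-component CCA and project both views onto the canonical space
cca = CCA(n_components=2)
cca.fit(X, Y)
X_c, Y_c = cca.transform(X, Y)

# The first canonical pair should recover the shared latent structure,
# so this correlation should be high
print(np.corrcoef(X_c[:, 0], Y_c[:, 0])[0, 1])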
Here are some suggestions from the Brainhack community for real datasets:
However, as @htwangtw points out, we should probably use a simulated dataset to get started faster, as the OHBM Hackathon only lasts 3 days!
One thing to keep in mind: some neuroimaging datasets can have different licenses for the imaging data and the phenotype data. Most imaging data are publicly accessible, but restricted licenses often apply to the phenotype information. Releasing them in example tutorials might not be the quickest thing to do for Brainhack. At the EMEA hub, we propose to resolve this issue and maximize people's experience at Brainhack in the following manner:

- Use neuroimaging data from nilearn to write about data processing (see the fetching sketch after this list)
- Use simulated data to write the algorithm implementation. I don't mind if people just take the code from here: https://github.com/htwangtw/cca_primer/blob/master/cca_notebook.ipynb

I will comment on the individual issues with details for the relevant parts.
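As an illustration of that first point, here is a minimal sketch of pulling imaging plus phenotype data through nilearn's ABIDE fetcher (fetch_abide_pcp is a real nilearn function, but the exact fields returned and the subject count chosen here are assumptions and may vary across nilearn versions):

from nilearn import datasets

# Download a handful of preprocessed ABIDE subjects with
# Harvard-Oxford ROI time series as the derivative
abide = datasets.fetch_abide_pcp(n_subjects=5, derivatives=['rois_ho'])

# Phenotypic information comes bundled alongside the imaging derivatives
print(abide.phenotypic[:3])
print(len(abide.rois_ho))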
Hi,
I wanted to refresh my Python skills and understand the simulated data a bit, so I ended up writing a function that generates two datasets with common latent variables (based on Hao-Ting's tutorial from her previous message). It can also simulate bigger datasets with more variables per dimension and more latent variables (perhaps useless, but why not). I haven't really tested it thoroughly, but if anybody wants to play with it, here it is (I still need to figure out how to fork and push, so I leave a link):
https://github.com/diiobo/random/blob/master/Generate%20simulated%20data.ipynb
That's cool! Good idea to write a function so that the number of observations, variables and components can be chosen flexibly. I think it doesn't provide the stated hidden structure (X: [l1, l1, l1, l2, l2, l2, (l2)] and Y: [l1, l2, l1, l2, l1, l2, (l2)]), though, but I would need to look further into that. It would also be cool to have a function which lets you flexibly choose the latent structure!
@nadinespy It works for me if the numbers of (observed) variables are multiples of the number of latent variables (lx and ly are the latent structures; X and Y are the datasets with noise added to make the observed variables). Though indeed it starts to look weird when they are not multiples, because then I just replicated the last column to fill in the matrix until the corresponding number of variables is reached, and even more so if the remainder is >2! But I also wonder whether the underlying structure really matters, or whether one could just randomly assign the positions of the latent variables in the datasets (?)
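Picking up the wish for a flexibly chosen latent structure, here is a minimal sketch (the function name simulate_pair and its interface are hypothetical, not taken from the linked notebook): each observed column is assigned an explicit latent index, so any structure, including non-multiples, can be specified directly.

import numpy as np

def simulate_pair(x_structure, y_structure, n=500, noise=1.0, seed=None):
    """Simulate two datasets sharing latent variables.

    x_structure / y_structure are lists of latent indices, one per
    observed column; e.g. [0, 0, 1, 1] reproduces the toy example above.
    """
    rng = np.random.default_rng(seed)
    n_latents = max(max(x_structure), max(y_structure)) + 1
    latents = rng.standard_normal((n, n_latents))
    # Each observed column copies its assigned latent, plus independent noise
    X = latents[:, x_structure] + noise * rng.standard_normal((n, len(x_structure)))
    Y = latents[:, y_structure] + noise * rng.standard_normal((n, len(y_structure)))
    return X, Y

# The structure discussed above: X: [l1, l1, l1, l2, l2, l2], Y: [l1, l2, l1, l2, l1, l2]
X, Y = simulate_pair([0, 0, 0, 1, 1, 1], [0, 1, 0, 1, 0, 1], seed=42)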
I created an IPython notebook where I simulate three different models, run a CCA on each and plot it (waiting for the pull request to be accepted).
In order to write the different tutorials, we need open-access databases to play with. Feel free to suggest ideas here, or to start looking for one on OpenNeuro!