LeonieBorne opened this issue 4 years ago
I will also start looking at some data. One idea I had is the ABIDE dataset. It is an autism dataset but comes with some behavioural and questionnaire data (as far as I know; I will check).
The ABIDE dataset would be great! But we have to check that it is open access so that there is no ethics problem with using it here.
Before finding the perfect dataset, I think we can use a dummy one to start writing the tutorials, like the one they are using here:
import numpy as np

n = 500
# 2 latent variables shared by both datasets:
l1 = np.random.normal(size=n)
l2 = np.random.normal(size=n)
latents = np.array([l1, l1, l2, l2]).T
# Each observed dataset is the latent structure plus independent noise:
X = latents + np.random.normal(size=4 * n).reshape((n, 4))
Y = latents + np.random.normal(size=4 * n).reshape((n, 4))
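To make the dummy example end-to-end, here is a minimal sketch of fitting a CCA on these simulated matrices with scikit-learn (assuming sklearn is available; choosing two components simply matches the two latent variables above):

from sklearn.cross_decomposition import CCA

# Fit a two-component CCA and project both views onto the canonical space
cca = CCA(n_components=2)
cca.fit(X, Y)
X_c, Y_c = cca.transform(X, Y)

# The first canonical pair should recover the shared latent structure,
# so this correlation should be high
print(np.corrcoef(X_c[:, 0], Y_c[:, 0])[0, 1])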
Here are some suggestions from the Brainhack community for real datasets:
However, as @htwangtw points out, we should probably use a simulated dataset to get started faster, as the OHBM Hackathon only lasts 3 days!
One thing to keep in mind: some neuroimaging datasets can have different licenses for the imaging data and the phenotype data. Most imaging data are publicly accessible, but restricted licenses often apply to the phenotype information. Releasing them in example tutorials might not be the quickest thing to do for Brainhack. At the EMEA hub, we propose to resolve this issue and maximize people's experience at Brainhack in the following manner:

- Use neuroimaging data from nilearn to write about data processing (see the fetching sketch after this list)
- Use simulated data to write the algorithm implementation. I don't mind if people just take the code from here: https://github.com/htwangtw/cca_primer/blob/master/cca_notebook.ipynb

I will comment on the individual issues with details for the relevant parts.
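As an illustration of that first point, here is a minimal sketch of pulling imaging plus phenotype data through nilearn's ABIDE fetcher (fetch_abide_pcp is a real nilearn function, but the exact fields returned and the subject count chosen here are assumptions and may vary across nilearn versions):

from nilearn import datasets

# Download a handful of preprocessed ABIDE subjects with
# Harvard-Oxford ROI time series as the derivative
abide = datasets.fetch_abide_pcp(n_subjects=5, derivatives=['rois_ho'])

# Phenotypic information comes bundled alongside the imaging derivatives
print(abide.phenotypic[:3])
print(len(abide.rois_ho))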
Hi,
I wanted to refresh my Python skills and understand the simulated data a bit, so I ended up writing a function that generates two datasets with common latent variables (based on Hao-Ting's tutorial from her previous message). It can also simulate bigger datasets with more variables per dimension and more latent variables (perhaps useless, but why not). I haven't really tested it thoroughly, but if anybody wants to play with it, here it is (I still need to figure out how to fork and push, so I leave a link):
https://github.com/diiobo/random/blob/master/Generate%20simulated%20data.ipynb
That's cool! Good idea to write a function so that the number of observations, variables and components can be chosen flexibly. I think it doesn't provide the stated hidden structure (X: [l1, l1, l1, l2, l2, l2, (l2)] and Y: [l1, l2, l1, l2, l1, l2, (l2)]), though, but I would need to look further into that. It would also be cool to have a function which lets you flexibly choose the latent structure!
@nadinespy It works for me if the numbers of (observed) variables are multiples of the number of latent variables (lx and ly are the latent structures; X and Y are the datasets with noise added to make the observed variables). Though indeed it starts to look weird when they are not multiples, because then I just replicated the last column to fill in the matrix until the corresponding number of variables is reached, and even more so if the remainder is >2! But I also wonder whether the underlying structure really matters, or whether one could just randomly assign the positions of the latent variables in the datasets (?)
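Picking up the wish for a flexibly chosen latent structure, here is a minimal sketch (the function name simulate_pair and its interface are hypothetical, not taken from the linked notebook): each observed column is assigned an explicit latent index, so any structure, including non-multiples, can be specified directly.

import numpy as np

def simulate_pair(x_structure, y_structure, n=500, noise=1.0, seed=None):
    """Simulate two datasets sharing latent variables.

    x_structure / y_structure are lists of latent indices, one per
    observed column; e.g. [0, 0, 1, 1] reproduces the toy example above.
    """
    rng = np.random.default_rng(seed)
    n_latents = max(max(x_structure), max(y_structure)) + 1
    latents = rng.standard_normal((n, n_latents))
    # Each observed column copies its assigned latent, plus independent noise
    X = latents[:, x_structure] + noise * rng.standard_normal((n, len(x_structure)))
    Y = latents[:, y_structure] + noise * rng.standard_normal((n, len(y_structure)))
    return X, Y

# The structure discussed above: X: [l1, l1, l1, l2, l2, l2], Y: [l1, l2, l1, l2, l1, l2]
X, Y = simulate_pair([0, 0, 0, 1, 1, 1], [0, 1, 0, 1, 0, 1], seed=42)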
I created an IPython notebook where I simulate three different models, run a CCA on each and plot it (waiting for the pull request to be accepted).
In order to write the different tutorials, we need open-access databases to play with. Feel free to suggest ideas here, or to start looking for one on OpenNeuro!