codeneuro / spikefinder-datasets

dataset info for the spikefinder challenge

data format #1

freeman-lab opened this issue 8 years ago

freeman-lab commented 8 years ago

Opening a discussion for how to format both the input data and the results / submissions.

According to @philippberens, the raw data will be calcium fluorescence and spike rates, both sampled at 100 Hz.

formatting the raw data

The raw data are basically just time series, continuous-valued (for fluorescence) and possibly sparse (for spike rates). The key thing here is that the format is generic and easy to load in multiple environments. I kinda prefer csv files for simplicity, so long as they don't get too large. Then, for each dataset, we provide either a single csv file or two csv files, depending on whether it's training or testing. And we include example scripts in this repo for loading in Python, Matlab, and any other language.

training / testing

How many datasets / neurons do we have? If it's less than 10-20, it might be easiest to just treat each neuron as a separate "dataset", and pair them up so we have e.g. 00.00 and 00.00.test, then 00.01, 01.00, etc, where the first number is the source lab and the second number is the neuron.

formatting the results

Using JSON here is useful because it can easily be read and written in multiple environments (for comparison to ground truth), and is easily handled for web submissions. It's been successful so far in neurofinder for representing spatial regions.

The results are likely to be sparse in time, so one option would be a structure like this

[
  {
    "dataset": "00.00.test",
    "time": [0, 10, 14, 100, ...],
    "rate": [1, 2, 1, 1.5, ...]
  },
...
]

For each dataset we basically have a sparse array, storing the times of all detected events and the corresponding numerical values. For algorithms that return binary events, we could assume that if no rate is specified all values are 1.
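For concreteness, a minimal sketch of writing and reading such a structure from Python with the standard json module (the file name and values are just placeholders):

```python
import json

# hypothetical results for one test dataset: event times and their rates
results = [
    {
        "dataset": "00.00.test",
        "time": [0, 10, 14, 100],
        "rate": [1, 2, 1, 1.5],
    },
]

# write a submission file
with open("results.json", "w") as f:
    json.dump(results, f)

# read it back, e.g. for comparison against ground truth
with open("results.json") as f:
    loaded = json.load(f)
```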

philippberens commented 8 years ago

@freeman-lab, I just shared a dropbox folder with all the data with you.

The datasets are numbered as in Table 1 of Theis et al. 2016. For basic visualization, run the gettingstarted-notebook (which we prepared for a CRCNS submission of the training data).

I would go with the preprocessed versions, which are resampled to 100 Hz. I would rather store all data in dense arrays, as the calcium is dense anyway and the data isn't that large.

We have a total of 72 cells, some with multiple segments, for a total of 90 traces. Of these we allocated 32 to the test set. The allocation makes sure that multiple segments of the same cell end up entirely in either the training or the testing set, and that all previously shared data is in the training set.

The naming of the files is fine with me; I would use the first number to refer to the dataset as in Table 1 of Theis et al. 2016 and the second for the cell/trace.

The JSON file format is fine with me. Do I read the documentation correctly that you basically supply a dictionary and then use dump? (https://simplejson.readthedocs.io/en/latest/)

philippberens commented 8 years ago

Ah, one more thing. For some of the datasets we don't have exact spike times, so just providing spike times unfortunately is not an option.

freeman-lab commented 8 years ago

@philippberens great! I went through the datasets and example scripts, this is really, really well put together!

Given that the spike times won't be enough on their own, and that we'll use dense arrays, I now think CSV makes more sense for submissions than JSON. (My only preference for CSV over mat / pickle btw is that we can use one language agnostic format, instead of supporting two).

And given that the size is small, we can just put all cells for each dataset in a single file. Can definitely use the same numbering as Table 1 from the paper.

In that case the full set should look like

1.train.calcium.csv
1.train.spikes.csv
2.train.calcium.csv
2.train.spikes.csv
...
1.test.calcium.csv
2.test.calcium.csv
...

where each csv file has the cell numbers (for that dataset) as its header, and the rows are the values at each time point (calcium or spikes, depending on the file), e.g.

1, 2, 3
3.123, 0.151, 8.123
1.972, 0.195, 8.519
1.412, 5.012, 4.123
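
For illustration, a minimal sketch of loading one such file in Python with pandas (the file and column names here just follow the scheme above and are not final):

```python
import pandas as pd

# columns are cell numbers, rows are samples at 100 Hz;
# shorter traces are padded with empty fields, which read back as NaN
calcium = pd.read_csv("1.train.calcium.csv")
spikes = pd.read_csv("1.train.spikes.csv")

# fluorescence trace and spike counts for the first cell of dataset 1
trace = calcium["1"].dropna().values
rate = spikes["1"].dropna().values
```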

For submissions, we can just require that people make five spikes.csv files, one for each testing dataset, and submit all five at once. So a complete submission would basically be

1.test.spikes.csv
2.test.spikes.csv
3.test.spikes.csv
4.test.spikes.csv
5.test.spikes.csv

philippberens commented 8 years ago

Fine with me - the columns will then have different lengths, but I guess that doesn't matter. Do you want to convert the data or should I, @freeman-lab?
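
For what it's worth, a minimal sketch of how cells with different trace lengths could be written into one csv with pandas (names and lengths are placeholders); the shorter columns simply get padded:

```python
import numpy as np
import pandas as pd

# hypothetical traces of unequal length for two cells of one dataset
traces = {"1": np.random.rand(500), "2": np.random.rand(450)}

# building the frame from Series aligns on the index, so shorter
# columns are padded with NaN (empty fields in the csv)
df = pd.DataFrame({name: pd.Series(values) for name, values in traces.items()})
df.to_csv("1.train.calcium.csv", index=False)
```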

freeman-lab commented 8 years ago

@philippberens I'm on it! Just about done actually, will post links here for you to download and check.

freeman-lab commented 8 years ago

@philippberens ok, want to grab this and make sure it looks right?

https://s3.amazonaws.com/neuro.datasets/challenges/spikefinder/spikefinder.train.zip

While doing this I updated README.md and added an example.py script for loading the data in Python, which should be included in the zip and is also in this repository. I confirmed that loading and plotting the first couple of neurons from dataset 1 looked identical to what I got running your notebook.

I also have a spikefinder.test with just the calcium test data, and a spikefinder.test.private with the spike test data, which we'll of course keep private!

philippberens commented 8 years ago

I agree, I checked a couple examples as well, looks good.