cortex-lab / alyx

Database for experimental neuroscience laboratories

new model: TimeSeries #98

Closed kdharris101 closed 7 years ago

kdharris101 commented 7 years ago

A timeseries document links to a file containing a time series, or multiple timeseries:

- file: dataset (a t x n array, where t is the number of timepoints and n is the number of traces)
- column_names: array field of base field, 1024 chars; length is equal to the number of traces
- description: text
- timestamps: array of datasets (of the new type, Timestamp); can be more than one
- experiment_id: link to actions table

rossant commented 7 years ago

I'm not exactly sure how to model this.

from django.db import models

class TimeSeries(BaseModel):
    file = models.ForeignKey(Dataset, on_delete=models.CASCADE,
                             help_text="txn array where t is number of timepoints "
                             "and n is number of traces")
    column_names = models.ListField()  # ??? no such built-in field in Django
    description = models.TextField(null=True, blank=True)
    timestamps = models.ManyToManyField(Timestamp)
    experiment = models.ForeignKey(BaseAction, on_delete=models.CASCADE)

Questions:

@nippoo @nsteinme

nsteinme commented 7 years ago

Looks like there is something called an ArrayField; will that work? https://docs.djangoproject.com/en/1.10/ref/contrib/postgres/fields/ Otherwise, yes, I think a comma-separated string is the way to go for simplicity.

The ForeignKey for experiment is to actions.models.Experiment

This is the link to the dataset and some metadata! I don't know whether I understand the last question.

rossant commented 7 years ago

Right, ArrayField seems like a good solution. I guess my last question was: generally speaking, how do you intend to use this? Why not just use a Dataset for example? What are your use cases exactly?

nsteinme commented 7 years ago

So, this would be used to store analog traces recorded by some device. For instance, we currently record multiple signals with a National Instruments data acquisition card, including: position of the wheel (i.e. behavioral manipulandum), photodiode reading (to synchronize visual stimulus presentation), strobes from cameras (to synchronize), etc.

Or any other set of analog traces that share a common timebase, for instance including electrophysiological data.

So it's a very general (and important) type of dataset.

Currently dataset doesn't have the fields for column names or timestamps, which of course we need to record. How would you recommend we record these?

rossant commented 7 years ago

OK I understand. I imagined you could save the column names directly in the dataset, on disk, rather than in the database, but that's perfectly fine.

nsteinme commented 7 years ago

Currently these datasets are in the form of matlab structs, which have one field for the data matrix, one field for the timestamps, and a cell array of channel names. But we want to do npy files for the new system. So perhaps we should pick a new standard filename for that kind of metadata? Like:

    myData.npy
    myData.timestamps.npy
    myData.columnNames.npy  % string with comma-separated column names

or

    myData.columnNames.json

@kdharris101 ?
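A minimal sketch of that layout, assuming numpy plus a JSON sidecar for the column names (the helper and filenames here are illustrative, not a settled convention):

```python
import json
import tempfile
from pathlib import Path

import numpy as np

def save_timeseries(folder, name, data, timestamps, column_names):
    """Write a t x n data matrix plus sidecar files (illustrative layout)."""
    np.save(folder / (name + ".npy"), data)
    np.save(folder / (name + ".timestamps.npy"), timestamps)
    (folder / (name + ".columnNames.json")).write_text(json.dumps(column_names))

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    data = np.zeros((100, 3))          # t=100 timepoints, n=3 traces
    ts = np.linspace(0.0, 1.0, 100)    # one shared timebase
    save_timeseries(folder, "myData", data, ts,
                    ["wheel", "photodiode", "camStrobe"])

    # Reading back: column names come from the JSON sidecar
    names = json.loads((folder / "myData.columnNames.json").read_text())
    print(names)
```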

kdharris101 commented 7 years ago

This actually raises an important general question: to what extent do we want metadata to be in the npy files, and to what extent in the database? There is no harm in duplicating, but in this case we need to decide which one is “master”. The way I did this before – when I used an SQL system as a postdoc, and before that working as a db programmer in a phone company – was that files are the master, and the database is a tool that lets you search files easily; you should be prepared to wipe and recreate the database from the files at any time. However, I realize this is not how things usually work in industry, where the database is usually the master.

One other point: comma-delimited text files are a disaster. Because all it takes is someone to type in a comma into a text field and everything is screwed. Tab-delimited is less likely to have this problem. But json is surely best.
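The failure mode is easy to demonstrate with a toy example (names invented here): an embedded comma corrupts a comma-separated round trip, while JSON escapes it.

```python
import json

names = ["wheel", "photodiode, raw", "camStrobe"]  # one name contains a comma

# Comma-separated: the embedded comma splits one name into two
csv_roundtrip = ",".join(names).split(",")
print(len(csv_roundtrip))  # 4 fields recovered from 3 names

# JSON: the structure survives intact
json_roundtrip = json.loads(json.dumps(names))
print(json_roundtrip == names)  # True
```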


nsteinme commented 7 years ago

Yes, you're right, let's go with json on the column names.


rossant commented 7 years ago

> Yes, you're right, let's go with json on the column names.

Or an ArrayField? I see that ArrayFields are already used in the code base.

nsteinme commented 7 years ago

I guess we want to have the column names also on disk, as per Kenneth's suggestion that this sort of thing be rebuildable, which seems sensible to me. So json for disk, ArrayField for the database, if that seems all right to you.


kdharris101 commented 7 years ago

Within the database, no reason not to have an ArrayField.

If we are talking about an external file I would go with either tab-delimited or json. We should try to be consistent in how we represent text/mixed-type data in external files. The advantage of tab-delimited is human readability. The advantage of json is flexibility. Thinking more about it I tend towards tab-delimited.
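As a toy illustration (not from the thread): tab-delimited round-trips cleanly and stays human-readable, provided no name contains a tab.

```python
names = ["wheel", "photoDiode", "piezoLickDetector"]

# Tab-delimited: readable on disk, safe unless a name contains a tab
line = "\t".join(names)
print(line)

recovered = line.split("\t")
print(recovered == names)  # True
```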

The more general question is what are we planning to represent in external files, that isn’t purely numerical?


rossant commented 7 years ago

What's the drawback of just saving the column names on disk and not in the database? Why do we need them in the database, do we ever need to do queries on these column names?

nsteinme commented 7 years ago

You might want to search for datasets with a column called "piezoLickDetector" for instance, if you want to find datasets that have that kind of data to analyze. That's probably a rare use case, but it just feels insufficient to me to have the database say "here's a 15 x 10000 matrix. No idea what it contains." Why not be able to label?
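With the column names stored in a postgres ArrayField, that search becomes a containment lookup; in Django it would look something like `TimeSeries.objects.filter(column_names__contains=["piezoLickDetector"])`, assuming the model sketched earlier in the thread. A plain-Python stand-in for that filter, with made-up records:

```python
# Toy stand-in for an ArrayField containment query (the real thing would be
# a Django lookup like column_names__contains=["piezoLickDetector"]).
datasets = [
    {"id": 1, "column_names": ["wheel", "photoDiode", "piezoLickDetector"]},
    {"id": 2, "column_names": ["wheel", "camSync"]},
]

def with_column(records, name):
    """Return records whose column_names array contains the given name."""
    return [r for r in records if name in r["column_names"]]

hits = with_column(datasets, "piezoLickDetector")
print([r["id"] for r in hits])  # [1]
```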

If it helps this discussion at all, here are examples of the contents of the four common kinds of structs that are saved by experiments in the lab. "block" is for behavioral experiments with ChoiceWorld and for Signals. "Timeline" is signals recorded by an NI-DAQ. "parameters" can be any kind of experiment's parameters. "Protocol" is an mpep thing. I think we're doing well though: there are some things that will be fine in EventSeries, some that are fine in TimeSeries, and the rest either match existing exp-metadata fields (like start_time) or will be json, so we should be set!

block = 

                     expType: 'ChoiceWorld'
                       trial: [1x229 struct]
       stimWindowUpdateTimes: [5656x1 double]
        stimWindowUpdateLags: [5656x1 double]
               startDateTime: 7.3668e+05
            startDateTimeStr: '18-Dec-2016 18:56:57'
                  parameters: [1x1 struct]
                   endStatus: 'quit'
        rewardDeliveredSizes: [155x1 double]
         rewardDeliveryTimes: [1x155 double]
                     rigName: 'zgood'
                      expRef: '2016-12-18_2_Cori'
          experimentInitTime: 0.2619
       experimentStartedTime: 0.2702
         experimentEndedTime: 1.1945e+03
       experimentCleanupTime: 1.1945e+03
                 endDateTime: 7.3668e+05
              endDateTimeStr: '18-Dec-2016 19:16:51'
          numCompletedTrials: 228
                    duration: 1.1943e+03
        inputSensorPositions: [372390x1 double]
    inputSensorPositionTimes: [372390x1 double]
             inputSensorGain: 10.0025
                  lickCounts: []
              lickCountTimes: []

>> block.trial(1)

ans = 

                        condition: [1x1 struct]
                 trialStartedTime: 0.2722
          intermissionStartedTime: 0.2732
       quiescenceWatchStartedTime: 0.2807
                      visCuePhase: [4.8067 4.7062]
         quiescenceWatchEndedTime: 9.6450
               quiescentEpochTime: 9.6460
            intermissionEndedTime: 9.6470
         onsetToneSoundPlayedTime: [9.6566 10.3816]
    stimulusBackgroundStartedTime: 9.6606
           stimulusCueStartedTime: 9.6616
           interactiveStartedTime: 10.3716
          interactiveZeroInputPos: -49108
          interactiveMovementTime: [10.3855 10.3909 10.4041 10.4206 10.4372 10.4535 10.4703]
        inputThresholdCrossedTime: 10.4767
             interactiveEndedTime: 10.4835
                 responseMadeTime: 10.4871
              feedbackStartedTime: 10.4881
                     feedbackType: 1
      feedbackPositiveStartedTime: 10.4892
                   responseMadeID: 1
          inputThresholdCrossedID: 1
        feedbackPositiveEndedTime: 11.4735
                feedbackEndedTime: 11.4746
             stimulusCueEndedTime: 11.4755
      stimulusBackgroundEndedTime: 11.4765
                   trialEndedTime: 11.4774
      feedbackNegativeStartedTime: []
       negFeedbackSoundPlayedTime: []
        feedbackNegativeEndedTime: []

>> parameters

parameters = 

                            experimentFun: @UNKNOWN Function
                 experimentFunDescription: 'Function to create the experiment, takes 2 arguments: the pa...'
                                     type: 'ChoiceWorld'
                             rewardVolume: 2.8000
                        rewardVolumeUnits: 'µl'
                  rewardVolumeDescription: 'Reward volumn delivered on each correct trial'
                        onsetVisStimDelay: 0
                   onsetVisStimDelayUnits: 's'
             onsetVisStimDelayDescription: 'Duration between the start of the onset tone and visual stim...'
                        onsetToneDuration: 0.1000
                   onsetToneDurationUnits: 's'
             onsetToneDurationDescription: 'Duration of the onset tone'
                    onsetToneRampDuration: 0.0100
               onsetToneRampDurationUnits: 's'
         onsetToneRampDurationDescription: 'Duration of the onset tone amplitude ramp (up and down each ...'
                   preStimQuiescentPeriod: [2x1 double]
              preStimQuiescentPeriodUnits: 's'
        preStimQuiescentPeriodDescription: 'Required period of no input before stimulus presentation'
                               bgCueDelay: 0
                          bgCueDelayUnits: 's'
                    bgCueDelayDescription: 'Delay period between target column presentation and grating cue'
                      cueInteractiveDelay: [2x1 double]
                 cueInteractiveDelayUnits: 's'
           cueInteractiveDelayDescription: 'Delay period between grating cue presentation and interactiv...'
                           responseWindow: 1.5000
                      responseWindowUnits: 's'
                responseWindowDescription: 'Duration of window allowed for making a response'
                   positiveFeedbackPeriod: 1
              positiveFeedbackPeriodUnits: 's'
        positiveFeedbackPeriodDescription: 'Duration of positive feedback phase (with stimulus locked in...'
                   negativeFeedbackPeriod: 1
              negativeFeedbackPeriodUnits: 's'
        negativeFeedbackPeriodDescription: 'Duration of negative feedback phase (with stimulus locked in...'
 [etc]

>> Timeline

Timeline = 

                       expRef: '2016-12-18_1_Cori'
                    savePaths: {2x1 cell}
                    isRunning: 0
                           hw: [1x1 struct]
                   rawDAQData: [9392500x19 double]
            rawDAQSampleCount: 9392500
                       datFID: 3
                startDateTime: 7.3668e+05
             startDateTimeStr: '18-Dec-2016 18:56:11'
               nextChronoSign: -1
                lastTimestamp: 3.7570e+03
         lastClockSentSysTime: 5.3788e+06
    currSysTimeTimelineOffset: 5.3750e+06
                    figHandle: []
             rawDAQTimestamps: [1x9392500 double]

>> Timeline.hw

ans = 

                    daqVendor: 'ni'
                    daqDevice: 'Dev1'
                daqSampleRate: 2500
          daqSamplesPerNotify: []
        chronoOutDaqChannelID: 'port0/line1'
          acqLiveDaqChannelID: 'port0/line8'
               useClockOutput: 1
         clockOutputChannelID: 'ctr1'
         clockOutputFrequency: 70
         clockOutputDutyCycle: 0.1000
      clockOutputInitialDelay: 0.5000
                 camSyncPulse: 1
    camSyncPulsePauseDuration: 0.2000
          camSyncDaqChannelID: 'port0/line3'
                    stopDelay: 2
                    makePlots: 1
                  figPosition: [50 50 1700 900]
                    figScales: [1 0.5000 3 1 1 1 10 1 1 10 1 8 1 1 1 1 1 1 1]
                  recordAudio: 0
               audioRecDevice: 1
                   audioRecFs: 192000
                     writeDat: 1
                     dataType: 'double'
             samplingInterval: 4.0000e-04
                       inputs: [1x19 struct]
            arrayChronoColumn: 1

>> columnLabels = {Timeline.hw.inputs.name}

columnLabels = 

  Columns 1 through 6

    'chrono'    'photoDiode'    'rotaryEncoder'    'eyeCameraStrobe'    'waveOutput'    'openChan1'

  Columns 7 through 12

    'piezoLickDetector'    'openChan2'    'camSync'    'whiskCamStrobe'    'rewardEcho'    'audioMonitor'

  Columns 13 through 17

    'faceCamStrobe'    'blueLEDmonitor'    'purpleLEDmonitor'    'pcoExposure'    'acqLive'

  Columns 18 through 19

    'tlExposeClock'    'stimScreen'

>> Protocol

Protocol = 

            xfile: 'stimGratingAndLaserCommands.x'
            adapt: [1x1 struct]
            nstim: []
    npfilestimuli: 28
            npars: 28
             pars: [28x28 double]
         parnames: {28x1 cell}
          pardefs: {28x1 cell}
           animal: 'Noam'
          iseries: '2016-12-11'
             iexp: 5
         nrepeats: 20
          seqnums: [28x20 double]

rossant commented 7 years ago

Should be done. See https://github.com/cortex-lab/alyx/commit/166b7ceec564995a367ca67bbc38e04c355695cf#diff-488537eccebb33b949b0a1235628c053R156 if you want to double-check.

nsteinme commented 7 years ago

When I was writing SQL queries I noticed what appears to be a convention: fields whose values are UUIDs have "_id" at the end of the field name. Did you notice that? Is it true? If so, let's probably go with file_id, experiment_id, etc.


kdharris101 commented 7 years ago

Yes, I believe that was the convention.

rossant commented 7 years ago

I think Django automatically appends _id to the SQL column names, but this suffix should not appear in the Python models. See https://docs.djangoproject.com/en/1.10/ref/models/fields/#database-representation

nsteinme commented 7 years ago

Got it, in that case looks good as far as I can see.
