Explain every entry in the metadata file.

moorepants commented 9 years ago

This needs to be either in the GATK docs or in the paper or with the data.

moorepants commented 9 years ago

This is in the paper, but needs to be reviewed for completeness.

moorepants commented 9 years ago

[x] Make sure all of the files listed in the meta data are correct, especially the mapping to the compensation files.
[x] Add Cortex versions to the meta data files.

moorepants commented 9 years ago

@spinningplates @tvdbogert

I'm about to push the data to Zenodo and just went through all of the meta data in detail. I fixed a bunch of errors, but would you all mind looking through the data too to see if you notice any oddities?

You can view the tables of data indexed by trial number here:

http://nbviewer.ipython.org/github/moorepants/walking-sys-id/blob/meta-data/notebooks/meta_data_check.ipynb

tvdbogert commented 9 years ago

I did not see any obvious errors, but have a couple of comments:

The Table with the test conditions for each trial would benefit from having a subject ID number. Without that, it's quite a puzzle to find the three tests for each subject. You have to go back and forth between the Tables and use age, mass, height etc. as clues to the subject identity. It's all in the database, so you can write code to do this, but a human-readable table might be useful. Or you write the code to generate that Table?
If you introduce subject ID numbers, you no longer have to duplicate the subject characteristics for each of that subject's trials. There can be a separate (and much shorter) table with subject characteristics. And of course, you can keep the database as it is and write code to generate that table.
We should probably not include trials that were not part of the actual study with the 3 speeds and the perturbation protocol. Unless (again...) you write code to extract a list of the relevant trials. It is kind of neat to give everything you have, but not if that makes it hard to find the data that is likely to be useful.

Ton


moorepants commented 9 years ago

The meta data is stored in a single file per trial (e.g., https://gist.github.com/moorepants/6bbc495128b181393023) and is located in that trial's directory. I did it this way, instead of using a proper database, to simplify things because no one in the lab seemed interested in using a real database to manage this. Thus, there is redundant "study" and "subject" data in each meta data file so that all the meta data for one trial is with the data files for that trial. The function generate_meta_data_tables() simply scrapes the directory of trials for meta data files and recursively parses them to construct all of the singleton tables (ones without nested structure), which are akin to single tables in a relational database. These tables are stored in DataFrame objects, which are designed to allow easy reduction, grouping, joining, etc. With those tables, only a few lines of code are needed to form any table you like. Line 10 in the link shows an example of merging some data from two tables. If you specify what you'd like to see in a table, I can generate it for you. What you see is simply a raw parsed version so that you can visually look at all the data at one time on the screen. I will generate some simplified tables to go in the paper and the source code will be shipped along with the paper source.
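To make the scraping idea concrete, here is a minimal sketch of the approach described above, not the actual generate_meta_data_tables() implementation. It assumes each trial directory holds a JSON file named meta.json with nested "trial" and "subject" sections; the file name, the file format, the keys, and the helper names flatten() and collect_singleton_tables() are all hypothetical. It also uses plain dictionaries instead of DataFrame objects to stay dependency-free.

```python
import json
import tempfile
from pathlib import Path


def flatten(meta, prefix, tables, trial_id):
    """Recursively split a nested dict into flat rows, one table per nesting path."""
    row = {}
    for key, value in meta.items():
        if isinstance(value, dict):
            # A nested mapping becomes its own table, named by its path.
            name = f"{prefix}.{key}" if prefix else key
            flatten(value, name, tables, trial_id)
        else:
            row[key] = value
    if row:
        # Every table is indexed by trial number.
        tables.setdefault(prefix or "root", {})[trial_id] = row


def collect_singleton_tables(root):
    """Scrape a directory tree for meta.json files (hypothetical name)."""
    tables = {}
    for path in sorted(Path(root).rglob("meta.json")):
        meta = json.loads(path.read_text())
        flatten(meta, "", tables, meta["trial"]["id"])
    return tables


# Build two placeholder trial directories (made-up values) and parse them.
with tempfile.TemporaryDirectory() as root:
    for trial_id, speed in [(1, 0.8), (2, 1.2)]:
        trial_dir = Path(root) / f"T{trial_id:03d}"
        trial_dir.mkdir()
        meta = {
            "trial": {"id": trial_id, "nominal-speed": speed},
            "subject": {"id": 5, "mass": 70.0},
        }
        (trial_dir / "meta.json").write_text(json.dumps(meta))
    tables = collect_singleton_tables(root)

# Merge the "trial" and "subject" tables on the shared trial-number index.
merged = {
    trial_id: {
        **tables["trial"][trial_id],
        **{f"subject-{k}": v for k, v in tables["subject"][trial_id].items()},
    }
    for trial_id in tables["trial"]
}
print(merged)
```

Because every table shares the trial number as its index, joining any two of them is a lookup on that key, which is the same operation a DataFrame merge would perform.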

I'd like to include all the trials we measured because they include potentially useful data. The code already exists that allows you to query trial numbers from the data I have. I could write some code to store the data in an HDF5 or sqlite database file and then the database can be queried with libraries that already exist instead of me writing custom bits for scraping a directory tree.
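As a rough illustration of the sqlite option mentioned above, the following sketch stores subject and trial meta data in two tables and queries trial numbers with the standard sqlite3 module. The schema, column names, and all values are assumptions made up for the example; they are not taken from the actual meta data files.

```python
import sqlite3

# In-memory database; a file path would be used for a shippable database.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE subject (
        id INTEGER PRIMARY KEY, gender TEXT, age INTEGER, mass REAL, height REAL
    );
    CREATE TABLE trial (
        id INTEGER PRIMARY KEY,
        subject_id INTEGER REFERENCES subject(id),
        nominal_speed REAL
    );
    """
)

# Placeholder rows (made-up values, not real study data).
conn.executemany(
    "INSERT INTO subject VALUES (?, ?, ?, ?, ?)",
    [(1, "male", 25, 70.0, 1.75), (2, "female", 30, 60.0, 1.65)],
)
conn.executemany(
    "INSERT INTO trial VALUES (?, ?, ?)",
    [(10, 1, 0.8), (11, 1, 1.2), (12, 1, 1.6), (13, 2, 0.8)],
)

# Query the trial numbers for one subject, ordered by walking speed.
rows = conn.execute(
    """
    SELECT trial.id, trial.nominal_speed
    FROM trial JOIN subject ON trial.subject_id = subject.id
    WHERE subject.id = ?
    ORDER BY trial.nominal_speed
    """,
    (1,),
).fetchall()
print(rows)
```

With this layout the subject characteristics are stored once per subject rather than once per trial, and the per-trial redundancy is recovered on demand by the join.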

tvdbogert commented 9 years ago

It's OK to have the extra trials as long as it is not a puzzle for the reader to put the complete perturbation study together. Ideally by just extracting the right files, rather than writing code to find them.

Perhaps just this Table to generate for the paper:

column 1: subject id number
columns 2-5: gender, age, mass, height
column 6: 0.8 m/s trial number
column 7: 1.2 m/s trial number
column 8: 1.6 m/s trial number

That presents a nice birds-eye view of the dataset and helps people find the right files without much trouble.
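The table layout proposed above could be generated along these lines. This is only a sketch in plain Python: the input structures, the field names, the helper overview_rows(), and all values are hypothetical placeholders, and a trial at a speed outside the three study speeds is included to show that it is simply left out of the overview.

```python
STUDY_SPEEDS = (0.8, 1.2, 1.6)  # m/s, the three speeds of the perturbation study

# Placeholder inputs (made-up values, not real study data).
subjects = {
    1: {"gender": "male", "age": 25, "mass": 70.0, "height": 1.75},
    2: {"gender": "female", "age": 30, "mass": 60.0, "height": 1.65},
}
trials = [
    {"id": 10, "subject_id": 1, "speed": 0.8},
    {"id": 11, "subject_id": 1, "speed": 1.2},
    {"id": 12, "subject_id": 1, "speed": 1.6},
    {"id": 13, "subject_id": 2, "speed": 0.8},
    {"id": 14, "subject_id": 2, "speed": 1.2},
    {"id": 15, "subject_id": 2, "speed": 1.0},  # extra trial, not a study speed
]


def overview_rows(subjects, trials):
    """Return one row per subject: id, characteristics, trial number per speed."""
    by_subject = {}
    for trial in trials:
        by_subject.setdefault(trial["subject_id"], {})[trial["speed"]] = trial["id"]
    rows = []
    for subject_id in sorted(subjects):
        s = subjects[subject_id]
        speeds = by_subject.get(subject_id, {})
        rows.append(
            (subject_id, s["gender"], s["age"], s["mass"], s["height"],
             # None marks a study speed with no recorded trial.
             *(speeds.get(v) for v in STUDY_SPEEDS))
        )
    return rows


rows = overview_rows(subjects, trials)
header = ("id", "gender", "age", "mass", "height", "0.8 m/s", "1.2 m/s", "1.6 m/s")
for line in (header, *rows):
    print("  ".join(str(item).ljust(8) for item in line))
```

Keeping the study speeds in one tuple means the extra trials stay in the data set while the overview table only lists the trials that belong to the perturbation study.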

Ton


moorepants commented 9 years ago

Ok, I'll generate that table.

csu-hmc / perturbed-data-paper

Explain every entry in the metadata file. #17