csu-hmc / perturbed-data-paper

A paper on an elaborate gait data set.
https://peerj.com/articles/918/

Explain every entry in the metadata file. #17

Closed moorepants closed 9 years ago

moorepants commented 9 years ago

This needs to be either in the GaitAnalysisToolKit docs or in the paper or with the data.

moorepants commented 9 years ago

This is in the paper, but needs to be reviewed for completeness.

moorepants commented 9 years ago

@spinningplates @tvdbogert

I'm about to push the data to Zenodo and just went through all of the meta data in detail. I fixed a bunch of errors, but would you all mind looking through the data too to see if you notice any oddities?

You can view the tables of data indexed by trial number here:

http://nbviewer.ipython.org/github/moorepants/walking-sys-id/blob/meta-data/notebooks/meta_data_check.ipynb

tvdbogert commented 9 years ago

I did not see any obvious errors, but have a couple of comments:

Ton

moorepants commented 9 years ago

The meta data is stored in a single file per trial (e.g., https://gist.github.com/moorepants/6bbc495128b181393023) and is located in that trial's directory. I did it this way, instead of using a proper database, to simplify things because no one in the lab seemed interested in using a real database to manage this. Thus, there is redundant "study" and "subject" data in each meta data file so that all the meta data for one trial is with the data files for that trial. The function generate_meta_data_tables() simply scrapes the directory of trials for meta data files and recursively parses them to construct all of the singleton tables (ones without nested structure), which are akin to single tables in a relational database. These tables are stored in DataFrame objects, which are designed to allow easy reduction, grouping, joining, etc. With those tables, only a few lines of code are needed to form any table you like. Line 10 in the link shows an example of merging some data from two tables. If you specify what you'd like to see in a table, I can generate it for you. What you see is simply a raw parsed version so that you can look at all the data at one time on the screen. I will generate some simplified tables to go in the paper and the source code will be shipped along with the paper source.
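To make the idea concrete, here is a minimal sketch of splitting nested per-trial meta data into flat tables and joining two of them on the trial id. It is not the actual generate_meta_data_tables() implementation: the function names, the key names, and the values are all made up for illustration, it uses plain dictionaries instead of DataFrame objects, and it only flattens one level of nesting.

```python
def split_singleton_tables(meta, trial_id):
    """Split one trial's nested meta data into flat rows, one per section."""
    rows = {}
    for section, content in meta.items():
        if isinstance(content, dict):
            # Keep only scalar entries; nested structures are skipped here.
            flat = {key: value for key, value in content.items()
                    if not isinstance(value, (dict, list))}
            rows[section] = {"trial_id": trial_id, **flat}
    return rows


def collect_tables(all_meta):
    """Build {section: [row, ...]} from {trial_id: nested meta data}."""
    tables = {}
    for trial_id, meta in sorted(all_meta.items()):
        for section, row in split_singleton_tables(meta, trial_id).items():
            tables.setdefault(section, []).append(row)
    return tables


def merge_on_trial(left, right):
    """Inner-join two tables (lists of dicts) on the shared trial_id key."""
    right_by_id = {row["trial_id"]: row for row in right}
    return [{**row, **right_by_id[row["trial_id"]]}
            for row in left if row["trial_id"] in right_by_id]


# Hypothetical meta data for two trials (all names and values are made up).
example_meta = {
    "001": {"study": {"name": "demo"},
            "subject": {"id": 1, "mass": 70.0},
            "trial": {"nominal-speed": 0.8, "files": {"mocap": "a.txt"}}},
    "002": {"study": {"name": "demo"},
            "subject": {"id": 1, "mass": 70.0},
            "trial": {"nominal-speed": 1.2, "files": {"mocap": "b.txt"}}},
}

tables = collect_tables(example_meta)
merged = merge_on_trial(tables["subject"], tables["trial"])
for row in merged:
    print(row)
```

Note how the "subject" rows repeat across trials, which mirrors the redundancy that comes from keeping one self-contained meta data file per trial.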

I'd like to include all the trials we measured because they include potentially useful data. The code already exists that allows you to query trial numbers from the data I have. I could write some code to store the data in an HDF5 or SQLite database file, and then the database could be queried with libraries that already exist instead of me writing custom bits for scraping a directory tree.
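As a rough sketch of the SQLite option, the snippet below uses Python's built-in sqlite3 module. The schema, column names, and rows are assumptions for illustration only and do not reflect the real meta data fields.

```python
import sqlite3

# An in-memory database keeps the sketch self-contained; passing a file
# path instead would persist the database to disk.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE subject (id INTEGER PRIMARY KEY, gender TEXT, mass REAL);
    CREATE TABLE trial (
        id INTEGER PRIMARY KEY,
        subject_id INTEGER REFERENCES subject(id),
        nominal_speed REAL
    );
""")

# Made-up rows: each subject is stored once and referenced by its trials.
conn.executemany("INSERT INTO subject VALUES (?, ?, ?)",
                 [(1, "male", 70.0), (2, "female", 60.0)])
conn.executemany("INSERT INTO trial VALUES (?, ?, ?)",
                 [(10, 1, 0.8), (11, 1, 1.2), (12, 2, 0.8)])


def trials_at_speed(connection, speed):
    """Return the trial numbers recorded at the given nominal speed."""
    cursor = connection.execute(
        "SELECT id FROM trial WHERE nominal_speed = ? ORDER BY id", (speed,))
    return [row[0] for row in cursor]


slow_trials = trials_at_speed(conn, 0.8)
print(slow_trials)
```

With a layout like this, the subject data lives in one table rather than being repeated in every trial's file, and finding trials becomes a query instead of a directory scrape.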

tvdbogert commented 9 years ago

It's OK to have the extra trials as long as it is not a puzzle for the reader to put the complete perturbation study together. Ideally by just extracting the right files, rather than writing code to find them.

Perhaps just this Table to generate for the paper:

column 1: subject id number
columns 2-5: gender, age, mass, height
column 6: 0.8 m/s trial number
column 7: 1.2 m/s trial number
column 8: 1.6 m/s trial number

That presents a nice bird's-eye view of the dataset and helps people find the right files without much trouble.

Ton

moorepants commented 9 years ago

Ok, I'll generate that table.
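
A table with that layout could be assembled along these lines. This is only a sketch with made-up subjects and trial numbers; the field names are assumptions, and if a subject had more than one trial at the same speed this version would silently keep the last one.

```python
SPEEDS = (0.8, 1.2, 1.6)  # nominal walking speeds in m/s


def summary_rows(subjects, trials):
    """Return one row per subject: id, gender, age, mass, height, then the
    trial number at each speed in SPEEDS (None where no trial exists)."""
    trial_by_subject = {}
    for trial in trials:
        speeds = trial_by_subject.setdefault(trial["subject_id"], {})
        speeds[trial["speed"]] = trial["trial"]
    rows = []
    for subject in sorted(subjects, key=lambda s: s["id"]):
        speeds = trial_by_subject.get(subject["id"], {})
        rows.append(
            (subject["id"], subject["gender"], subject["age"],
             subject["mass"], subject["height"])
            + tuple(speeds.get(speed) for speed in SPEEDS))
    return rows


# Hypothetical data: two subjects, one of them missing the 1.6 m/s trial.
subjects = [
    {"id": 2, "gender": "female", "age": 30, "mass": 60.0, "height": 1.65},
    {"id": 1, "gender": "male", "age": 25, "mass": 70.0, "height": 1.75},
]
trials = [
    {"trial": 10, "subject_id": 1, "speed": 0.8},
    {"trial": 11, "subject_id": 1, "speed": 1.2},
    {"trial": 12, "subject_id": 1, "speed": 1.6},
    {"trial": 13, "subject_id": 2, "speed": 0.8},
    {"trial": 14, "subject_id": 2, "speed": 1.2},
]

table = summary_rows(subjects, trials)
for row in table:
    print(row)
```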