caporaso-lab / student-microbiome-project

Central repository for data and analysis tools for the StudentMicrobiomeProject.

Added script to map Dan's disturbance data #17

Status: Closed (jairideout closed this 11 years ago)

jairideout commented 11 years ago

Added script to map Dan's disturbance data into 3 new columns in the SMP mapping file (antibiotic, sickness, and menstruation disturbances).

This is a very quick-and-dirty script with no optparse interface or unit tests like we'd see in QIIME. If necessary, I can add these (this pull request doesn't need to get merged until we're happy with the script). I also need to check with @floresg in our meeting to see if I'm mapping the weeks correctly between what Dan recorded and the WeeksSinceStart column.

I've also included the input (TSV exports of the SMP mapping file and Dan's disturbance table) and the output of running the script (new_smp_map.txt).

This is related to issue #16, though it does not resolve it as there are multiple parts to the issue.

gregcaporaso commented 11 years ago

Thanks @jrrideout. If you take a minute during the call to describe how this is being mapped, @floresg can confirm or correct it as necessary. We'll merge after that.

jairideout commented 11 years ago

I added a new script called convert_disturbance_list_weeks.py, which writes out a new disturbance table (TSV) with standard PIDs (i.e. school name plus 3-digit ID) and with Dan's weeks mapped to WeeksSinceStart.

Run it like so:

python convert_disturbance_list_weeks.py

It assumes the disturbance list is in TSV format, in the current directory, and called disturbance_list.txt. It also assumes the existence of smp_map.txt, which is the SMP mapping file (in TSV format). It outputs a new TSV file called new_disturbance_list.txt with the updated week mappings and PIDs.
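For illustration, the TSV input and output handling described above could look roughly like the following. This is a minimal sketch in Python 3 using the standard csv module, not the script's actual code; the helper names read_tsv and write_tsv are hypothetical.

```python
import csv


def read_tsv(path):
    """Read a tab-separated file into a list of rows (lists of strings)."""
    with open(path, newline="") as handle:
        return list(csv.reader(handle, delimiter="\t"))


def write_tsv(path, rows):
    """Write rows (lists of strings) to a tab-separated file."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerows(rows)
```

With helpers like these, the script would read disturbance_list.txt and smp_map.txt from the current directory and write its result to new_disturbance_list.txt.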

The script performs a sanity check on the data: for each individual, a given WeekDescription must always map to the same WeeksSinceStart value (i.e. the mapping is not ambiguous). It will throw an error with the current SMP mapping file because of that bad mapping you found earlier today with one of the NCS individuals. That mapping should be corrected before continuing; once it is, the script will run to completion.
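The ambiguity check could be sketched as follows. This is an illustration under assumptions, not the code in the pull request: the function name check_unambiguous and the (PID, WeekDescription, WeeksSinceStart) record layout are hypothetical.

```python
def check_unambiguous(records):
    """Build a (pid, week_description) -> weeks_since_start lookup.

    records is an iterable of (pid, week_description, weeks_since_start)
    tuples, an assumed layout. Raises ValueError if the same individual
    and week description map to two different WeeksSinceStart values.
    """
    mapping = {}
    for pid, description, weeks_since_start in records:
        key = (pid, description)
        if key in mapping and mapping[key] != weeks_since_start:
            raise ValueError(
                "Ambiguous mapping for %s, %r: %r vs %r"
                % (pid, description, mapping[key], weeks_since_start)
            )
        mapping[key] = weeks_since_start
    return mapping
```

Repeated rows that agree with each other pass silently; only a conflicting value for the same individual and week description stops the run.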

It will also print a warning if a PID found in Dan's table is not in the SMP mapping file. This happens for a handful of the cells. In that case, two question marks are prepended to the cell's original contents.

If a week in Dan's table doesn't have a mapping for that individual in the SMP mapping file (e.g. the 0.5, 7.5, etc. weeks), a single question mark is prepended to the cell, and the original weeks for that cell are left intact. The PID is still updated, however, to include the school code and 3-digit ID.
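The two marking rules above can be summarized in a short sketch. Everything here is an assumption made for illustration: the function names, the lookup dictionaries, and especially the "PID:week,week" cell layout, since the real format of Dan's table is not shown in this thread.

```python
import sys


def format_cell(pid, weeks):
    """Hypothetical cell layout: 'PID:week,week,...'."""
    return "%s:%s" % (pid, ",".join(weeks))


def convert_cell(raw_cell, raw_pid, weeks, pid_lookup, week_lookup):
    """Convert one disturbance-table cell.

    pid_lookup maps Dan's PID to the standard PID; week_lookup maps
    (standard PID, Dan's week) to WeeksSinceStart. Both are assumed shapes.
    """
    if raw_pid not in pid_lookup:
        # Unknown individual: warn and keep the original contents, marked '??'.
        sys.stderr.write("Warning: PID %s not in mapping file\n" % raw_pid)
        return "??" + raw_cell
    pid = pid_lookup[raw_pid]
    try:
        mapped = [week_lookup[(pid, week)] for week in weeks]
    except KeyError:
        # Unmapped week: keep the original weeks, but use the standard PID.
        return "?" + format_cell(pid, weeks)
    return format_cell(pid, mapped)
```

Under these assumptions, a cell for an unknown PID comes back untouched apart from the leading "??", while a cell with an unmapped week such as 0.5 keeps its weeks, gains the standard PID, and starts with "?".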

This code was thrown together very hastily, it sucks, and it has not been extensively tested. Greg, since I'll be traveling, feel free to merge the pull request and then make changes as necessary to either or both scripts. Sorry to leave this in such a state, but hopefully this will help with the mappings! Please don't hesitate to contact me if you have questions or run into issues (I'll try my best to reply ASAP).

Thanks!