Charlene Bultoc has just started a post-doc at an important neuroscience institute. She is doing research on a new methodology to analyse signals in our brains detected through a combination of CT and MRI. Using image-processing techniques she can simplify the whole dataset into a grid of 20x20 arrays.
Her theory is that the average of such signals through the sagittal plane is constant over time, so she has written some software to calculate this. She decided to write that software in Python so she could share it (via the sagittal_average repository on GitHub) with people from other labs. She didn't know as much Python when she started as she does now, so you can see that evolution in her program.
Charlene is an advocate of reproducibility, and as such she has been keeping track of which version of the software she ran for each of her results. "That's better than keeping just the date!" you can hear her saying. So for each batch of images she processes she creates a versions.txt file recording the version of sagittal_average that produced it.
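One simple way to produce such a record, sketched here on the assumption that the analysis is run from inside a clone of the repository, is to store the commit hash of the code used for the batch:

```
# record the exact commit of sagittal_average used for this batch of images
git rev-parse --short HEAD > versions.txt
```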
With that information she can go and run the same analysis again and again and be as reproducible as she can.
However, she's found that sagittal_average has a problem... and she needs to re-analyse all the data produced since that bug was introduced. Re-running the analysis for everything she's produced is not viable: each run takes three days to execute (assuming the university cluster has resources available), and she has more than 300 results.
Every version of the program reads and writes CSV files. Charlene has improved the program considerably over time but kept the same defaults (specifically, an input file, brain_sample.csv, and an output file, brain_average.csv). She has always "tested" her program with the brain_sample.csv input file provided in the repository. However (and that's part of the problem!), the effect of the bug is not noticeable with that file.
We can then help her either by letting her use our laptops or (better) by finding when the bug was introduced and re-running only the results that need to be re-analysed.
Finding when the bug was introduced seems the quickest way. Download the repository with her sagittal_average.py script and use git bisect to find the commit at which the script started to give wrong results. Do it manually first (as explained in this section of the notes).

Steps to help Charlene:

1. Create a new input file to figure out what the bug is. Hint: you can generate an input file that does show the error using the code snippet below this list. You may need to create the brain_sample.csv file again each time you move through the commits.

2. Use bisect manually until you find the introduction of the error. Take note of the hash and date of the commit that introduced the bug - you will need this information in class.
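One possible generator for such an input file, sketched here assuming NumPy is installed, is:

```python
import numpy as np

# A 20x20 grid in which every row is different, so that a row being
# skipped or double-counted shows up clearly in the computed averages.
# The exact values are arbitrary.
data = np.arange(20 * 20).reshape(20, 20)

# Save it under the default input name the script expects.
np.savetxt("brain_sample.csv", data, fmt="%d", delimiter=",")
```

With a file like this you can work out by hand what the averages should be and compare them with the brain_average.csv that each commit produces.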
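For the bisect itself, the usual manual workflow is to mark one commit known to be good and one known to be bad, test the commit git checks out for you, and repeat. A sketch of such a session (the good revision here is a placeholder for an early commit you have verified):

```
git bisect start
git bisect bad                     # the latest commit gives wrong results
git bisect good <early-good-commit>

# git checks out a commit half-way through the history: regenerate
# brain_sample.csv, run sagittal_average.py, inspect brain_average.csv,
# then report the outcome with one of:
git bisect good
git bisect bad

# keep going until git announces the first bad commit, then clean up:
git bisect reset
```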