This repository is designed to allow for the replication of DeBoever et al. 2015 starting after some of the time intensive steps such as read-mapping, expression estimation, etc. Information on the time intensive steps is available in the methods section of the paper.
This git repository holds IPython notebooks and code needed to replicate the study. Additional data are available in this Figshare fileset and must be downloaded before attempting to replicate the study. After cloning this GitHub repository, you can use the figshare_download notebook to download the data from Figshare.
If you have any trouble replicating the results, please let me know using GitHub's issue tracker. If you are having trouble installing dependencies, try the help resources for that software.
Here are the steps that you'll need to follow for inspecting the code and replicating the study:
Run the figshare_download notebook to download data from Figshare.
Run the ext_data notebook to download data from external sources.
You can clone this repository using the button on the side of the page. It is important that you don't change the name of the repository (i.e. deboever-sf3b1-2015).
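If you want to confirm that your clone kept the expected directory name, a check along these lines may help. This is only a sketch: the helper name `repo_name_ok` is hypothetical and is not part of this repository, and it only compares directory names.

```python
import os

# Repository name given in this README.
EXPECTED_NAME = "deboever-sf3b1-2015"


def repo_name_ok(path, expected=EXPECTED_NAME):
    """Return True if the final component of `path` equals `expected`.

    Hypothetical helper for illustration; not part of the project code.
    """
    # abspath also strips any trailing path separator before basename runs.
    return os.path.basename(os.path.abspath(path)) == expected
```

From the top-level directory of the clone, `repo_name_ok(os.getcwd())` should return True if the directory was not renamed.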
There are a few non-Python dependencies but these are usually only used in one notebook. I'll point those out in the respective notebooks.
A working IPython notebook environment is needed along with some of the common scientific Python packages that you likely already have as part of a working IPython notebook environment. I recommend using Anaconda Python since it includes most of the needed packages. You can create an appropriate conda environment named sf3b1 using
conda create --name sf3b1 --file conda_env.txt
Besides the default Anaconda packages, you will need:
pybedtools (0.6.6)
cdpybio
figshare
pybeeswarm
rpy2 (2.3.9)
You can get pybedtools through pip. cdpybio, figshare, and pybeeswarm are included in this repository as submodules. After cloning this repository from GitHub, change into the repo directory and run:
git submodule init
git submodule update
You will then need to install the Python packages using python setup.py install. These packages are not used in every notebook, so you can probably get away without installing some of them if you only want to run certain notebooks.
You will also need to install the project-specific Python package ds2014 from this repository. From the deboever-sf3b1-2015 directory, you can change into ds2014 and install using python setup.py install, or python setup.py develop if you think you may want to make changes to the ds2014 package and have these changes instantly propagated without re-installing the package.
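Since not every notebook needs every package, it can be handy to see which ones are currently importable. The following is a rough sketch, not part of this repository. It assumes Python 3.4 or later (for `importlib.util.find_spec`) and assumes each package imports under the name listed in this README, which may not hold for all of them (pybeeswarm, for example, may import under a different module name).

```python
import importlib.util


def missing_packages(names):
    """Return the subset of top-level module `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed import names, taken from the package names listed in this README.
# Adjust any entry whose import name differs from its package name.
REQUIRED = ["pybedtools", "cdpybio", "figshare", "pybeeswarm", "rpy2", "ds2014"]

if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    print("Missing packages:", ", ".join(missing) if missing else "none")
```

A package reported as missing only matters if the notebook you plan to run imports it.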
After cloning the repository and installing the dependencies, you should be able to just run the figshare_download and ext_data notebooks to download the necessary data.
The IPython notebooks for this project are loosely ordered because some notebooks rely on the output of other notebooks. However, all of the intermediate files are downloaded from Figshare, so you should be able to run any notebook once you've downloaded all of the files from Figshare. Note that some notebooks will not actually recreate their output files if the files already exist:
if not os.path.exists([some output file]):
    [do analysis]
else:
    [read output file that already exists]
You can delete (or better, rename or move) output files to ensure they are recreated when you run a given notebook.
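The check-then-compute pattern described above can be sketched as a small helper. This is an illustration under stated assumptions, not code from the notebooks: the function name `load_or_compute`, the `compute` callback, and the CSV output format are all hypothetical, and the sketch assumes Python 3.

```python
import csv
import os


def load_or_compute(path, compute):
    """Return rows from `path` if it exists; otherwise compute and save them.

    `compute` is a hypothetical zero-argument function that returns a list of
    rows (each row a list of strings). Neither name comes from the notebooks.
    """
    if not os.path.exists(path):
        rows = compute()  # the time-intensive analysis step
        with open(path, "w", newline="") as f:
            csv.writer(f).writerows(rows)
        return rows
    # The output file already exists: read it back instead of recomputing.
    with open(path, newline="") as f:
        return [row for row in csv.reader(f)]
```

With a helper like this, renaming or moving the file at `path` is what forces the analysis to run again on the next call, which is why moving output files makes a notebook regenerate them.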
The data directory is downloaded from Figshare. It contains primary data files (i.e. those that aren't created by any of the code here). You shouldn't delete or alter these.
The ds2014 directory contains a Python package specific to this project. See the Dependencies section of this README for installation instructions.
This directory holds data downloaded externally that I trust will still be available in the future. These data are downloaded using the ext_data notebook. There may be some other notebooks that store data in this directory as well but I tried to move everything into the ext_data notebook.
IPython notebooks tracked by git. Use these to rerun different analyses.
Output from IPython notebooks. Contains intermediate files (i.e. files that are created using the primary data and other external data) as well as images, tables, etc. For figures that require some Illustrator manipulation (currently Figures 2 and 3), I try to copy the final figure into the output folder as well.
This directory contains non-Python software. Most of this software is only used in one notebook, so the details are contained within the respective notebooks.
The notebook numbers_from_paper.ipynb contains commands that print most of the numbers/statistics in the paper. Numbers in the figure legends, however, are printed out in the figure notebooks (although they may be duplicated in numbers_from_paper.ipynb).
Each figure has its own notebook (e.g. figure01.ipynb or sfigure01.ipynb). These notebooks create the entire figure in one go except for a couple cases where some parts of the figure have to be inserted manually. The things to be inserted manually are either created by the notebook and saved in its output folder or they are available in the data directory from Figshare.
In the manuscript, I refer to 3' splice sites only as 3' splice sites or 3'SSs. However, in the code and notebooks for this project, I sometimes refer to them as acceptors. Similarly, I refer to 5' splice sites as donors in the code and notebooks in some cases. I generally try to use "junction" to refer to a gap spanned in the RNA-seq data and "splice site" to refer to an annotated or cryptic splice site.
In the manuscript, I refer to 3'SSs or other features annotated in Gencode as "canonical." However, I may refer to these as "annotated" (or "annot" for short) in the code and notebooks at various points. I also refer to cryptic 3'SSs as "novel" splice sites in some places.