This is a pipeline for calculating FRiP ("fraction of reads in peaks") from fastq (or bed + fastq).
git clone https://github.com/Fudenberg-Research-Group/fastaFRiP.git
cd fastaFRiP/frip_sm
conda env create -f env/fastq_frip_env.yml -n fastq_frip_env
conda activate fastq_frip_env
All dependencies mentioned below are included in our conda environment, so you don't need to worry about any further installation :)
FASTQ data is available in the Gene Expression Omnibus (GEO). FASTQ is a common format for storing raw sequencing data generated by next-generation sequencing technologies. In GEO, such raw sequencing data are often included as part of the supplementary files associated with a GEO Series (GSE) record.
By clicking on the 'SRA Run Selector', users can select and download specific data (e.g., based on organism, gene, condition, or experiment type) from the Sequence Read Archive (SRA) page.
To quickly get the accession codes of the ChIP-seq experiments, click the "Metadata" button on the page to download "SraRunTable.txt", then run the following command:
grep "ChIP" SraRunTable.txt | awk -F, '{print $1}' > accessions.txt
This command extracts the codes for each file, which can later be used to download the necessary data.
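The same filtering can be sketched in Python, which may be handy when SraRunTable.txt contains quoted fields with embedded commas (these can confuse a plain `awk -F,` split). This is only an illustrative alternative to the command above, not part of the pipeline; the function name `extract_accessions` is hypothetical, and the sketch assumes, like the command above, that the run accession sits in the first column.

```python
import csv


def extract_accessions(table_path, keyword="ChIP"):
    """Return the first column of every data row that mentions `keyword`.

    Assumption: the table is comma-separated with one header row and the
    run accession in the first column, as the grep/awk command implies.
    """
    accessions = []
    with open(table_path, newline="") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row
        for row in reader:
            if row and any(keyword in field for field in row):
                accessions.append(row[0])
    return accessions


# Example usage (writes one accession per line):
# runs = extract_accessions("SraRunTable.txt")
# with open("accessions.txt", "w") as out:
#     out.write("\n".join(runs) + "\n")
```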
We have provided a script, batch_download.sh, to facilitate the data download. The script uses fasterq-dump, which comes as part of sra-tools.
Place the accession list created above (accessions.txt) in the same folder as the script, then run it:
./batch_download.sh
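For readers who prefer to drive the download from Python, the loop below is a rough sketch of what a batch download over an accession list could look like. It is not the content of batch_download.sh; the function names, the `fastq` output folder, and the thread count are all assumptions made for illustration. It relies only on the `--outdir` and `--threads` options of fasterq-dump.

```python
import subprocess
from pathlib import Path


def read_accessions(path):
    """Read one accession per line, ignoring blank lines."""
    lines = Path(path).read_text().splitlines()
    return [line.strip() for line in lines if line.strip()]


def build_command(accession, outdir="fastq", threads=4):
    """Build the fasterq-dump call for a single run accession."""
    return ["fasterq-dump", accession,
            "--outdir", str(outdir),
            "--threads", str(threads)]


def download_all(accession_file, outdir="fastq", threads=4):
    """Download every accession in turn; stop at the first failure."""
    Path(outdir).mkdir(parents=True, exist_ok=True)
    for accession in read_accessions(accession_file):
        subprocess.run(build_command(accession, outdir, threads), check=True)


# Example usage (requires sra-tools on the PATH):
# download_all("accessions.txt")
```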
We got our index files from NCBI or the UCSC Genome Browser. From NCBI, you can either download prebuilt bowtie2 index files, or download the reference genome and build your own bowtie2 index files.
Most of the time, you can use the prebuilt bowtie2 index files directly by unpacking them with the following command:
tar -xvzf GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.bowtie_index.tar.gz
However, some ChIP-seq experiments include a spike-in. In that case, you can build a bowtie2 index that covers two species as follows; here we use hg38 and mm39 as an example:
gunzip GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
gunzip GCA_000001635.9_GRCm39_full_analysis_set.fna.gz
sed -i 's/^>/>hg38_/' GCA_000001405.15_GRCh38_no_alt_analysis_set.fna
sed -i 's/^>/>mm39_/' GCA_000001635.9_GRCm39_full_analysis_set.fna
cat GCA_000001405.15_GRCh38_no_alt_analysis_set.fna GCA_000001635.9_GRCm39_full_analysis_set.fna > hg38_mm39.fna
mkdir hg38_mm39
cd hg38_mm39
bowtie2-build ../hg38_mm39.fna hg38_mm39.bowtie_index
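A combined index is presumably only useful if reads can later be assigned to a species by sequence name, which requires every sequence in each genome to carry its species prefix. The following Python sketch shows one way to tag every FASTA header before concatenating the files. The function name `prefix_fasta_headers` and the output file names in the usage comment are hypothetical, and the sketch is an illustration rather than a step of the pipeline.

```python
def prefix_fasta_headers(src_path, dst_path, prefix):
    """Copy a FASTA file, adding `prefix` to every header line.

    Header lines are the ones starting with '>'. Returns the number of
    headers that were renamed, which is a quick sanity check against the
    expected number of sequences.
    """
    n_headers = 0
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            if line.startswith(">"):
                line = ">" + prefix + line[1:]
                n_headers += 1
            dst.write(line)
    return n_headers


# Example usage (output file names are placeholders):
# prefix_fasta_headers("GCA_000001405.15_GRCh38_no_alt_analysis_set.fna",
#                      "hg38_prefixed.fna", "hg38_")
```

After building the index, checking that the count returned here matches the number of sequences you expect per species is a cheap way to confirm that no header was missed.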
Specify the locations of your input files (FASTQ and Bowtie index files) and output files, and choose whether to include spike-in normalization in the configuration file config.yml. Detailed explanations for each parameter are included in config.yml. If the experiment includes spike-in, set include_spikein to true, and set index_primary and index_spikein according to the experiment.
To rescale the bigwig file and call peaks based on an input (control) sample, the pipeline requires a metadata table with two columns: ChIP and Input. The sample names should match those in the FASTQ files (e.g., SRR5085155.fastq). If a sample does not have an input, exclude it from the table. A Python script for generating such a table is provided: create_frip_table.py.
Once the configuration file is set up, run the following command in the terminal to generate the required BAM/BED files:
snakemake --use-conda --cores $Ncores --configfile config/config.yml
Make sure the requested computing resources are actually available. Tip: [number of cores] = [number of jobs] * [number of processes in config.yml], and [number of cores] must not exceed the total number of CPUs you have.
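The core budget above is simple arithmetic, sketched here in Python for convenience. The helper names are hypothetical and the numbers in the usage comment are made up; the only inputs are the per-job process count you set in config.yml and the number of CPUs on your machine.

```python
def max_parallel_jobs(total_cpus, processes_per_job):
    """Largest number of jobs that fits on the machine at once."""
    if processes_per_job < 1:
        raise ValueError("processes_per_job must be at least 1")
    return total_cpus // processes_per_job


def cores_to_request(n_jobs, processes_per_job, total_cpus):
    """Value to pass to `--cores`: jobs times processes per job.

    Raises ValueError if the request exceeds the CPUs available.
    """
    cores = n_jobs * processes_per_job
    if cores > total_cpus:
        raise ValueError(
            f"{cores} cores requested but only {total_cpus} CPUs available"
        )
    return cores


# Example usage with made-up numbers: 32 CPUs and 8 processes per job
# allow 4 jobs at once, so one would run snakemake with --cores 32.
# import os
# total = os.cpu_count()
```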
To create the metadata file, first modify config/fetch_metadata_config.yml, then run:
python fetch_metadata.py config/fetch_metadata_config.yml
Example metadata table:
After generating the BAM and BED files with the command above, specify the path to the BED file, input data, and output data in the config file config/create_frip_table_config.yml, then use create_frip_table.py to calculate the FRiP value:
python create_frip_table.py config/create_frip_table_config.yml
Example FRiP table:
You can use calculate_frip.py to calculate the FRiP value:
python calculate_frip.py --nproc [number of cpus] [path to the metadata table]
For example:
python calculate_frip.py --nproc 45 /home1/yxiao977/sc1/frip_sm_data/frip_result/Hansen2017/metadata.txt
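For intuition about the number these scripts report, FRiP is the count of reads that fall in peaks divided by the total number of reads. The sketch below computes that ratio for small in-memory lists of intervals. It is not the implementation used by calculate_frip.py or create_frip_table.py; the function names are hypothetical, and the overlap rule (a read counts if it overlaps a peak by at least one base, with BED-style half-open coordinates) is an assumption that the pipeline may handle differently.

```python
from bisect import bisect_right
from collections import defaultdict


def merge_peaks(peaks):
    """Group peaks by chromosome and merge overlapping or touching ones.

    Peaks are (chrom, start, end) tuples with half-open coordinates.
    Returns {chrom: (sorted_starts, sorted_ends)}.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in peaks:
        by_chrom[chrom].append((start, end))
    merged = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        out = [list(intervals[0])]
        for start, end in intervals[1:]:
            if start <= out[-1][1]:
                out[-1][1] = max(out[-1][1], end)
            else:
                out.append([start, end])
        merged[chrom] = ([s for s, _ in out], [e for _, e in out])
    return merged


def frip(reads, peaks):
    """Fraction of reads overlapping any peak by at least one base."""
    if not reads:
        return 0.0
    merged = merge_peaks(peaks)
    in_peaks = 0
    for chrom, start, end in reads:
        if chrom not in merged:
            continue
        starts, ends = merged[chrom]
        # Last merged peak that begins before the read's end coordinate.
        i = bisect_right(starts, end - 1) - 1
        if i >= 0 and ends[i] > start:
            in_peaks += 1
    return in_peaks / len(reads)
```

With two of four reads overlapping a peak, `frip` returns 0.5. Real data would be read from the BAM/BED files produced by the pipeline rather than from Python lists.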