CraigIDent / SpliSER

Bioinformatic tool for Splice site Strength Estimation using RNA-seq
14 stars 4 forks source link

Splice-site Strength Estimation using RNA-seq


version 0.1.8 (16th November 2022)


SpliSER quantifies the utilisation of splice sites across the genome. Generating a Splice-site Strength Estimate (SSE) for each individual site.

SpliSER has three commands which are used in succession across an analysis: process, combine, and output

process

The process command generates a file containing the SSE values (an associated information) of each splice site observed in an RNA-seq alignment. You will need to run this command once on each individual sample; it will then produce a .SpliSER.bed file which can be taken as input for the combine command.

Which of my samples do I need to process?
All of them. If you have 3 RNA-seq replicates across 2 conditions, you will need to call process once for each of your 6 samples.

What will I need for this step?

  1. You will need your RNA-seq alignment files in BAM format
  2. An accompanying BED file for each BAM.
  3. You may also want an annotation file for your organism (from the same reference genome used for the alignment step) in GFF or GTF format.

parameters

The process command requires the following three input parameters:

Required Parameter Description
-B   --BAMFile The path to an RNA-seq alignment file, in BAM format
-b   --BEDFile The path to a corresponding splice junctions file, in BED12 format
-o   --outputPath The path to a directory (including sample prefix) where the resulting .SpliSER.bed file will be written

There are several optional parameters, which add gene annotations to the output and allow the analysis to be limited to a particular region:

Optional Parameter Description
-A   --annotationFile The path to a GFF or GTF file mathcing the genome to which your RNA-seq data was aligned
-t   --annotationType The type of feature to be extracted from the annotation file. (Default: gene).
--isStranded Include this flag if your RNA-seq data is stranded, prevents opposite-strand reads from contributing to a site's SSE
-s   --strandedType REQUIRED IF USING --isStranded. Strand specificity of RNA library preparation, where \"rf\" is first-strand/RF and \"fr\" is second-strand/FR.
--beta2Cryptic Calculate SSE of sites taking into account the weighted utilisation of competing splice sites as indirect evidence of site non-utilisation (Legacy).
-c   --chromosome Limit the analysis to one chromosome/scaffold, given by name matching the annotation file eg. '-c Chr1'. required if using -g
-g   --gene Limit the analysis to one locus, given by name matching the annotation file eg. '-g ENSMUSG00000024949'. (If using this parameter you must also specify the --chromosome and --maxIntronSize)
-m   --maxIntronSize only required if using -g This is the maximum intron size used in your alignment (If you're unsure, take a maximum intron size for your species eg. '-m 6000' for A.thaliana or '-m 500000' for M.musculus).


Troubleshooting guide for splice juction .bed files

Issues with the above steps can look like the following: Half of the splice sites didn't get assigned to a gene *There are no Beta1 read counts for any splice site

Any issues please don't hesitate to contact me at cdent @ mpipz . mpg . de


Help for this command can also be viewed in terminal using:

python SpliSER_v0.1.8.py process -h



combine

The combine command takes a set of processed files and combines them into a single experiment. Combine interpolates all of the splice-site data across your samples. If a splice-site was observed in one sample but not another, SpliSER will get the associated read counts in that second sample for that site. The combine command produces a .combined.tsv file which holds a complete set of information for all splice-sites observed across all of your samples. This is usually the longest step.

What will I need for this step?

  1. You will need your processed files from the previous step.
  2. You will need to make your own samples file (detailed below). You can make this in a text editor and save it under whatever name you like. I prefer SamplesFile.tsv.
  3. You will need to have a quick look at the first few rows of the annotation file (if you used one), and record what the first genomic region is usually 'chr1' or '1'.

parameters

The combine command requires the following three input parameters:

Required Parameter Description
-S   --samplesFile The path to a user generated file with three tab-separated columns providing information for each of the processed samples to be combined (see example below)
-o   --outputPath The path to a directory (including sample prefix) where the resulting .combined.tsv file will be written

The samplesFile needs to have three columns and no header row. Each row stores the information for one of the samples you wish to combine.

Column content
1 Sample name - this can be whatever you like, so long as it uniquely identifies each sample
2 Path to processed file - this is the path to the processed SpliSER.tsv file for this sample
3 Path to BAM - this is the path to the BAM file for this sample (SPliSER will need to access it if there was a splice-site observed in another sample, which will then need to be assessed in this sample)

Here is an example for combining four RNAseq samples:

Sample1 /path/to/Sample1.SpliSER.tsv  /path/to/bams/Sample1.bam
Sample2 /path/to/Sample2.SpliSER.tsv  /path/to/bams/Sample2.bam
Sample3 /path/to/Sample3.SpliSER.tsv  /path/to/bams/Sample3.bam
Sample4 /path/to/Sample4.SpliSER.tsv  /path/to/bams/Sample4.bam
Optional Parameter Description
--isStranded Include this flag if your RNA-seq data is stranded, prevents opposite-strand reads from contributing to a site's SSE
-s   --strandedType REQUIRED IF USING --isStranded. Strand specificity of RNA library preparation, where \"rf\" is first-strand/RF and \"fr\" is second-strand/FR.
-g   --gene Limit the analysis to one locus eg. '-g ENSMUSG00000024949'(only use this if you also applied the --gene parameter in the previous process step) (Default: All)
--beta2Cryptic Calculate SSE of sites taking into account the weighted utilisation of competing splice sites as indirect evidence of site non-utilisation (Legacy).


Help for this command can also be viewed in terminal using:

python SpliSER_v0.1.8.py combine -h


combineShallow

For GWAS-type analyses, you may wish to combine hundreds (if not thousands) of RNA-seq samples into a single experiment. The standard combine command will keep each file open as it combines them, but large numbers of files may exceed the file handle limit of your operating system. In UNIX systems you can check this limit with:

ulimit -n

In these circumstances you can use the combineShallow command, this works the same as the combine but works around the file handle limit. The downside being this is more memory intensive.

To improve performance you can run this command with optional parameters, which will filter out some low-coverage sites and save time. It is up to you to decide which thresholds are appropriate for your downstream analyses.

Optional Parameter Description
-m   --minSamples For any given splice site, the minimum number of samples passing the --minReads filter in order for a site to be kept in the analysis - default: 0
-r   --minReads The minimum number of reads giving evidence for a splice site needed for downstream analyses - default: 10
-e   --minSSE The minimum SSE for a splice site to count towards the --minSamples filter - default: 0.00

For example: If you don't plan to analyse splice sites which are not supported by 10+ reads in at least 50 samples. You could select "-m 50 -r 10" to skip over these sites during the combineShallow run. If your sample number is in the 1000s, this will considerably speed up the command. Further, if you don't care to analyse sites which vary from 0.00 to 0.002, you can select "-e 0.05" to ignore those sites which never get an SSE above that threshold.



output

The output command takes a combined file and outputs the data in one of two formats for downstream analysis.
What will I need for this step?
By this step you should already have everything you need

parameters

Required Parameter Description
-S   --samplesFile The path to the samples file you used in the previous step, used here to access the sample names
-C   --combinedFile The path to the combined.tsv file that you generated in the previous step
-t   --outputType 'GWAS' for a SpliSE-QTL analysis, 'DiffSpliSER' for traditional comparison between groups
-o   --outputPath The path to a directory (including sample prefix) where the resulting output file will be written
Optional Parameter Description
-r   --minReads The minimum number of reads giving evidence for a splice site in a given sample, below which SpliSER will report NA. The default is 10 reads.
-g   --gene Output a the data only from a particular locus eg. '-g ENSMUSG00000024949'
-m   --minSamples Only when using --outputType GWAS - the minimum number of samples passing the read filter for a splice site file to be written


Help for this command can also be viewed in terminal using:

python SpliSER_v0.1.8.py output -h

diffSpliSER

We provide an R markdown script for identifying statistically significant differences in splice site usage between samples aligned to the same reference genome.

You'll need to edit some of the code to work for your particular sample names/numbers of replicates, just follow the instrcutions in the comments of the R markdown script.

You'll also need to make tab-separated 'target' file with some additional details about the samples that you are analyzing. We've included a template file with dummy values which you can replace with your own.

Here is the basic layout of the 'target' file:

Column content
1 Sample - this can be whatever you like, so long as it uniquely identifies each sample. These should be in the same order as the samples appear in your SpliSER output.
2 Group - Each sample should belong to one of two groups, you can call the groups whatver you like.
3 Library_size - This is the number of mapped reads in your bam files for these samples. This is used for read count normalisation. You can get this by calling samtools flagstat
4 Description - A column where you can make any notes you want, you can also leave these entries blank with empty quotation marks ("").

Further Information

Please don't hesitate to get in contact with me at cdent @ mpipz . mpg . de