Easily generates RNA structures of short- and long-range intra/intermolecular interactions and homodimers, with an interactive graphical user interface.
A streamlined program for analyzing proximity ligation experiments from input files in fastq or SAM format.
It also creates an interactive GUI and plots differences and similarities between experiments.
RNA adopts an ensemble of structures essential for life processes such as splicing, gene expression, and virus replication. It forms complex secondary structures supported by short- and long-range base-pairing interactions, and tertiary structures supported by long-range interactions, pseudoknots, base triples, and RNA-binding proteins (RBPs). RNA proximity ligation techniques such as CLASH, PARIS, and COMRADES directly detect RNA-RNA interactions and produce a list of chimeric interactions. However, long-range base-pairings are difficult to identify because they are often misidentified by minimum free energy folding algorithms, and interpreting proximity ligation data is challenging because integrated analytical tools are lacking.
Hyb2 is a bioinformatics pipeline for the analysis of RNA proximity ligation experiments in high resolution.
It generates RNA structures with experimental support for short- and long-range intramolecular interactions, intermolecular interactions, and RNA homodimers.
It supports commonly used data formats (SAM) and integrates with a variety of mapping and analysis tools.
The Hyb2 pipeline takes fastq or SAM files as input and, in a single command on the Linux command line, generates a hyb file with information on RNA-RNA interactions along with several visualizations of the data: a contact density map, a viewpoint graph, and an RNA structure.
It also creates a GUI for selecting and visualizing RNA-RNA interactions from contact density maps, with styling options provided by VARNA. The GUI allows specific interactions to be selected directly from the contact density maps and generates the corresponding colour-coded RNA secondary structure in a VARNA GUI pop-up.
Linux/ Mac Operating System with Miniconda Installed
If Miniconda is not installed, see:
https://docs.conda.io/projects/conda/en/latest/user-guide/install/linux.html
blast and bowtie2, if not installed:
wget https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.14.1+-x64-linux.tar.gz
tar -zxf ncbi-blast-2.14.1+-x64-linux.tar.gz
wget https://github.com/BenLangmead/bowtie2/releases/download/v2.3.4.3/bowtie2-2.3.4.3-linux-x86_64.zip
unzip bowtie2-2.3.4.3-linux-x86_64.zip
VARNA, if not installed, read:
or:
git clone https://github.com/yannponty/VARNA.git
cd VARNA
ant compile
ant jar
ant run
Hyb2 can be downloaded into bin from GitHub with:
git clone https://github.com/Jylau14/hyb2.git
Install Hyb2 using:
cd hyb2/
bin/hyb2_install
On MacOS, install using:
cd hyb2/
bin/hyb2_install_macOS
To run hyb2 using a fastq/sam input file, type in the command line:
hyb2 -i testData.fastq/sam -1 Zika_18S.fasta -o run_1 -a ZIKV-PE243-2015_virusRNA -b NR003286.4_RNA18SN5_rRNA -x 7501 -y 501 -l 500 -j ~/VARNA/build/jar/VARNAcmd.jar
To run the program, you first need a sequence alignment map (SAM) file or a fastq file.
If the fastq files from RNA proximity ligation experiments have not been mapped to reference sequences, the program contains a function that generates a SAM file using bowtie2.
The second file needed will contain the reference fasta sequences.
VARNA is used to visualize the RNA structures.
The following ID format is preferred to provide a more complete set of information on the sequences:
>Gene stable ID version, Transcript stable ID version, Gene name, Gene type
E.g.
Downloaded from BioMart (input fasta can be in this format):
>ENSG00000007372.25|ENST00000638963.1|PAX6|protein_coding
After being formatted in the pipeline:
>ENSG00000007372.25_ENST00000638963.1_PAX6_mRNA
Or see:
https://github.com/Jylau14/hyb2/blob/main/data/Zika_18S_formatted.fasta
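The header conversion shown above can be sketched in a few lines of Python. This is only an illustration of the ID format, not the pipeline's own formatting code: the function name `format_biomart_header` is hypothetical, and the only gene type renaming assumed is the `protein_coding` to `mRNA` change visible in the example; other gene types are passed through unchanged.

```python
def format_biomart_header(header: str) -> str:
    """Convert a '|'-separated BioMart FASTA header to the '_'-separated form."""
    fields = header.lstrip(">").strip().split("|")
    if len(fields) != 4:
        raise ValueError(f"expected 4 '|'-separated fields, got {len(fields)}")
    gene_id, transcript_id, gene_name, gene_type = fields
    # Assumption: only protein_coding -> mRNA is documented in the example above;
    # any other gene type is kept as-is in this sketch.
    gene_type = {"protein_coding": "mRNA"}.get(gene_type, gene_type)
    return ">" + "_".join([gene_id, transcript_id, gene_name, gene_type])


print(format_biomart_header(
    ">ENSG00000007372.25|ENST00000638963.1|PAX6|protein_coding"
))
```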
The hyb2 conda environment needs to be activated to make the essential software available.
conda activate hyb2
Hyb2
The command line arguments can be broadly grouped into 3 parts:
-i input_file (fastq/sam)
-1 reference_sequences.fasta used for mapping
-2 2nd_reference_sequences.fasta if different from 1st
-o output_prefix
-v blast_threshold (default value: 0.1)
-m max_overlap (default value: 4)
-h max_hits_per_sequence (default value: 10)
-a gene_ID_of_interest
-b 2nd_gene_ID_of_interest
-q upperlimit_for_heatmap_chimeric_count (default value: 0.95, i.e. the 95% quantile)
-x start_coord_of_1st_gene (cannot exceed the nucleotide length of the 1st gene)
-y start_coord_of_2nd_gene (cannot exceed the nucleotide length of the 2nd gene)
-l lengths_of_fragments
-j location of VARNAcmd.jar
-r toggle interactive mode of VARNA (default: 1, off; set to 0 for a VARNA pop-up)
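To see how a start coordinate and a fragment length combine into a folded region, here is a minimal Python sketch. It assumes 1-based inclusive coordinates, so a start of 1001 and a length of 500 give positions 1001-1500. The function name `fold_window` and the gene length used in the example are hypothetical, and clamping the end to the gene length is an assumption of this sketch rather than documented hyb2 behaviour.

```python
from typing import Tuple


def fold_window(start: int, length: int, gene_length: int) -> Tuple[int, int]:
    """Return the 1-based inclusive (start, end) region for a start and length."""
    if length < 1:
        raise ValueError("length must be at least 1")
    if start < 1 or start > gene_length:
        raise ValueError("start coordinate must lie within the gene")
    # Assumption: a window running past the end of the gene is clamped.
    end = min(start + length - 1, gene_length)
    return start, end


# Hypothetical gene length of 10000 nt, used for illustration only.
print(fold_window(1001, 500, 10000))
```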
RNA Structure Folding from 1001-1500nt positions of Zika virus (ZIKV).
hyb2 -i testData.fastq/sam -1 Zika_18S.fasta -o test_1 -a ZIKV-PE243-2015_virusRNA -x 1001 -l 500 -j ~/VARNA/build/jar/VARNAcmd.jar
RNA Structure Folding of 1001-1500nt positions with 5001-5500nt positions of ZIKV.
hyb2 -i testData.fastq/sam -1 Zika_18S.fasta -o test_2 -a ZIKV-PE243-2015_virusRNA -x 1001 -y 5001 -l 500 -j ~/VARNA/build/jar/VARNAcmd.jar
RNA Structure Folding of 7501-8000nt positions of ZIKV with 501-1000nt positions of 18S rRNA.
hyb2 -i testData.fastq/sam -1 Zika_18S.fasta -2 18S.fasta -o test_3 -a ZIKV-PE243-2015_virusRNA -b NR003286.4_RNA18SN5_rRNA -x 7501 -y 501 -l 500 -j ~/VARNA/build/jar/VARNAcmd.jar
Or if reference sequences are contained in the same file:
hyb2 -i testData.fastq/sam -1 Zika_18S.fasta -o test_3 -a ZIKV-PE243-2015_virusRNA -b NR003286.4_RNA18SN5_rRNA -x 7501 -y 501 -l 500 -j ~/VARNA/build/jar/VARNAcmd.jar
RNA Structure Folding of 3501-3700nt positions of ZIKV with 3501-3700nt positions of a second strand of ZIKV.
hyb2 -i testData.fastq/sam -1 Zika_18S.fasta -o test_4 -a ZIKV-PE243-2015_virusRNA -b ZIKV-PE243-2015_virusRNA -x 3501 -y 3501 -l 200 -j ~/VARNA/build/jar/VARNAcmd.jar
To perform randomized parallel RNA structure folding, which folds the RNA 1,000 times and subsequently scores each structure, a computer cluster that runs qsub is required.
Taking the analysis of intermolecular interactions as an example:
qsub comradesFold2 -c test_3_ZIKV-PE243-2015_virusRNA-7501-8000_NR003286.4_RNA18SN5_rRNA-501-1000.1-1100_folding_constraints.txt -i ZIKV-PE243-2015_virusRNA-7501-8000_NR003286.4_RNA18SN5_rRNA-501-1000_1-1100.fasta -s 1
comradesScore -i test_3_ZIKV-PE243-2015_virusRNA-7501-8000_NR003286.4_RNA18SN5_rRNA-501-1000.basepair_scores.txt -f ZIKV-PE243-2015_virusRNA-7501-8000_NR003286.4_RNA18SN5_rRNA-501-1000_1-1100.fasta
To create an interactive interface pop-up window:
conda activate hyb2_GUI
hyb2_app -i test_1.entire.txt -h test_1.hyb -a ZIKV-PE243-2015_virusRNA -1 Zika_18S.fasta -j ~/VARNA/build/jar/VARNAcmd.jar
To compare 2 different proximity ligation experiments, the program uses DESeq2 to identify differential chimeras and produces a differential coverage map.
To find interactions conserved between the 2 experiments, it looks for overlapping interactions.
Between 2 and 4 replicates can be used as input for each experiment.
The input files for hyb2_compare come from the outputs of the main hyb2 pipeline (*.entire.txt).
For example, to identify the differences and similarities between control and experimental conditions:
hyb2_compare -a control_rep1.entire.txt -b control_rep2.entire.txt -c control_rep3.entire.txt -d control_rep4.entire.txt -i exp_rep1.entire.txt -j exp_rep2.entire.txt -k exp_rep3.entire.txt -l exp_rep4.entire.txt -1 control_rep1.hyb -2 control_rep2.hyb -3 control_rep3.hyb -4 control_rep4.hyb -5 exp_rep1.hyb -6 exp_rep2.hyb -7 exp_rep3.hyb -8 exp_rep4.hyb -0 GENE_NAME -9 ref.fasta -v ~/VARNA/build/jar/VARNAcmd.jar
Automatic plotting of RNA structures can be skipped by omitting the hyb file, gene name, reference fasta, and VARNA options:
hyb2_compare -a control_rep1.entire.txt -b control_rep2.entire.txt -c control_rep3.entire.txt -d control_rep4.entire.txt -i exp_rep1.entire.txt -j exp_rep2.entire.txt -k exp_rep3.entire.txt -l exp_rep4.entire.txt
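The idea of finding conserved interactions by looking for overlaps can be illustrated with a small Python sketch. It is not the method hyb2_compare implements, which is not described here. It assumes each interaction is given as a hypothetical tuple `(x_start, x_end, y_start, y_end)` of 1-based inclusive coordinates, and treats two interactions as overlapping when both arms overlap.

```python
from typing import List, Tuple

# (x_start, x_end, y_start, y_end), 1-based inclusive; a hypothetical layout.
Interaction = Tuple[int, int, int, int]


def spans_overlap(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    """True if two inclusive intervals share at least one position."""
    return a_start <= b_end and b_start <= a_end


def interactions_overlap(a: Interaction, b: Interaction) -> bool:
    """True if both arms of interaction a overlap the matching arms of b."""
    return (spans_overlap(a[0], a[1], b[0], b[1])
            and spans_overlap(a[2], a[3], b[2], b[3]))


def conserved_interactions(exp1: List[Interaction],
                           exp2: List[Interaction]) -> List[Interaction]:
    """Interactions from exp1 that overlap at least one interaction in exp2."""
    return [a for a in exp1 if any(interactions_overlap(a, b) for b in exp2)]
```

A quadratic comparison like this is fine for a sketch; larger datasets would usually call for binning or an interval index.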
The first output is a hyb file that contains sequence identifiers, read sequences, 1-based mapping coordinates, and annotation information for each chimera.
There are 17 columns per read:
Column 1: Unique Sequence Identifier
Column 2: Read Sequence
Column 3: Predicted binding energy in kcal/mol.
Columns 4–9: Mapping information for first fragment of read: name of matched transcript, coordinates in read, coordinates in transcript, mapping score.
Columns 10–15: Mapping information for second fragment of read.
Column 16: Overlap Score
Column 17: Type of Chimera (See chim_types for a visualization of the types of chimera)
https://github.com/Jylau14/hyb2/blob/main/bin/chim_types
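The column layout above can be read into a small Python structure as follows. This is a sketch under stated assumptions: columns are taken to be tab-separated, and each six-column fragment block is assumed to be ordered as transcript name, read start, read end, transcript start, transcript end, mapping score. The class and function names are hypothetical, and the example line is made up for illustration.

```python
from typing import List, NamedTuple


class Fragment(NamedTuple):
    transcript: str
    read_start: int
    read_end: int
    transcript_start: int
    transcript_end: int
    score: str  # kept as text; the exact score format is not assumed


class Chimera(NamedTuple):
    read_id: str
    sequence: str
    energy: str  # predicted binding energy in kcal/mol, kept as text
    first: Fragment
    second: Fragment
    overlap_score: str
    chimera_type: str


def parse_fragment(cols: List[str]) -> Fragment:
    name, r_start, r_end, t_start, t_end, score = cols
    return Fragment(name, int(r_start), int(r_end), int(t_start), int(t_end), score)


def parse_hyb_line(line: str) -> Chimera:
    """Split one hyb line into its 17 columns (assumed tab-separated)."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 17:
        raise ValueError(f"expected 17 columns, got {len(cols)}")
    return Chimera(cols[0], cols[1], cols[2],
                   parse_fragment(cols[3:9]), parse_fragment(cols[9:15]),
                   cols[15], cols[16])


# Made-up example line; the values are placeholders, not real output.
example = "\t".join([
    "read_1", "ACGTACGT", "-12.3",
    "geneA", "1", "25", "1001", "1025", "1e-05",
    "geneB", "26", "50", "501", "525", "2e-05",
    "0", "example_type",
])
print(parse_hyb_line(example).first.transcript_start)
```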
Contact Density Maps
The axes represent positions along the genome, with each spot representing chimeras.
Chimeras ligated in 5'-3' and 3'-5' orientations are plotted above and below the diagonal respectively.
Spots close to the diagonal are short-ranged interactions that can be easily folded into structures, while spots further away are long-ranged interactions.
The contrast of the spots corresponds to the chimeric counts, with darker spots representing a higher count, although the contrast is capped at an upper quantile limit (default 95%).
If 2 genes are given on the command line, the nucleotide positions of the first gene are plotted on the x-axis and those of the second on the y-axis.
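The counting and capping described for contact density maps can be sketched in Python as below. This is an illustration only: the bin size, the function names, and the nearest-rank quantile definition are assumptions of the sketch, and hyb2 may bin and cap differently.

```python
import math
from collections import Counter
from typing import Dict, Iterable, Tuple

Bin = Tuple[int, int]


def contact_counts(pairs: Iterable[Tuple[int, int]], bin_size: int = 10) -> Counter:
    """Count chimeras per (x, y) bin from pairs of nucleotide positions."""
    counts: Counter = Counter()
    for x, y in pairs:
        counts[(x // bin_size, y // bin_size)] += 1
    return counts


def cap_at_quantile(counts: Dict[Bin, int], q: float = 0.95) -> Tuple[Dict[Bin, int], int]:
    """Cap counts at the q-th quantile (nearest-rank) so outliers do not dominate."""
    if not counts:
        return {}, 0
    values = sorted(counts.values())
    rank = max(1, math.ceil(q * len(values)))
    cap = values[rank - 1]
    return {key: min(value, cap) for key, value in counts.items()}, cap
```

Capping keeps a handful of very abundant chimeras from washing out the contrast of every other spot.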
The x-axis represents the nucleotide positions and y-axis represents the frequency of chimeric interactions.
The graph shows the abundance of interactions and their positions along the RNA.
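One plausible way to obtain a per-position frequency of this kind is to count how many chimeric fragments cover each nucleotide. The Python sketch below does that for 1-based inclusive intervals. It is an assumption about how such a profile could be computed, not a description of how hyb2 builds its graph, and the function name is hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple


def position_frequency(intervals: Iterable[Tuple[int, int]]) -> Counter:
    """Count, for each nucleotide position, how many fragments cover it."""
    profile: Counter = Counter()
    for start, end in intervals:
        for pos in range(start, end + 1):  # inclusive end
            profile[pos] += 1
    return profile
```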
To understand the secondary structures, read:
The structures are colour coded based on log2 of supporting reads, with red being the most supported, blue the least, and blank for none.
(VARNA instructions)
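The colour coding by log2 of supporting reads can be illustrated with a short Python sketch. The linear blue-to-red interpolation and the function name `support_colour` are assumptions made for illustration; the actual colour map used in the VARNA output is not specified here.

```python
import math
from typing import Optional


def support_colour(count: int, max_count: int) -> Optional[str]:
    """Map a read count to a hex colour on a log2 scale (blue = least, red = most)."""
    if count <= 0:
        return None  # no supporting reads: left blank
    if max_count <= 1:
        frac = 1.0
    else:
        frac = math.log2(count) / math.log2(max_count)
    frac = min(max(frac, 0.0), 1.0)
    red = int(255 * frac)
    blue = 255 - red
    return f"#{red:02x}00{blue:02x}"
```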
The GUI shows 3 contact density maps. Highlighting a region of the first contact density map (left) zooms into that region (plotted in the middle). Highlighting an interaction on the second plot generates a zoomed-in version of the interaction in the third contact density map (right).
The tables show the coordinates of each interaction together with its chimeric counts. Uncapped counts are the actual chimeric counts, while counts are capped at the 95th percentile.
There are options to select which type of RNA-RNA interaction to generate, the start X (and Y) coordinates, and the length of RNA to fold. The X (and Y) coordinates are filled in automatically from the highlighted interaction, but can also be entered manually.
Clicking the "Fold RNA" button creates a VARNA GUI pop-up with the colour-coded RNA structure generated.
Differential coverage maps are read similarly to contact density maps.
However, they are plotted in two colours, with red marking interactions enriched in one condition and blue those enriched in the other.
Also, instead of reflecting chimeric counts, the contrast here represents the significance (p-value) of significantly differential interactions.
Similarity heatmaps plot the conserved interactions between datasets.