PPanG is a precision genome browser that offers a new perspective on the pangenome. Existing pangenome browsers focus on variations of nucleotide sequences, including base mutations and structural variations. However, pangenomic studies of gene-level variations (presence/absence variations (PAV) and gene structural variations (gSV)) rarely go beyond summary statistics, and the details of these variations remain unknown. PPanG provides nucleotide-accurate visualization of both genome sequences and genome annotations, making it possible to analyze genomic variations from the nucleotide level to the gene level.
PPanG is composed of two subviews: a graph view for the whole pangenome in SequenceTubeMap and a linear view for each individual genome in JBrowse2. PPanG is easy to get started with, but interpreting the visualizations takes some effort, because they mix different views (graph & linear views) and different tracks (sequence & annotation tracks). It is therefore strongly recommended to understand the principles of PPanG first. The introduction page is https://cgm.sjtu.edu.cn/PPanG/.
PPanG is divided into three areas: Navigation Area, Visualization Area and Functional Area, and the sd-1 gene region is visualized by default:
Next we will explain these three areas in detail.
Navigation Area allows users to provide a custom region for visualization in addition to the default sd-1 gene region. Navigation takes three simple steps:
1. Select the target chromosome at Data. Pangenome graphs built by both MC and PGGB are available for users to choose.
2. Select the navigation type. There are three navigation types available:
   a) Navigation Type=built-in genes
      Just select from the built-in genes, e.g. the navigation of GS5 on chr05:
   b) Navigation Type=reference annotation gene ID
      Input the target MSU RGAP7 gene ID and select the gene ID from the list, e.g. the navigation of LOC_Os08g03060 on chr08:
   c) Navigation Type=custom region
      Input the target region by coordinates. Format: <sample name.chrxx>:<start coordinate>-<end coordinate>. In this step any individual can be selected as the reference. The "Custom Path" box lists the available paths. Type and select the path name (related samples are suggested as you type), and "Region" is automatically filled with the selected path name. For example, set Data=chr11_mc, Navigation Type=custom region, Custom Path=CHAOMEO.chr11 and fill the region with CHAOMEO.chr11:6541923-6546025:
3. Click the "Go" button and wait a minute for loading. The Visualization Area will then be updated.
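The custom region format described in step 2c can be checked before it is pasted into the "Region" box. The following is a minimal sketch, not PPanG's own code: the names `Region`, `parse_region` and `format_region` are hypothetical, and it assumes that the path name contains no colon or whitespace and that both coordinates are plain integers.

```python
import re
from typing import NamedTuple


class Region(NamedTuple):
    """A region written as <sample name.chrxx>:<start>-<end> (hypothetical helper type)."""
    path: str   # e.g. "CHAOMEO.chr11"
    start: int
    end: int


# Assumption: the path name has no ':' or whitespace; coordinates are plain integers.
_REGION_RE = re.compile(r"^(?P<path>[^:\s]+):(?P<start>\d+)-(?P<end>\d+)$")


def parse_region(text: str) -> Region:
    """Split a region string into path, start and end, rejecting malformed input."""
    match = _REGION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a valid region: {text!r}")
    start, end = int(match["start"]), int(match["end"])
    if start > end:
        raise ValueError(f"start is greater than end: {text!r}")
    return Region(match["path"], start, end)


def format_region(region: Region) -> str:
    """Turn a Region back into the string expected by the "Region" box."""
    return f"{region.path}:{region.start}-{region.end}"
```

For instance, `parse_region("CHAOMEO.chr11:6541923-6546025")` yields the path `CHAOMEO.chr11` with start 6541923 and end 6546025.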
There are several functional buttons beside "Go":
- (start - offset, end - offset). offset is half the length of the target region.
- (start + offset, end + offset). offset is half the length of the target region.
- See FAQ for more details.

The visualization of PPanG is composed of the SequenceTubeMap graph view and the JBrowse2 linear views. For the interpretation of PPanG visualizations, please refer to the PPanG homepage: https://cgm.sjtu.edu.cn/PPanG/. The custom options for visualization are available at Functional Area.
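The offset arithmetic behind the Navigation Area buttons, (start - offset, end - offset) and (start + offset, end + offset), can be worked through with a short sketch. The function name `shift_region` is hypothetical, and taking "half the length" as `(end - start) // 2` is an assumption; PPanG may round differently.

```python
def shift_region(start: int, end: int, direction: int) -> tuple:
    """Shift a region by half its length.

    direction = -1 gives (start - offset, end - offset),
    direction = +1 gives (start + offset, end + offset).
    Assumption: offset is (end - start) // 2, i.e. integer division.
    """
    if direction not in (-1, 1):
        raise ValueError("direction must be -1 or +1")
    offset = (end - start) // 2
    return (start + direction * offset, end + direction * offset)
```

With the example region 6541923-6546025, the offset is 2051, so the two shifted regions are 6539872-6543974 and 6543974-6548076. Consecutive regions therefore overlap by half.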
In the graph view, the pangenome graph of nine representative genomes is visualized by default. The graph is extremely long because every nucleotide is drawn. It can be panned by mouse-dragging, zoomed by mouse-scrolling and compressed by clicking the "Compress" button at Navigation Area.
The "Zoom" button at Navigation Area behaves differently from mouse-scrolling, because sometimes zooming is only needed in the horizontal direction (to compress the graph):
Above the graph is the coordinate axis of the reference (not necessarily the reference genome, but the reference selected in the target region). In SequenceTubeMap, the coordinate axis may be longer than the target region, and the actual start and end coordinates are marked with yellow circles. For example, in the figure below, the yellow circle represents the start coordinate 6,541,923:
By default, the linear view of the reference is visible below the graph view. Linear views for other individuals are added by double-clicking their paths in the graph view. The linear view is used in the same way as the native JBrowse2 component. (More detailed JBrowse2 documentation is available at https://www.jbrowse.org/jb2/docs/.)
In PPanG, the graph view and the linear views are combined in parallel. The views are connected through shared coordinates: every linear view is aligned to the graph view according to the start and end coordinates. In addition, a coordinate change in the graph view triggers synchronous changes in the linear views. (The opposite is unavailable, because the linear views are free to extend outside the target region while the graph view is not.) Consequently, exons at the same position in different views represent the same exon. Note that the coordinate alignment cannot be perfect, because the coordinates of the graph view are not uniformly distributed and occasionally vary.
Functional Area contains several additional features helpful to users:
As the list of reference gene IDs is available in PPanG, it is easy to navigate to a known gene region. This does not work for novel sequences, including distributed genes that are absent in the reference genome but present in other genomes. Therefore, a BLAT server is embedded in PPanG to help locate novel sequences. Input the target sequence and click the "Search" button; the BLAT server will automatically process the alignment result and provide the target region. By default, the novel sequence is searched within the nine reference genomes, and the search range can be expanded to all individuals by selecting "Search all genomes".
Then input the "tRegion" (target region) of the BLAT result with Navigation Type=custom region, as in step 2c, to visualize the custom region in PPanG.
The Annotation Data tab collects genome annotations within the target region. Click the "Download All" button to download the annotation data for each individual.
The legends of all aligned genome tracks in the graph view are shown in this tab. By default, the nine reference genomes are selected, and users can select any individual to add it to the pangenome graph. To visualize all individuals, click "Select all" and wait a moment. Rendering the large pangenome graph may take considerable time and memory.
There are some options to adjust the visualization in PPanG.
For SequenceTubeMap view:
For JBrowse2 views:
Other options for JBrowse2 are available by clicking the button in the top-left corner of each linear view.
The visualization of reference annotation genes (e.g. LOC_Os08g03060) is simple with the steps below:
1. Select the target chromosome at Data;
2. Select reference annotation gene ID at Navigation Type and type the gene ID LOC_Os08g03060 into MSU RGAP7 Gene ID;
3. Click the "Go" button.

In this section, we describe the whole user path for visualizing the distributed gene Xa7.
Xa7 is known to be absent in the reference genome, so Navigation Type=reference annotation gene ID cannot be used for navigation. The BLAT server is therefore needed to locate the Xa7 region. The Xa7 sequence is input into the BLAT server to find the target "tRegion":
As the BLAT result shows, there is only one match, with the "tRegion" of NATELBORO.chr06:28873554-28874897. The region is then provided in step 2c of the Navigation Area:
1. Select the target chromosome at Data;
2. Select custom region at Navigation Type and copy the tRegion NATELBORO.chr06:28873554-28874897 into Region;
3. Click the "Go" button.

The visualization is shown below:
As this figure shows, only one track is visible for the nine reference genomes. That is, the Xa7 sequence is present in only one genome and absent in the other eight. All aligned tracks are listed in the Legend tab:
Click the "Select all" button to visualize all individuals and click the "Compress" button at Navigation Area for a global overview. Double-click some of the paths in the graph view to add them to the linear views. The final visualization is as follows:
Clone the repo and install dependencies:
git clone git@github.com:SJTU-CGM/PPanG.git
cd PPanG/
npm install # or yarn install
tabix (https://github.com/samtools/tabix) and vg (https://github.com/vgteam/vg) also need to be available in your PATH.
The configuration of the SequenceTubeMap view is in src/config.json. dataPath should be set to your own data folder (in PPanG, dataPath is riceData/) and DATA_SOURCES corresponds to the xg files in your dataPath. The reference is set in reference. The name, alias and annotation of reference can be the same. bedFile is only available if vg chunks are pre-processed; otherwise it should be removed. Other detailed configuration is described in SequenceTubeMap.
All genomes and GFF3 annotations are needed in bgzip format with tabix index (*.fasta.gz, *.fasta.gz.gzi, *.fasta.gz.fai, *.gff.gz, *.gff.gz.tbi) in the jbrowse/ folder.
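Before starting the browser, it can help to confirm that every genome has its full set of compressed and indexed files. The sketch below only checks for file presence; the helper name `missing_files` is hypothetical, and it assumes each file is named `<sample><suffix>` (for example `CHAOMEO.fasta.gz`), which may differ from your own naming scheme.

```python
from pathlib import Path

# Suffixes listed in the requirement above.
GENOME_SUFFIXES = (".fasta.gz", ".fasta.gz.gzi", ".fasta.gz.fai")
ANNOTATION_SUFFIXES = (".gff.gz", ".gff.gz.tbi")


def missing_files(folder, sample):
    """Return the names of expected files for `sample` that are absent from `folder`.

    Assumption: files are named <sample><suffix>; adjust if your layout differs.
    """
    folder = Path(folder)
    expected = [
        folder / f"{sample}{suffix}"
        for suffix in GENOME_SUFFIXES + ANNOTATION_SUFFIXES
    ]
    return [path.name for path in expected if not path.is_file()]
```

Calling `missing_files("jbrowse", "CHAOMEO")` returns an empty list when all five files exist for that sample.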
Unfortunately, the total size of all 113 genomes in PPanG exceeds the maximum limit of the BLAT server, so the genomes are divided into 11 database parts. The BLAT server searches part1 (for the nine references) or all parts (for all genomes) for the query sequence and collects the results together. This is why searching all genomes may exceed the time limit. We do not recommend deploying the BLAT server the way PPanG does. For novel sequences, simply run a sequence alignment tool such as BLAST on your local machine and skip the BLAT server.
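If a local BLAST search is used instead of the BLAT server, a hit can be converted into a string for the custom region navigation. The sketch below assumes BLAST's standard 12-column tabular output (`-outfmt 6`), where column 2 is the subject sequence ID and columns 9 and 10 are the subject start and end. It also assumes that the subject sequence names in your BLAST database match the path names in the graph (such as NATELBORO.chr06). The function name `blast_hit_to_region` is hypothetical.

```python
def blast_hit_to_region(line: str) -> str:
    """Convert one line of BLAST tabular output (-outfmt 6) into <path>:<start>-<end>.

    Columns used (1-based): 2 = sseqid, 9 = sstart, 10 = send.
    Minus-strand hits report sstart > send, so the two values are sorted.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 12:
        raise ValueError("expected at least 12 tab-separated columns")
    subject = fields[1]
    start, end = sorted((int(fields[8]), int(fields[9])))
    return f"{subject}:{start}-{end}"
```

The returned string can be pasted into "Region" with Navigation Type=custom region.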
builtin_genes.json and reference_genes.json are required in the src/ folder. The format is:
{
    "<data.xg>": {
        "<gene_id>": "<gene_region>",
        ...
    },
    ...
}
For example, the builtin_genes.json in PPanG looks like:
{
    "chr01_mc.xg": {
        "sd1 (LOC_Os01g66100)": "IRGSP-1.0.chr01:38382381-38385503",
        "Pish (LOC_Os01g57340)": "IRGSP-1.0.chr01:33141126-33145608",
        "Gn1a (LOC_Os01g10110)": "IRGSP-1.0.chr01:5270102-5275677"
    },
    "chr02_mc.xg": {
        "tgw2 (LOC_Os02g52550)": "IRGSP-1.0.chr02:32155212-32156481",
        "GW2 (LOC_Os02g14720)": "IRGSP-1.0.chr02:8114960-8121924",
        "OsGL1-4 (LOC_Os02g40784)": "IRGSP-1.0.chr02:24718046-24724119",
        "tms5 (LOC_Os02g12290)": "IRGSP-1.0.chr02:6397341-6399235",
        "EP3 (LOC_Os02g15950)": "IRGSP-1.0.chr02:9042075-9046140",
        "RSS1 (LOC_Os02g39390)": "IRGSP-1.0.chr02:23769389-23772841",
        "OsSKIPa (LOC_Os02g52250)": "IRGSP-1.0.chr02:31995217-31997696"
    },
    ...
}
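A hand-written gene table is easy to get wrong, so a quick structural check may be useful before deployment. The sketch below validates a table that has already been loaded (for example with `json.load`) against the format shown above. The function name `validate_gene_table` is hypothetical, and the rules it enforces (top-level keys ending in `.xg`, regions shaped like `<path>:<start>-<end>`) are inferred from the examples rather than taken from PPanG's code.

```python
import re
from typing import List

# Assumption: a gene region looks like <path>:<start>-<end> with integer coordinates.
REGION_RE = re.compile(r"^[^:\s]+:\d+-\d+$")


def validate_gene_table(table: dict) -> List[str]:
    """Return a list of human-readable problems; an empty list means the table looks fine."""
    problems = []
    for data_file, genes in table.items():
        if not data_file.endswith(".xg"):
            problems.append(f"{data_file}: key does not end with .xg")
        if not isinstance(genes, dict):
            problems.append(f"{data_file}: value is not an object of gene entries")
            continue
        for gene_id, region in genes.items():
            if not isinstance(region, str) or not REGION_RE.match(region):
                problems.append(f"{data_file}/{gene_id}: bad region {region!r}")
    return problems
```

A typical use is to load `src/builtin_genes.json` with `json.load` and print whatever `validate_gene_table` returns.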
A: This bug is possibly caused by a network failure. Clicking the "reload" button will solve the problem.
A: The library used by the "Download Image" button is designed for everyday use, where users rarely need to download such a large graph. It is not suitable for large pangenome data. As an alternative, we recommend downloading the large graph with a native tool in the Edge/Chrome browser. Press F12 to open the developer tools. Select the Elements tab, double-click the <div id="root">, and double-click the <div> below. Then right-click the <div id="Pangenome browser">...</div> and click Capture node screenshot. The browser will then automatically download the large graph as xx.png.
A: The BLAT server uses caches to store the alignment results. Even if a request exceeds the time limit, its cached results are still saved on the server. If the same request arrives again, the BLAT server uses the caches to avoid redundant calculations. It is therefore feasible to click the "Search" button repeatedly until the search succeeds.
A: From the perspective of the whole pangenome graph, one genome should correspond to only one path exactly. But in a directed cyclic graph, one path may be truncated into different pieces within a target region. We show a simple demo in the figure below:
There are two paths in total in this graph. However, if the right part of the graph is extracted as marked in the figure, there will be many broken lines in the subgraph. In fact, extracting a clean subgraph from the original graph is itself a challenge for pangenome graph algorithms.
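The truncation effect can be reproduced with a toy model. In the sketch below, a path is a list of node IDs and the extracted subgraph is a set of node IDs; these node IDs and the function name `split_path` are made up for illustration and do not reflect how vg extracts chunks internally. A path that leaves the subgraph and re-enters it, as happens when the graph contains a cycle, comes out as several separate pieces.

```python
def split_path(path_nodes, subgraph_nodes):
    """Split a path into the maximal runs of consecutive nodes that lie inside the subgraph."""
    pieces = []
    current = []
    for node in path_nodes:
        if node in subgraph_nodes:
            current.append(node)
        elif current:
            # The path has left the subgraph: close the current piece.
            pieces.append(current)
            current = []
    if current:
        pieces.append(current)
    return pieces


# One path that walks through nodes 2 and 3 twice because of a cycle.
path = [1, 2, 3, 4, 2, 3, 5]
print(split_path(path, {2, 3}))  # [[2, 3], [2, 3]]
```

A single genome path thus appears as two pieces in the extracted region, which is what the broken lines in the visualization represent.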