An interactive web viewer for reference-projected pangenome graphs
VRPG is an interactive web viewer for reference-projected pangenome graphs. It natively supports graphs in reference Graph Fragment Assembly (rGFA) format; for graphs in Graph Fragment Assembly (GFAv1) format, VRPG provides a command-line tool named gfa2view that transforms the GFA files into an rGFA-like format. VRPG implements a block index system to support real-time navigation of large and complex pangenome graphs built from hundreds of whole-genome assemblies. The coordinate and copy number of each segment in the graph are stored efficiently and can be queried with almost no delay. VRPG aligns the reference nodes along the center line of the viewport, which makes the reference genome easy to recognize. VRPG also provides an intuitive way to compare genomes by highlighting the path of a particular assembly and its orientation on the rendered graph. A website serving four pangenome graphs (one for yeast and three for human) is available at https://www.evomicslab.org/app/vrpg/. The Saccharomyces cerevisiae pangenome graph was generated from 163 assemblies, and the three Homo sapiens pangenome graphs were constructed by the HPRC with three different pipelines (Minigraph, Minigraph-CACTUS and PGGB) from the same dataset of 90 whole-genome assemblies. Users can also deploy the web application and view their own data.
Note: The released version 0.1.3 is not the latest. The latest version of VRPG adds a new tool named 'GraphAnno', which creates indexed annotation files for the reference gene track plot and for the interactive view of the genes that a node overlaps.
For graphs in GFA format, the overlap field in each link line (the overlap between segments) should be specified (in graphs created by Minigraph-CACTUS and PGGB the overlap is generally specified as 0M); otherwise gfa2view sets the value to 0. Although specifying the overlap does not affect the visualization of the graph, it may affect the determination of segment coordinates.
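To illustrate why the overlap can matter for coordinates, here is a minimal sketch that reads the overlap field of a GFA link line and treats a missing value ('*' or an absent field) as 0, as described above. The function names and the simple coordinate rule in next_start are assumptions made for illustration; this is not gfa2view's actual implementation.

```python
import re

# CIGAR operations that consume bases of the first ("from") segment.
_CONSUMING_OPS = {"M", "D", "N", "=", "X"}


def link_overlap(link_line):
    """Return the overlap length (bp) of a GFA 'L' line, or 0 if unspecified."""
    fields = link_line.rstrip("\n").split("\t")
    if len(fields) < 5 or fields[0] != "L":
        raise ValueError("not a GFA link line")
    cigar = fields[5] if len(fields) > 5 else "*"
    if cigar == "*":  # overlap not specified: fall back to 0
        return 0
    if not re.fullmatch(r"(\d+[MIDNSHPX=])+", cigar):
        raise ValueError("malformed overlap: " + cigar)
    ops = re.findall(r"(\d+)([MIDNSHPX=])", cigar)
    return sum(int(n) for n, op in ops if op in _CONSUMING_OPS)


def next_start(prev_start, prev_length, overlap):
    """Hypothetical rule: the next segment starts where the previous one
    ends, shifted back by the overlap."""
    return prev_start + prev_length - overlap


print(link_overlap("L\t1\t+\t2\t+\t0M"))  # 0
print(link_overlap("L\t1\t+\t2\t+\t*"))   # 0
print(next_start(100, 50, link_overlap("L\t1\t+\t2\t+\t5M")))  # 145
```

With an overlap of 0M or '*' the second segment would start at 150 under this rule; with 5M it would start at 145, which is the kind of coordinate shift the note above refers to.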
The graphs created by Minigraph-CACTUS and PGGB include large numbers of SNPs and INDELs, which may obscure structural variations. VRPG (version >= 0.1.3) provides functions to simplify the graph, i.e. to remove nodes related to small variants (size < 50 bp). In the simplification combobox, 'non-ref' simplifies non-reference nodes only, 'all node' simplifies all nodes, both reference and non-reference, and 'none' leaves the graph unsimplified. VRPG can therefore be used to visualize variants at different scales and to find their coordinates relative to the reference conveniently. Furthermore, the function for serving several pangenomes was added back in the latest version of VRPG.
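The three simplification modes can be pictured with the following minimal sketch. It only shows which nodes each mode would target, using the 50 bp threshold stated above; the Node class and the simplify function are hypothetical, and a real simplification (including VRPG's) also has to deal with the edges around removed nodes, which is not shown here.

```python
from dataclasses import dataclass

MIN_SIZE = 50  # bp; shorter nodes are treated as small variants


@dataclass
class Node:
    name: str
    length: int
    is_ref: bool


def simplify(nodes, mode="none"):
    """Return the nodes kept under the given simplification mode."""
    if mode == "none":
        return list(nodes)
    if mode == "non-ref":
        # only small non-reference nodes are removed
        return [n for n in nodes if n.is_ref or n.length >= MIN_SIZE]
    if mode == "all node":
        # small nodes are removed regardless of reference status
        return [n for n in nodes if n.length >= MIN_SIZE]
    raise ValueError("unknown mode: " + mode)


nodes = [
    Node("1", 1000, True),
    Node("2", 1, False),    # e.g. a SNP allele
    Node("3", 20, True),
    Node("4", 300, False),  # e.g. a structural variant
]
print([n.name for n in simplify(nodes, "non-ref")])   # ['1', '3', '4']
print([n.name for n in simplify(nodes, "all node")])  # ['1', '4']
```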
With the cola layout, node size is more closely proportional to segment sequence length, but the layout may take a little longer to stabilize. The cola layout is worth trying when the number of nodes in a window is small.
Python 3 (>= 3.6) and pip are required.
# For installing a historical version please access https://github.com/codeatcg/VRPG/releases and download the source code.
# install the latest version
# zlib
# gcc >= 4.9
pip install Django==3.2.4 pybind11
git clone https://github.com/codeatcg/VRPG --recursive
cd VRPG/module
make
# By default the JavaScript packages that VRPG depends on are loaded from a CDN. Users can also host the packages locally.
python create.local.py
sh host.jslib.local.sh local
# switch to load packages from CDN
sh host.jslib.local.sh cdn
Assembly names should follow the PanSN prefix naming pattern. Briefly, an assembly name consists of a sample name, a delimiter, and a haplotype name, e.g., sampleA#0. VRPG is slightly more permissive: the haplotype name does not have to be numeric; other characters are also allowed. When indexing the graph, users can set the search depth (VRPG version > 0.1.2) with the option '--xDep'. In general, the default value works well. A small value may leave some large bubbles incomplete on the rendered graph. Because the reference genome is laid out linearly on the graph rendered by VRPG, an incomplete bubble and its approximate location relative to the reference genome can generally still be recognized.
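As a quick way to check names before building or indexing a graph, the sketch below splits an assembly name into its sample and haplotype parts. The function name split_assembly_name is hypothetical, and the check reflects only the rule stated above (sample name, '#' delimiter, haplotype name that need not be numeric); VRPG's own validation may differ.

```python
def split_assembly_name(name, delimiter="#"):
    """Split 'sample#haplotype' into (sample, haplotype).

    Raises ValueError if the name does not consist of exactly one
    non-empty sample name and one non-empty haplotype name.
    """
    parts = name.split(delimiter)
    if len(parts) != 2 or not all(parts):
        raise ValueError(
            "expected <sample>" + delimiter + "<haplotype>, got: " + repr(name)
        )
    return parts[0], parts[1]


print(split_assembly_name("sampleA#0"))   # ('sampleA', '0')
print(split_assembly_name("sample1#H1"))  # ('sample1', 'H1')
```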
The assembly-to-graph mapping files are required; without them, an assembly cannot be highlighted in the drawing. These files can be generated by minigraph with the options '-cxasm --vc'. Then run the following command to get the files required by VRPG.
python script/vrpg_preprocess.py --rGFA all.gfa --gafList gaf_file.list --outDir out_folder --index
sample1#H1 sample1.H1.gaf
sample2#0 sample2.0.gaf
sample3#1 sample3.1.gaf
sample3#2 sample3.2.gaf
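A list file like the one above can be written by hand or generated. The following sketch builds the lines from GAF file names, assuming (this is an assumption, not a VRPG requirement) that the files are named <sample>.<haplotype>.gaf and that the two columns are separated by a single space as shown above. The function name gaf_list_lines is hypothetical.

```python
from pathlib import Path


def gaf_list_lines(gaf_paths):
    """Return 'sample#haplotype path' lines for files named
    <sample>.<haplotype>.gaf (assumed naming convention)."""
    lines = []
    for path in sorted(str(p) for p in gaf_paths):
        parts = Path(path).name.rsplit(".", 2)
        if len(parts) != 3 or parts[2] != "gaf" or not parts[0] or not parts[1]:
            raise ValueError("unexpected GAF file name: " + path)
        sample, haplotype = parts[0], parts[1]
        lines.append(sample + "#" + haplotype + " " + path)
    return lines


for line in gaf_list_lines(["sample2.0.gaf", "sample1.H1.gaf"]):
    print(line)
# sample1#H1 sample1.H1.gaf
# sample2#0 sample2.0.gaf
```

Writing the result to disk is then a matter of joining the lines with newlines and saving them as gaf_file.list.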
Run the following command to create the pangenome graph and generate the files required by VRPG.
python script/vrpg_preprocess.py --minigraph '/software/minigraph' --assList ass_file.list --outDir out_folder --index
sample1#H1 sample1.H1.fa
sample2#0 sample2.0.fa
sample3#1 sample3.1.fa
sample3#2 sample3.2.fa
Note: '/software/minigraph' stands for the absolute path of the minigraph executable. The assembly on the first line of ass_file.list is taken as the reference.
For a graph in GFA format to be processed by VRPG, its segment names must be numeric. Fortunately, graphs generated by Minigraph-CACTUS and PGGB have this property. If the segment names are not numeric, users need to modify the graph first. Also note that all path names in the graph should follow the PanSN prefix naming pattern; if they do not, the graph needs to be modified. This can be avoided by using proper assembly names before constructing the graph. If the graph satisfies the conditions described above, run the following command to get the files required by VRPG.
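Both conditions can be checked before running gfa2view. The sketch below scans GFA lines and reports segment names that are not numeric and 'P' line path names that lack a sample#haplotype prefix. The function name check_gfa_names is hypothetical, only 'S' and 'P' lines are inspected, and the PanSN test is deliberately loose; it is an illustration, not the check gfa2view performs.

```python
import re


def check_gfa_names(lines, delimiter="#"):
    """Return (non_numeric_segments, non_pansn_paths) found in GFA lines."""
    bad_segments, bad_paths = [], []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue
        if fields[0] == "S":
            # segment names are expected to be plain digits
            if not re.fullmatch(r"[0-9]+", fields[1]):
                bad_segments.append(fields[1])
        elif fields[0] == "P":
            # path names are expected to start with <sample>#<haplotype>
            parts = fields[1].split(delimiter)
            if len(parts) < 2 or not parts[0] or not parts[1]:
                bad_paths.append(fields[1])
    return bad_segments, bad_paths


example = [
    "H\tVN:Z:1.0",
    "S\t1\tACGT",
    "S\ts2\tAC",
    "P\tsampleA#0#chr1\t1+\t*",
    "P\tchr1\t1+\t*",
]
print(check_gfa_names(example))  # (['s2'], ['chr1'])
```

For a file on disk, passing the open file object (for example the result of open("in.gfa")) as lines works because the function only iterates over it.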
module/gfa2view --GFA in.gfa --ref refName --outDir output_dir --index --range 2000 --thread 10
# gfa2view is flexible. Users can also split the process into two steps.
# step 1: transform and calculate coverage
# This step cannot be parallelized.
module/gfa2view --GFA in.gfa --ref refName --outDir output_dir
# step 2: index
# This step can be parallelized.
module/gfa2view --outDir output_dir --index --range 2000 --thread 10
Splitting the process into two steps lets users test different indexing options and parameters without transforming the graph repeatedly. Note, however, that each indexing run overwrites the previous indexing results.
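The two-step workflow can also be scripted. The sketch below only assembles the command lines, using the options that appear in the commands above, so that the transform step is issued once and the index step can be repeated with different values. The helper names are hypothetical, and the alternative '--range' value is an arbitrary placeholder rather than a recommendation.

```python
import shlex

GFA2VIEW = "module/gfa2view"  # path as used in the commands above


def transform_cmd(gfa, ref, out_dir):
    """Step 1: transform the graph and calculate coverage (run once)."""
    return [GFA2VIEW, "--GFA", gfa, "--ref", ref, "--outDir", out_dir]


def index_cmd(out_dir, range_value, threads):
    """Step 2: index; can be repeated with other values."""
    return [
        GFA2VIEW, "--outDir", out_dir, "--index",
        "--range", str(range_value), "--thread", str(threads),
    ]


def show(cmd):
    """Render a command list as a shell-safe string."""
    return " ".join(shlex.quote(arg) for arg in cmd)


print(show(transform_cmd("in.gfa", "refName", "output_dir")))
for range_value in (2000, 5000):  # 5000 is a placeholder value
    # Each index run overwrites the previous index in output_dir,
    # so copy the results elsewhere first if they should be kept.
    print(show(index_cmd("output_dir", range_value, 10)))
```

To actually run a command, the list can be passed to subprocess.run(cmd, check=True) from the Python standard library.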
Note: in the current version of 'gfa2view', memory consumption is proportional to the number of threads, so a trade-off between speed and memory consumption needs to be considered.
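One way to reason about this trade-off is to cap the thread count by the available memory. The sketch below assumes a simple linear model (total memory = threads x memory per thread), which follows from the statement above; the per-thread figure is not documented here and has to be measured on your own graph, for example from a single-thread run. The function name pick_threads is hypothetical.

```python
import os


def pick_threads(memory_budget_gb, per_thread_gb, max_threads=None):
    """Return the largest thread count that fits the memory budget.

    per_thread_gb is a value measured by the user (assumption: memory
    grows linearly with the number of threads).
    """
    if per_thread_gb <= 0:
        raise ValueError("per_thread_gb must be positive")
    cpu_limit = max_threads or os.cpu_count() or 1
    by_memory = int(memory_budget_gb // per_thread_gb)
    return max(1, min(cpu_limit, by_memory))


# Example with made-up numbers: 64 GB budget, 8 GB measured per thread.
print(pick_threads(64, 8, max_threads=16))  # 8
```

The returned value can then be passed to the '--thread' option.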
If only a particular set of reference chromosomes or contigs is of interest, the option '--refChr' can be used to save running time. The option only affects indexing. It requires a file listing the chromosomes/contigs of interest, one per line.
# Create files for reference gene track plot
# run 'GraphAnno addRef --help' for help
GraphAnno addRef --inGFF gffFile --chrTrans chrTransFile --upDir upload
# Create files for interactive view of genes with which a node overlaps
# run 'GraphAnno nodeGene --help' for help
GraphAnno nodeGene --gffList gffListFile --upDir upload
Move all output files in the 'upload' directory generated during data preparation to the empty 'upload' folder of VRPG. If more than one pangenome is available, rename each 'upload' results directory generated during data preparation and then move the renamed directories into the VRPG 'upload' folder.
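The two cases described above (a single pangenome versus several) can be sketched as follows. The function name install_results and all paths are placeholders; the code simply moves files with the Python standard library and refuses to overwrite an existing pangenome folder.

```python
import shutil
from pathlib import Path


def install_results(result_dir, vrpg_upload, name=None):
    """Move prepared data into VRPG's 'upload' folder.

    name=None : single pangenome; move the files directly into 'upload'.
    name=str  : several pangenomes; move the whole results directory
                into 'upload' under the new name.
    """
    result_dir, vrpg_upload = Path(result_dir), Path(vrpg_upload)
    vrpg_upload.mkdir(parents=True, exist_ok=True)
    if name is None:
        for item in list(result_dir.iterdir()):
            shutil.move(str(item), str(vrpg_upload / item.name))
        return vrpg_upload
    target = vrpg_upload / name
    if target.exists():
        raise FileExistsError(str(target))
    shutil.move(str(result_dir), str(target))
    return target
```

For example, install_results("out_folder/upload", "VRPG/upload", name="yeast") would place the data in VRPG/upload/yeast; both paths and the name "yeast" are illustrative only.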
Start the development server of Django
python3 manage.py runserver
If all is well you will see the output:
Django version 3.2.4, using settings 'primers_project.settings'
Starting development server at http://127.0.0.1:8000/
Note: for a large pangenome graph, it is better to prepare the data on a computing server and then transfer the data to the local 'upload' folder of VRPG.
If the server is running on a different machine, start it by running
python3 manage.py runserver 0.0.0.0:8000
Please make sure the firewall does not block the port. Then open http://<IP of server>:8000/app/vrpg/.
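If the page does not load, a quick reachability check from the client machine can help tell a firewall problem from an application problem. The sketch below builds the address given above and tries to fetch it with the Python standard library; the helper names are hypothetical and the host value is a placeholder for your server's IP.

```python
from urllib.error import URLError
from urllib.request import urlopen


def vrpg_url(host, port=8000):
    """Build the VRPG address for a server started with runserver."""
    return "http://{}:{}/app/vrpg/".format(host, port)


def is_reachable(url, timeout=5.0):
    """Return True if the URL answers with a successful HTTP status."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (URLError, OSError):
        # connection refused, timeout, blocked port, HTTP error, ...
        return False


print(vrpg_url("127.0.0.1"))  # http://127.0.0.1:8000/app/vrpg/
```

Calling is_reachable(vrpg_url("<IP of server>")) with the real address returns False when the port is blocked or the server is not running.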
If you are familiar with Nginx or Apache, you can also deploy VRPG with either of them.
bedtools intersect -wa -wb -a node.pos -b genome.gff3