PhaGCN2 is a GCN-based model that learns species masking features with a deep learning classifier and uses them for taxonomic classification of new viruses. To use PhaGCN2, you only need to provide your contigs as input.
Update Log (August 29, 2024)
--outpath
is the output path for your results. To make the network results easier to inspect, PhaGCN2 now exports the complete network file final_network.ntw. To distinguish test nodes from different batches, test node IDs are renamed as "test" + batch number + sequence ID, for example "test_0_NC_001825.1".
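When post-processing final_network.ntw, it can help to recover the batch number and the original sequence ID from a renamed node. The following is a minimal sketch, assuming the three parts are joined by underscores as in the example "test_0_NC_001825.1"; `RenamedNode` and `parse_test_node` are hypothetical helper names, not part of PhaGCN2.

```python
from typing import NamedTuple


class RenamedNode(NamedTuple):
    batch: int        # batch number the contig was processed in
    sequence_id: str  # original sequence ID of the contig


def parse_test_node(node_id: str) -> RenamedNode:
    """Split a renamed node ID such as 'test_0_NC_001825.1' into its parts."""
    # Split at most twice so underscores inside the sequence ID are preserved.
    prefix, batch, sequence_id = node_id.split("_", 2)
    if prefix != "test" or not batch.isdigit():
        raise ValueError(f"not a renamed test node ID: {node_id!r}")
    return RenamedNode(int(batch), sequence_id)
```

Splitting at most twice matters because sequence IDs such as NC_001825.1 contain underscores themselves.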
Our database has now been updated based on the latest ICTV classification tables. To make it easier to look up the classification of sequences in our database, the sequence information is provided in database/VMR_based_on_ICTV.csv. Because the new ICTV release changed Caudoviricetes (Siphoviridae, Myoviridae and Podoviridae were removed), we provide two extensions of the method:
According to our tests, the extended version of PhaGCN still performs well on the new ICTV labels at the family level. PhaGCN2.0 may misclassify some genera or subfamilies of Caudoviricetes (Bronfenbrennervirinae, Nclasvirinae, Benedictvirus, Fromanvirus, Kroosvirus, Triavirus, Turbidovirus, Veracruzvirus). The PhaGCN2 results suggest that these are very similar to other genera, such as Gladiatorvirus and Backyardiganvirus.
Our web server for phage-related tasks (including phage identification, taxonomy classification, lifestyle prediction, and host prediction) is available! You can visit PhaBOX to use the GUI. We also provide more detailed intermediate files and visualizations for further analysis. A stand-alone version of PhaBOX is also available on GitHub, and it lets you run all these tools at once. We hope you enjoy it!
All these packages can be installed using Anaconda.
If you want to use the GPU to accelerate the program (these packages must be installed if you want to train your own virus classification database):
We recommend installing all packages with Anaconda. After cloning this repository, you can use Anaconda to create the environment from environment.yaml. This installs all the packages you need in GPU mode (make sure CUDA is installed on your system). The commands you need to run are:
cd PhaGCN2.0
conda env create -f environment.yaml -n phagcn2
You need to prepare the database before using it.
cd database
tar -zxvf ALL_protein.tar.gz
cd ..
After that, you can use PhaGCN2 for virus classification.
Each time before you use it, you need to run:
conda activate phagcn2
export MKL_SERVICE_FORCE_INTEL=1
Here we present an example to show how to run PhaGCN2. We provide a file named "contigs.fa" in the GitHub folder; it contains contigs simulated from E. coli phage. The only command you need to run is:
python run_Speed_up.py --contigs contigs.fa --len 8000
There are two parameters for the program:
--contigs is the path to your contigs file.
--len is the length of the contigs you want to predict.
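If you would rather launch the run from a Python script than from the shell, the command and the MKL_SERVICE_FORCE_INTEL setting can be combined as sketched below. This is only an illustration: `build_command` and `run_phagcn2` are hypothetical helper names, and the sketch assumes it is started from the PhaGCN2 folder inside the activated phagcn2 environment.

```python
import os
import subprocess
import sys
from typing import List


def build_command(contigs: str, length: int = 8000) -> List[str]:
    """Return the argument list for the run_Speed_up.py call shown above."""
    return [sys.executable, "run_Speed_up.py",
            "--contigs", contigs, "--len", str(length)]


def run_phagcn2(contigs: str, length: int = 8000) -> int:
    """Run PhaGCN2 in a child process and return its exit code."""
    # Copy the current environment and add the variable that is
    # otherwise exported by hand before each run.
    env = dict(os.environ, MKL_SERVICE_FORCE_INTEL="1")
    return subprocess.run(build_command(contigs, length), env=env).returncode
```

Calling `run_phagcn2("contigs.fa")` would then correspond to the example command with the default length.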
As shown in our paper, recall and precision increase with contig length. We recommend choosing a length that suits your needs. The default length is 8000 bp.
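To choose a length, it can be useful to check how many of your contigs reach a given value before running the program. The sketch below counts sequence lengths in a FASTA file using plain Python. It assumes that --len is applied as a minimum length, which is not spelled out here, and `contig_lengths` and `contigs_at_least` are hypothetical helper names.

```python
from typing import Dict, Iterable, List


def contig_lengths(lines: Iterable[str]) -> Dict[str, int]:
    """Map each FASTA record ID to the length of its sequence."""
    lengths: Dict[str, int] = {}
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # The record ID is the first whitespace-delimited token of the header.
            parts = line[1:].split()
            current = parts[0] if parts else ""
            lengths[current] = 0
        elif line and current is not None:
            lengths[current] += len(line)
    return lengths


def contigs_at_least(lines: Iterable[str], min_len: int = 8000) -> List[str]:
    """Return the IDs of contigs whose length is at least min_len."""
    return [name for name, n in contig_lengths(lines).items() if n >= min_len]
```

For a file on disk, pass the open file handle, for example `contigs_at_least(open("contigs.fa"), 8000)`.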
The shortest supported length is 1700 bp. The output file is final_prediction.csv. This CSV file has three columns: "contig_name, median_file_name, prediction". The provided database supports prediction across all viruses based on the ICTV 2021 report. Prediction results may carry a "Family_like" label: if the predicted label for your virus ends in "_like", your virus belongs to the same order as some viruses in the database but to a different family. A network file is also generated in the Network folder; you can use it to draw your own network map.
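The prediction file can be summarized with the standard csv module. The following sketch counts contigs per predicted label and collects the contigs with a "_like" label. It assumes the file starts with a header row containing the three column names listed above; `summarize_predictions` is a hypothetical helper name.

```python
import csv
from collections import Counter
from typing import Iterable, List, Tuple


def summarize_predictions(lines: Iterable[str]) -> Tuple[Counter, List[str]]:
    """Count contigs per predicted label and list contigs labelled '<family>_like'."""
    # skipinitialspace tolerates a space after each comma in the header or rows.
    reader = csv.DictReader(lines, skipinitialspace=True)
    counts: Counter = Counter()
    like_contigs: List[str] = []
    for row in reader:
        label = row["prediction"].strip()
        counts[label] += 1
        if label.endswith("_like"):
            like_contigs.append(row["contig_name"])
    return counts, like_contigs
```

A typical call would be `summarize_predictions(open("final_prediction.csv", newline=""))`, which returns the label counts together with the names of the "_like" contigs.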
You can now train your own virus classification database. To do so, follow these steps.
First, you need GPU mode (make sure CUDA is installed on your system). Then run:
pip install bio
pip install torch
sudo apt install prodigal
cd CHEER
sh creat.sh
Preprocess your data set:
$ bash code/re_train_script.sh
Train the CNN model on your data set:
$ python3 train.py --n 8 --gpus 1 --weight "1,1,1,1,1,1,1,1"
--n is the number of your families.
--weight is the list of weight coefficients; it must contain exactly as many values as --n.
--gpus is the number of GPUs you have.
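Because the number of values in --weight has to match --n, building the weight string programmatically avoids miscounting. The sketch below is only an illustration; `weight_argument` is a hypothetical helper, and equal weights of 1 are used as the default simply because the example command uses them.

```python
from typing import Optional, Sequence


def weight_argument(n_families: int,
                    weights: Optional[Sequence[float]] = None) -> str:
    """Build the comma-separated value for --weight and check its length."""
    if weights is None:
        # Default to equal weights, as in the example command above.
        weights = [1] * n_families
    if len(weights) != n_families:
        raise ValueError(
            f"expected {n_families} weights, got {len(weights)}")
    # The 'g' format writes 1 and 1.0 as '1' and 0.5 as '0.5'.
    return ",".join(format(w, "g") for w in weights)
```

For example, `weight_argument(8)` produces the string passed to --weight in the command above.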
Training produces two files, Embed.pkl and Params.pkl. Use them to replace the two files with the same names in the CNN_Classifier folder. Training requires around 250 GB of memory (the larger the data set, the more memory is required).
CHEER folder. Then run:
$ python3 deal_all_pre.py
Modify your data set until no label error is reported (each label should contain at least one space). Write each sequence number and its corresponding family name to a tab-separated text file named taxa.txt, then place it in the CHEER folder. Run:
$ python3 deal_result.py
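The taxa.txt file used in this step can be written from a simple mapping. This is a minimal sketch under two assumptions that are not confirmed here: the file has no header row, and each line holds one sequence number and one family name separated by a single tab. `format_taxa` is a hypothetical helper, and the family name in the usage comment is a placeholder.

```python
from typing import Mapping


def format_taxa(families: Mapping[str, str]) -> str:
    """Return tab-separated 'sequence<TAB>family' lines, one per sequence."""
    lines = []
    for sequence, family in families.items():
        # A tab inside either field would break the two-column layout.
        if "\t" in sequence or "\t" in family:
            raise ValueError(f"tab character in entry: {sequence!r}")
        lines.append(f"{sequence}\t{family}\n")
    return "".join(lines)


# Usage (placeholder family name):
# with open("CHEER/taxa.txt", "w") as handle:
#     handle.write(format_taxa({"NC_001825.1": "ExampleFamily"}))
```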
deal_result.py generates a result folder. Use the first line of code.txt in this folder to replace line 159 of run_GCN.py in the main PhaGCN folder, and the second line to replace line 643 of run_Knowledgegraph.py. The other five files replace the corresponding five files in the database folder.
--n
in step two
Copy all_simple_pre.fasta to your PhaGCN folder and run:
$ python3 pre_train.py --contig all_simple_pre.fasta --len 1700
After running it, rename contig.F in the Cyber_data folder to dataset_compressF and replace the file with the same name in the database folder.
(See the CHEER/train_example folder for an example)
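The file replacements in these steps overwrite files shipped with PhaGCN2, so you may want to keep a copy of the originals. The sketch below copies a new file over an existing one and saves the old version with a ".bak" suffix first. The backup and the helper name `install_file` are additions for illustration, not part of the PhaGCN2 procedure, and the sketch copies rather than renames, so the source file stays in place.

```python
import shutil
from pathlib import Path


def install_file(source: Path, target: Path) -> Path:
    """Copy source over target, keeping a '.bak' copy of any existing target."""
    backup = target.with_name(target.name + ".bak")
    if target.exists():
        # Preserve the file shipped with the repository before overwriting it.
        shutil.copy2(target, backup)
    shutil.copy2(source, target)
    return backup


# Usage for the last step, run from the PhaGCN folder:
# install_file(Path("Cyber_data/contig.F"), Path("database/dataset_compressF"))
```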
If you want to use PhaGCN, you need to take care of three things:
export MKL_SERVICE_FORCE_INTEL=1
before running run_Speed_up.py
How to cite this tool:
Jiayu Shang, Jingzhe Jiang, Yanni Sun, Bacteriophage classification for assembled contigs using graph convolutional network, Bioinformatics, Volume 37, Issue Supplement_1, July 2021, Pages i25–i33, https://doi.org/10.1093/bioinformatics/btab293
Jing-Zhe Jiang, Wen-Guang Yuan, Jiayu Shang, Ying-Hui Shi, Li-Ling Yang, Min Liu, Peng Zhu, Tao Jin, Yanni Sun, Li-Hong Yuan, Virus classification for viral genomic fragments using PhaGCN2, Briefings in Bioinformatics, 2022, bbac505, https://doi.org/10.1093/bib/bbac505