conchaeloko closed this issue 3 years ago
I am sorry, but we currently provide only the Caudovirales database on GitHub. As mentioned in our paper, there are not enough reference genomes in other phage families, so the results might not be stable if your new classes contain fewer than 10 genomes for training. In the paper, we show that the current database can reject contigs from other phages (those not in Caudovirales). Because Caudovirales also contains 95.8% of known phages, we supply only this database to make sure PhaGCN's predictions are reliable.
However, if you want to extend the database to the other orders, you can revise the code and database with the following steps:
Make sure all the requirements have been updated so that the model can directly use your new database to run. (Also, you might need to change the name of the database file in the code)
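As a rough illustration of the 10-genome threshold mentioned above, the following sketch counts how many genomes carry each label in a label table and reports the labels that fall below the threshold. It is not part of PhaGCN. The `class` and `family` column names come from this thread, while the `contig_id` column, the example rows, and the `small_classes` helper are assumptions made for the example.

```python
import csv
import io
from collections import Counter

MIN_GENOMES = 10  # threshold quoted in the reply above


def small_classes(label_file, column="class", minimum=MIN_GENOMES):
    """Return {label: count} for labels with fewer than `minimum` rows.

    `label_file` is any open text file (or file-like object) holding a CSV
    with a header row that contains `column`.
    """
    reader = csv.DictReader(label_file)
    counts = Counter(row[column] for row in reader)
    return {label: n for label, n in counts.items() if n < minimum}


# Tiny in-memory stand-in for a label table (made-up rows, hypothetical layout).
example = io.StringIO(
    "contig_id,family,class\n"
    + "".join(f"g{i},Siphoviridae,Siphoviridae\n" for i in range(12))
    + "".join(f"m{i},Microviridae,Microviridae\n" for i in range(3))
)
print(small_classes(example))  # {'Microviridae': 3}
```

To run it on a real file, open your own label CSV with `open(path, newline="")` and pass the file object to `small_classes`. Adjust `column` if your header differs.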
Thanks a lot for the quick answer. I'll post an update if I manage to make it work.
Hello,
I've been trying to follow these instructions in order to use a newer and more complete dataset of viral genomes.
I've created a new "database" folder, including updated files named `Caudovirales_protein.fasta`, `Caudovirales_genome_profile.csv`, `Caudovirales_gene_to_genomes.csv`, and `taxonomic_label.csv`, as requested.
First of all, I do not understand what the "class" column refers to in the taxonomic_label.csv file, as it clearly does not refer to the taxonomical rank of class. Not knowing any better, I used the same value as the "family" column.
Secondly, I've attempted to generate a compressed feature file with the `run_CNN.py` script. This script, as I found by digging in the source code, produces the files `CNN_Classifier/name_list.csv` and the pickled file `Cyber_data/contig.F`.
Now, the first file only contains the header and the second contains an empty numpy array.
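For anyone who wants to confirm the same symptom, here is a minimal sketch that counts the data rows in a CSV and reports the shape of a pickled array. It assumes only what is described above: a CSV with a header row, and a pickle that holds a numpy array. The helper names and the in-memory stand-ins are hypothetical. Only unpickle files you produced yourself, since `pickle.load` can execute arbitrary code.

```python
import csv
import io
import pickle

import numpy as np


def count_data_rows(csv_file):
    """Count non-empty rows after the header of an open CSV file."""
    reader = csv.reader(csv_file)
    next(reader, None)  # skip the header row, if there is one
    return sum(1 for row in reader if row)


def describe_pickled_array(binary_file):
    """Load a pickled object and return its (shape, size) as an array."""
    arr = np.asarray(pickle.load(binary_file))
    return arr.shape, int(arr.size)


# In-memory stand-ins reproducing the situation described above.
header_only_csv = io.StringIO("name\n")  # hypothetical header
empty_pickle = io.BytesIO(pickle.dumps(np.array([])))

print(count_data_rows(header_only_csv))      # 0
print(describe_pickled_array(empty_pickle))  # ((0,), 0)
```

With real files, open the CSV in text mode (`newline=""`) and the pickle in binary mode (`"rb"`) before passing them in.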
The scripts themselves do not provide documentation and I have no way of understanding what I could do. Could you please provide a more detailed explanation of how to use different genomes as databases and improve the documentation?
Thank you.
Hi Ale-Rossi,
For the first question, directly using the family as the class is OK. The previous class IDs were used to balance the families (some families have many more sequences for training, which might affect the accuracy).
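A minimal sketch of the "use the family as the class" suggestion, assuming the label table is a CSV with a header row that includes a `family` column. The `copy_family_to_class` helper and the `contig_id` column in the example are hypothetical, so check them against your own file.

```python
import csv
import io


def copy_family_to_class(src, dst):
    """Copy a label CSV from `src` to `dst`, setting `class` equal to `family`.

    Both arguments are open text files (or file-like objects). The `class`
    column is appended if it is missing and overwritten if it is present.
    """
    reader = csv.DictReader(src)
    fieldnames = list(reader.fieldnames or [])
    if "class" not in fieldnames:
        fieldnames.append("class")
    writer = csv.DictWriter(dst, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row["class"] = row["family"]  # raises KeyError if there is no family column
        writer.writerow(row)


# Made-up rows standing in for a real label table.
source = io.StringIO("contig_id,family\ng1,Myoviridae\ng2,Podoviridae\n")
target = io.StringIO()
copy_family_to_class(source, target)
print(target.getvalue())
```

Writing to a separate output file rather than editing in place keeps the original table available for comparison.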
For the second question, the CNN classifier is based on the model used in CHEER. You can check it out.
In addition, our partners helped upgrade PhaGCN to a new version, PhaGCN2.0, with the latest database. It can also classify viruses beyond Caudovirales. You may want to take a look.
Best, Jiayu
Thank you for your prompt answer, I'm already trying PhaGCN2.0 out. I'd like to reformat some of the code. Unfortunately I don't have much free time, but if I do, I will fork the repo and rework it a bit.
Cheers! Alessandro
Hi, I would like to know how to upgrade the database, i.e., how to assign contigs to orders other than Caudovirales.
Thanks for the tool you developed.