KennthShang / PhaGCN

GCN classifier for phage taxonomy classification
GNU General Public License v3.0

Upgrade the database #1

Closed conchaeloko closed 3 years ago

conchaeloko commented 3 years ago

Hi, I would like to know how to upgrade the database, i.e., how to assign contigs to orders other than Caudovirales.

Thanks for the tool you developed.

KennthShang commented 3 years ago

I am sorry, but we currently only provide the Caudovirales database on GitHub. As mentioned in our paper, there are not enough reference genomes in the other phage families, so the results might not be stable if any of your new classes contains fewer than 10 genomes for training. In the paper, we show that the current database is able to reject contigs from phages outside Caudovirales. Since Caudovirales also contains 95.8% of known phages, we supply only this database to make sure the predictions of PhaGCN are reliable.

However, if you want to extend the database to the other orders, you can revise the code and database with the following steps:

  1. Use the pre-trained CNN model in the 'CNN_Classifier' folder to compress all the node features. The script 'run_CNN.py' will help you complete this task. The compressed feature file plays the same role as "database/dataset_compressF".
  2. Replace the protein file in the database folder, "Caudovirales_protein.fasta". You also need to supply several files that describe your new database, in the same format as the existing ones: "Caudovirales_genome_profile.csv" (which describes how many proteins each genome has), "Caudovirales_gene_to_genomes.csv" (which contains protein IDs, genome names, and protein names), and "reference_name_id.csv" (which gives the line of each genome in your compressed feature file).
  3. Finally, supply 'database/taxonomic_label.csv', which holds the labels for your new classes.

Make sure all the requirements have been updated so that the model can directly use your new database to run. (Also, you might need to change the name of the database file in the code)
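For anyone following these steps, a quick sanity check before running the model might look like the sketch below. It is not part of PhaGCN. The file names come from the steps above, while the "class" column name, the one-row-per-genome layout of taxonomic_label.csv, and the assumption that your compressed feature file keeps the name dataset_compressF are guesses that you may need to adjust.

```python
import csv
from collections import Counter
from pathlib import Path

# File names taken from the steps above; change them if you renamed
# the database files in the code.
REQUIRED_FILES = [
    "Caudovirales_protein.fasta",
    "Caudovirales_genome_profile.csv",
    "Caudovirales_gene_to_genomes.csv",
    "reference_name_id.csv",
    "taxonomic_label.csv",
    "dataset_compressF",  # assumed name of the compressed feature file
]

MIN_GENOMES_PER_CLASS = 10  # threshold mentioned earlier in this thread


def missing_files(database_dir):
    """Return the required file names that are absent from database_dir."""
    root = Path(database_dir)
    return [name for name in REQUIRED_FILES if not (root / name).exists()]


def small_classes(label_csv, label_column="class"):
    """Return {label: count} for labels with fewer than 10 rows.

    Assumes one row per genome and a header that contains label_column.
    """
    with open(label_csv, newline="") as handle:
        counts = Counter(row[label_column] for row in csv.DictReader(handle))
    return {label: n for label, n in counts.items() if n < MIN_GENOMES_PER_CLASS}


if __name__ == "__main__":
    database = Path("database")
    print("missing files:", missing_files(database))
    labels = database / "taxonomic_label.csv"
    if labels.exists():
        print("classes with fewer than 10 genomes:", small_classes(labels))
```

An empty list and an empty dictionary mean that every expected file is present and that no class falls under the 10-genome threshold.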

conchaeloko commented 3 years ago

Thanks a lot for the quick answer. I'll post an update if I manage to make it work.

Ale-Rossi commented 2 years ago

Hello, I've been trying to follow these instructions in order to use a newer and more complete dataset of viral genomes. I've created a new "database" folder, including updated files named Caudovirales_protein.fasta, Caudovirales_genome_profile.csv, Caudovirales_gene_to_genomes.csv, and taxonomic_label.csv, as requested. First of all, I do not understand what the "class" column refers to in the taxonomic_label.csv file, as it clearly does not refer to the taxonomical rank of class. Not knowing any better, I used the same value as the "family" column. Secondly, I've attempted to generate a compressed feature file with the run_CNN.py script. This script, as I found by digging in the source code, produces the files CNN_Classifier/name_list.csv and the pickled file Cyber_data/contig.F. Now, the first file only contains the header and the second contains an empty numpy array. The scripts themselves do not provide documentation and I have no way of understanding what I could do. Could you please provide a more detailed explanation of how to use different genomes as databases and improve the documentation? Thank you.
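In case it helps others reproduce this, the two outputs can be inspected with a short script like the one below. It is only a sketch: the paths are the ones named in this comment, the helper names are made up for illustration, and nothing here is taken from the PhaGCN code itself.

```python
import csv
import os
import pickle


def count_csv_rows(path):
    """Number of data rows in a CSV file, not counting the header line."""
    with open(path, newline="") as handle:
        rows = list(csv.reader(handle))
    return max(len(rows) - 1, 0)


def describe_pickle(path):
    """Load a pickled object and return (type name, shape or length)."""
    with open(path, "rb") as handle:
        obj = pickle.load(handle)  # only unpickle files you produced yourself
    shape = getattr(obj, "shape", None)  # numpy arrays expose .shape
    if shape is None and hasattr(obj, "__len__"):
        shape = (len(obj),)
    return type(obj).__name__, None if shape is None else tuple(shape)


if __name__ == "__main__":
    name_list = "CNN_Classifier/name_list.csv"
    features = "Cyber_data/contig.F"
    if os.path.exists(name_list):
        print(name_list, "data rows:", count_csv_rows(name_list))
    if os.path.exists(features):
        print(features, "contents:", describe_pickle(features))
```

Zero data rows together with a zero-length array would confirm that the script read no input sequences at all, which points at the input location or format rather than at the model.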

KennthShang commented 2 years ago

Hi Ale-Rossi,

For the first question, using the family directly as the class is OK. The original class IDs were used to balance the families (because some families have many more training sequences than others, which might affect the accuracy).

For the second question, the CNN classifier is based on the model used in CHEER. You can check it out.

In addition, our partners have helped improve PhaGCN into a new version, PhaGCN2.0, with the latest database. It can also classify viruses beyond Caudovirales. You may want to take a look.

Best, Jiayu

Ale-Rossi commented 2 years ago

Thank you for your prompt answer, I'm already trying PhaGCN2.0 out. I'd like to reformat some of the code. Unfortunately I don't have much free time, but if I do, I will fork the repo and rework it a bit.

Cheers! Alessandro