KennthShang / HostG

Graph convolutional neural network for host prediction

Database Extension #5

Closed: marineLZ closed this issue 2 years ago

marineLZ commented 2 years ago

Dear authors, thanks for your nice tool. I have some questions:

  1. Were the results in your paper obtained with all the genomes in all_prokaryote.csv, or with the database on GitHub?
  2. If I download the genomes listed in all_prokaryote.csv, do I have to retrain the model?

KennthShang commented 2 years ago

Hi,

  1. If you want to predict relationships among the known pairs (those tested in the paper, which you can find in the Interactiondata folder), then the current database should be fine. However, if you want to predict hosts that do not exist in these files, you should add more prokaryotic genomes to the folder. For example, if you have metagenomic data and you know which prokaryotes exist in it, you can place their genomes in the folder.

  2. Yes, the GCN will retrain automatically; you do not need to do anything with the code.

Best, Jiayu

marineLZ commented 2 years ago

Then if I don't know what exists in it, would the all_prokaryote database be the more suitable choice? Could you please give some more detailed instructions on how to download the genomes in all_prokaryote.csv using NCBI datasets and then import them into HostG? Thank you for your kindness.

KennthShang commented 2 years ago

Yes, I think so.

There are two links in the guideline showing how to download the genomes; you can follow the guideline and place them in the right place.

Best, Jiayu
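
As a starting point for the download step discussed above, here is a minimal Python sketch that collects the assembly accessions from all_prokaryote.csv into a plain text list, one per line, which could then be fed to a batch download tool such as NCBI datasets. It assumes the accession is the first column of the CSV (as in the label quoted later in this thread) and that rows whose first field is not a GCA/GCF accession, such as a header row, can be skipped. The function names and the output file name are hypothetical and are not part of HostG.

```python
import csv
import re
from pathlib import Path

# Assembly accessions look like GCA_000006905 or GCF_000006905.1
# (the version suffix is optional here; this pattern is an assumption).
ACCESSION_RE = re.compile(r"^GC[AF]_\d{9}(\.\d+)?$")


def read_assembly_accessions(csv_path):
    """Return the assembly accessions found in the first column of the CSV."""
    accessions = []
    with open(csv_path, newline="") as handle:
        for row in csv.reader(handle):
            # Skip empty rows and rows (e.g. a header) without an accession.
            if row and ACCESSION_RE.match(row[0].strip()):
                accessions.append(row[0].strip())
    return accessions


def write_accession_list(accessions, out_path):
    """Write one accession per line, a common input format for batch downloads."""
    Path(out_path).write_text("".join(f"{acc}\n" for acc in accessions))
```

For example, `write_accession_list(read_assembly_accessions("all_prokaryote.csv"), "accessions.txt")` would produce the list file; the download itself should still follow the two links in the HostG guideline.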

marineLZ commented 2 years ago

Thanks for your prompt reply.

I am somewhat confused about the database format. Does it have to be one genome per file? Do the filenames matter? For the genome label GCA_000006905,Proteobacteria,Alphaproteobacteria,Caulobacterales,Caulobacteraceae,Caulobacter I see that the accession ID for Caulobacter in NCBI is AE005673, not GCA_000006905 (thus I cannot download it successfully with NCBI Batch Entrez). But in the HostG database the filename is GCA_000006905.fasta while the sequence ID is AE005673. Does this mean that HostG uses the filename to match information? Why use GCA_000006905 instead of AE005673 for the sequence ID and filename?

KennthShang commented 2 years ago

The difference is that GCA or GCF files contain both the chromosome and the plasmid, whereas AE005673 contains only one of them.
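
To make the distinction concrete, the following Python sketch lists the sequence IDs inside each FASTA file in a genome folder and checks whether the filename stem looks like an assembly accession. A single GCA/GCF file may then show several record IDs (for instance a chromosome record and a plasmid record) under one filename. This is only an inspection aid: the thread does not confirm how HostG itself matches files to labels, and the function names and the `.fasta` extension filter are assumptions.

```python
import re
from pathlib import Path

# Filename stems such as GCA_000006905 or GCF_000006905.1 (pattern is an assumption).
ASSEMBLY_RE = re.compile(r"^GC[AF]_\d{9}(\.\d+)?$")


def fasta_record_ids(fasta_path):
    """Return the sequence ID (first token of each '>' header line) in a FASTA file."""
    ids = []
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                header = line[1:].strip()
                ids.append(header.split()[0] if header else "")
    return ids


def summarize_genome_folder(folder):
    """Map each *.fasta filename stem to its record IDs and an accession check."""
    summary = {}
    for path in sorted(Path(folder).glob("*.fasta")):
        summary[path.stem] = {
            "is_assembly_accession": bool(ASSEMBLY_RE.match(path.stem)),
            "record_ids": fasta_record_ids(path),
        }
    return summary
```

Calling `summarize_genome_folder` on the folder that holds the prokaryotic genomes gives a quick way to compare the filename stems against the first column of all_prokaryote.csv before running the tool.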

marineLZ commented 2 years ago

Thanks, got it!