marineLZ closed this issue 2 years ago
Hi,
If you want to predict relationships for the known pairs (those tested in the paper, which you can find in the Interactiondata folder), the current database should be fine. However, if you want to predict hosts that are not in these files, you should add more prokaryotic genomes to the folder. For example, if you have metagenomic data and you know which prokaryotes are present in it, you can place their genomes in the folder.
Yes, the GCN will retrain automatically; you do not need to change anything in the code.
Best, Jiayu
Then if I don't know what is in it, would the all-prokaryote database be a more suitable choice? Could you please give more detailed instructions on how to download the genomes listed in all_prokaryote.csv using NCBI datasets and then import them into HostG? Thank you for your kindness.
Yes, I think so.
There are two links in the guideline showing how to download the genomes; you can follow the guideline to place them in the right location.
Best, Jiayu
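For anyone following along, a minimal sketch of the first step: pulling the assembly accessions out of all_prokaryote.csv so they can be downloaded in bulk. This assumes the accession is the first comma-separated field on each row (as in the label line quoted later in this thread) and that any header row does not start with GCA_ or GCF_. The function name `read_accessions` is hypothetical and is not part of HostG.

```python
import csv
import io


def read_accessions(csv_text):
    """Return unique assembly accessions from the first CSV column, in file order.

    Assumption: each data row starts with a GCA_/GCF_ accession, followed by
    taxonomy labels. Rows that do not match (e.g. a header) are skipped.
    """
    accessions = []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # skip blank lines
        acc = row[0].strip()
        if acc.startswith(("GCA_", "GCF_")) and acc not in accessions:
            accessions.append(acc)
    return accessions


# Example row copied from this thread.
sample = (
    "GCA_000006905,Proteobacteria,Alphaproteobacteria,"
    "Caulobacterales,Caulobacteraceae,Caulobacter\n"
)

# One accession per line, ready to be saved to a text file.
print("\n".join(read_accessions(sample)))
```

The resulting list, saved one accession per line, is the kind of input that bulk download tools such as the NCBI Datasets command-line tool accept; check its documentation for the exact options.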
Thanks for your prompt reply.
I was somewhat confused about the database format. Does it have to be one genome per file? Do the filenames matter? For example, for the genome label GCA_000006905,Proteobacteria,Alphaproteobacteria,Caulobacterales,Caulobacteraceae,Caulobacter: I see that the accession ID for Caulobacter in NCBI is AE005673, not GCA_000006905 (so I cannot download it successfully with NCBI Batch Entrez). But in the HostG database the filename is GCA_000006905.fasta while the sequence ID is AE005673. Does this mean that HostG uses the filename to match information? Why use GCA_000006905 instead of AE005673 for the sequence ID and filename?
The difference is that GCA or GCF files contain both the chromosome and plasmid sequences, whereas AE005673 contains only one of them.
Thanks, got it!
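To check your own files against this layout, here is a rough sketch that lists the sequence IDs found inside each .fasta file in a folder, keyed by the filename without its extension. It assumes one assembly per file named `<accession>.fasta`, as described above. The helper names are hypothetical and this is not HostG's own code; whether HostG matches labels by filename is not confirmed in this thread.

```python
from pathlib import Path


def fasta_record_ids(path):
    """Return the ID (first token after '>') of every record in a FASTA file."""
    ids = []
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                parts = line[1:].split()
                ids.append(parts[0] if parts else "")
    return ids


def summarize_folder(folder):
    """Map each file stem (e.g. an assembly accession) to its record IDs."""
    return {
        fasta.stem: fasta_record_ids(fasta)
        for fasta in sorted(Path(folder).glob("*.fasta"))
    }
```

Calling `summarize_folder` on the genome folder shows which files hold several records (for example a chromosome plus plasmids) under a single assembly accession.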
Dear authors, thanks for your nice tool. I have some questions: