cultivarium / GenomeSPOT

Predict oxygen, temperature, salinity, and pH preferences of bacteria and archaea from a genome
https://cultivarium.org/
MIT License

No entry for file ending in '_protein.faa.gz' #8

Open TBacchetta opened 2 months ago

TBacchetta commented 2 months ago

Hi

I've tried to follow all the steps of the tutorial, but I'm having trouble downloading genomes from GenBank. The genbank_accessions.txt file created during the previous step contains GCAXXX IDs. Since most of these genomes are also in RefSeq (with GCF accession numbers), it seems that the GCA versions of the .faa files are no longer available for several genomes (they probably only kept the GCF version). This is why I get lots of error messages. To work around this, I'm trying to replace GCA with GCF in the accessions file so that I download from RefSeq instead of GenBank, and then I'll replace GCF with GCA in all the downloaded files. I hope it's going to work this way. The download from RefSeq is running and I've had no issues so far.

Do you think the pipeline is going to work if I do this?

[Screenshot from 2024-04-03 09-49-31]
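For readers who want to try the same workaround, here is a minimal sketch of the prefix swap described above. It is not part of GenomeSPOT; the function names `swap_prefix` and `convert_lines` are made up for illustration. It assumes the accessions file has one accession per line in the usual NCBI form (for example `GCA_000005845.2`). It also assumes the GCF counterpart has the same digits and version as the GCA record, which is common but not guaranteed.

```python
import re

# NCBI assembly accession: prefix, nine digits, a dot, and a version number.
ACCESSION_RE = re.compile(r"^(GCA|GCF)_(\d{9})\.(\d+)$")


def swap_prefix(accession: str, target: str) -> str:
    """Return the accession with its prefix replaced by `target` (GCA or GCF).

    Assumption: the paired record shares the same digits and version.
    """
    if target not in ("GCA", "GCF"):
        raise ValueError(f"target must be 'GCA' or 'GCF', got {target!r}")
    match = ACCESSION_RE.match(accession.strip())
    if match is None:
        raise ValueError(f"not an assembly accession: {accession!r}")
    return f"{target}_{match.group(2)}.{match.group(3)}"


def convert_lines(lines, target):
    """Swap the prefix on every non-blank line of an accessions file."""
    return [swap_prefix(line, target) for line in lines if line.strip()]


if __name__ == "__main__":
    print(convert_lines(["GCA_000005845.2\n", "\n"], "GCF"))
```

Raising on malformed lines, rather than skipping them, makes a stray header or truncated ID visible before a long download starts.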

tylerbarnum commented 2 months ago

I believe I encountered the same issue with downloading genomes from Genbank and simply assumed the database was incomplete.

If the files all start with "GCA_", there certainly won't be an issue. But I don't believe you need to do that. The code should work so long as the protein and DNA files for the same genome have the same accession. Let me know if the behavior is otherwise.
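One way to check the "same accession" condition before running the pipeline is sketched below. This is an illustrative helper, not GenomeSPOT code. The `_protein.faa.gz` suffix comes from the error in the issue title; the `_genomic.fna.gz` suffix for the DNA file and the function names are assumptions based on the usual NCBI download naming.

```python
import re
from collections import defaultdict

# Leading assembly accession of an NCBI-style file name, e.g. GCA_000005845.2
ACCESSION_PREFIX = re.compile(r"^(GC[AF]_\d{9}\.\d+)")


def group_by_accession(filenames):
    """Record, per accession, whether a protein and a genomic file were seen."""
    groups = defaultdict(lambda: {"protein": False, "genomic": False})
    for name in filenames:
        match = ACCESSION_PREFIX.match(name)
        if match is None:
            continue  # not an assembly file; ignore it
        accession = match.group(1)
        if name.endswith("_protein.faa.gz"):
            groups[accession]["protein"] = True
        elif name.endswith("_genomic.fna.gz") and not name.endswith(
            "_from_genomic.fna.gz"
        ):
            # Skip derived files such as *_cds_from_genomic.fna.gz.
            groups[accession]["genomic"] = True
    return dict(groups)


def unpaired(filenames):
    """Return accessions that lack either the protein or the genomic file."""
    return sorted(
        accession
        for accession, seen in group_by_accession(filenames).items()
        if not (seen["protein"] and seen["genomic"])
    )


if __name__ == "__main__":
    import os
    import sys

    # Usage: python check_pairs.py <download_directory>
    print(unpaired(os.listdir(sys.argv[1])))
```

An empty list means every genome has both files under one accession; any accession printed is a candidate for the "No entry for file ending in '_protein.faa.gz'" error.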

TBacchetta commented 2 months ago

I started from scratch by downloading the GCAs and it does indeed work.