antonisdim / haystac

Code repository for the HAYSTAC pipeline
MIT License

Connection refused issue #13

Closed npsonis closed 2 years ago

npsonis commented 2 years ago

Hi, I get the following when trying to build a database:

[bgzip] Could not open ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/430/045/GCF_000430045.1_ASM43004v1/GCF_000430045.1_ASM43004v1_genomic.fna.gz: Connection refused
haystac: error: Unable to download assembly ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/430/045/GCF_000430045.1_ASM43004v1/GCF_000430045.1_ASM43004v1_genomic.fna.gz
None

Any thoughts?

Also, the step "Job 4: Splitting the representative RefSeq table in smaller tables." takes too long every time I need to build a different database. Since the result is the same each time, is there a way to run it once, store the smaller tables locally, and reuse them in subsequent database builds?

Finally, please consider adding a new argument that lists, at the end, how many and which genomes are going to be downloaded, without actually downloading anything. This would help evaluate whether a taxon is missing.

antonisdim commented 2 years ago

Hello,

I hope you are doing great, and apologies for the late response!

I tried to reproduce your error using the same RefSeq assembly accession as above, but in my case haystac database ran to completion and no errors were raised by bgzip or haystac. My guess is that the NCBI servers refused the connection for a short period of time, and that is why the error was raised. Even if the database-building step fails because of that, it will automatically resume from that step onwards if you re-run the exact same haystac command. If the error still persists on your end, please let me know!
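For transient refusals like this, the re-run can be automated. Below is a minimal sketch of a retry wrapper, not part of haystac: `run_with_retries` and its parameters are hypothetical names, and it only assumes (as stated above) that re-running the exact same command resumes from the failed step and that a failed run exits with a non-zero status.

```python
import subprocess
import time


def run_with_retries(cmd, attempts=5, delay=60.0,
                     runner=subprocess.run, sleep=time.sleep):
    """Run cmd until it exits with status 0 or the attempts run out.

    cmd is an argument list, e.g. the exact haystac command you ran.
    Returns the exit status of the last attempt.
    """
    returncode = 1
    for attempt in range(1, attempts + 1):
        returncode = runner(cmd).returncode
        if returncode == 0:
            break
        if attempt < attempts:
            # Linear back-off: wait a little longer after each failure.
            sleep(delay * attempt)
    return returncode
```

Pass the command as a list of arguments, exactly as you typed it on the command line. If the refusals come from a firewall rather than a temporary NCBI outage, retrying will not help.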

Regarding the step where haystac splits the RefSeq representative table into smaller ones, it takes so long because haystac is checking whether the plasmids listed in the RefSeq representative table are included in the assembly files linked to the respective bacterial chromosomes, so that their sequences don't get downloaded more than once. In most cases plasmids are included in the assembly files, but there are species for which the plasmid sequences need to be downloaded independently. Unfortunately this step is an iterative process, which is why it can take so long.
The smaller tables are indeed stored locally under your database output directory (the relative path is database_out_dir/entrez/). A workaround would be to copy the RefSeq-related files from that directory to another database directory, again under the respective entrez subdirectory (e.g. database_out_dir_2/entrez/). Please bear in mind that you would be doing this at your own risk and we advise against it: under the hood haystac uses snakemake, which checks the timestamps of each input and output file that haystac uses or creates, so it might end up repeating that step anyway. A safer approach, if you would like to build a fresh database, would be to use the --accessions-file flag with a unique species list (and their respective accessions) generated from the db_taxa_accessions.tsv file that can be found under the database output directory (database_out_dir/db_taxa_accessions.tsv).
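Generating the unique species list could look roughly like the sketch below. It assumes that db_taxa_accessions.tsv is tab-delimited with the species name in the first column and the accession in the second, and no header row; please check your own file before relying on it. The function name `unique_species_rows` is made up for this example.

```python
import csv


def unique_species_rows(in_path, out_path):
    """Copy in_path to out_path, keeping the first row seen per species.

    Assumed layout (verify against your file): tab-delimited,
    column 1 = species, column 2 = accession, no header.
    Returns the number of rows written.
    """
    seen = set()
    kept = 0
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.reader(src, delimiter="\t")
        writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
        for row in reader:
            if len(row) < 2:
                continue  # skip blank or malformed lines
            species, accession = row[0], row[1]
            if species in seen:
                continue  # keep only the first accession per species
            seen.add(species)
            writer.writerow([species, accession])
            kept += 1
    return kept
```

The output file can then be given to --accessions-file when building the new database.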

haystac is still under active development and we always welcome suggestions from users! Yes, an extra step like that sounds helpful, and we'd be happy to implement it in a future version of the package.

Please do let me know if any of the above is unclear, and thank you for your suggestion!

Best, Antony

npsonis commented 2 years ago

Thanks Antony,

the connection error persists, but I think it is due to my institute's firewall. Is there a way to work around it on my own? If not, and I just have to ask my admin, then you may close this issue.

Regarding the second point, perhaps an option could be added to skip the plasmid check, to save time for users who do not care about plasmids.

Best,

Nikos

antonisdim commented 2 years ago

Hello Nikos,

I hope you are well!

The best I can think of is to download the sequence you need manually from NCBI and then point to its location using the --sequences-file flag.
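A rough sketch of the manual download is shown below. It rests on two assumptions that you should verify: that your firewall allows HTTPS even though it blocks FTP (NCBI serves the same paths at https://ftp.ncbi.nlm.nih.gov), and that the file given to --sequences-file is tab-delimited with species name, accession and FASTA path on each line (please check the haystac documentation for the exact layout). The helper names are made up for this example.

```python
import shutil
import urllib.request
from urllib.parse import urlsplit, urlunsplit


def ftp_to_https(url):
    """Rewrite an ftp:// URL to https:// with the same host and path."""
    parts = urlsplit(url)
    if parts.scheme != "ftp":
        return url
    return urlunsplit(("https", parts.netloc, parts.path,
                       parts.query, parts.fragment))


def download(url, dest):
    """Stream url (rewritten to HTTPS if it is FTP) to the local file dest."""
    with urllib.request.urlopen(ftp_to_https(url)) as resp, open(dest, "wb") as out:
        shutil.copyfileobj(resp, out)
    return dest


def sequences_row(species, accession, fasta_path):
    """Format one line of the assumed layout: species, accession, path."""
    return "\t".join([species, accession, fasta_path]) + "\n"
```

Call `download` with the URL from the error message and a local file name, then write one `sequences_row(...)` line per downloaded genome into a text file and pass that file to --sequences-file.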

That's a great suggestion, actually, and we'll definitely try our best to incorporate it in a future version of haystac. Again, thank you for your patience, your suggestions and of course for using haystac.

Please do let me know if you run into any other issues!

Best, Antony