DaehwanKimLab / centrifuge

Classifier for metagenomic sequences
GNU General Public License v3.0

Zika Virus is not in your p+h+v pre-made indices? AND Centrifuge-download does not work? #53

Open waywardsyintist opened 7 years ago

waywardsyintist commented 7 years ago

Hello,

I'm kind of at my wits' end with Centrifuge, as I've been trying to get it to work with my own database, and with the NCBI bacteria and virus data, for a long time now. To paraphrase Roseanne Roseannadanna, "It's always something..."

I recently gave it another go with your pre-made indices just to see if I could get it to run at all. Before running a bunch of my samples through, I used centrifuge-inspect and grep to check whether all of my target organisms were indeed in the database:

$ centrifuge-inspect --name-table p+h+v > nametable.txt
$ grep "Zika" nametable.txt
$

From what I can tell, Zika Virus is not in the p+h+v (the pre-made bacteria, viruses, archaea, human index listed on the right margin of your website)? All of my other target organisms (Human papillomavirus Type 132 and Variola virus, for example) are included in this index.

$ grep "Human papillomavirus type 132" nametable.txt
909331 Human papillomavirus type 132

$ grep "Variola" nametable.txt
10255 Variola virus
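The per-organism checks above could be wrapped in a small loop. This is only a sketch: `check_targets` is a hypothetical helper (not part of Centrifuge), and it assumes the name table was already written to a file with `centrifuge-inspect --name-table`, as in the commands above.

```shell
#!/usr/bin/env bash
# Report which target organisms appear in a Centrifuge name table.
# check_targets is a hypothetical helper, not a Centrifuge command.
# Usage: check_targets <name-table-file> <organism name>...
check_targets() {
    local table="$1"
    shift
    local name
    for name in "$@"; do
        # -F: fixed string, -i: ignore case, -q: only set the exit status
        if grep -Fiq -- "$name" "$table"; then
            printf 'found\t%s\n' "$name"
        else
            printf 'MISSING\t%s\n' "$name"
        fi
    done
}

# Example call (uncomment once nametable.txt exists):
# check_targets nametable.txt "Zika" "Human papillomavirus type 132" "Variola"
```

The match is a case-insensitive substring match, so "Variola" also matches "Variola virus"; a stricter check would compare the full name column.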

ALSO...

Since Zika did not seem to be included, I tried using centrifuge-download again, but I get an error. The connection to NCBI's ftp site seems to be blocked or otherwise not good. Below is the error I get...

$ centrifuge-download -o taxonomy taxonomy
Downloading NCBI taxonomy ...
rsync: failed to connect to ftp.ncbi.nih.gov (130.14.250.7): Connection refused (111)
rsync: failed to connect to ftp.ncbi.nih.gov (2607:f220:41e:250::13): Network is unreachable (101)
rsync error: error in socket IO (code 10) at clientserver.c(128) [Receiver=3.1.0]

I sent an email to NCBI describing what I was trying to do and asking whether there was an issue on their end or maybe my corporate firewall was the problem. Here is their response...

Hi,

Thanks for writing to us.

The issue is mostly in the http protocol used by the tool. With the switching to HTTPS late last year, NCBI also requires that http access to our ftp site be switched to HTTPS. You will need to contact the Centrifuge code provider for them to update their code to use HTTPS protocol instead.

A minor issue is the ftp.ncbi.nih.gov domain. Even though it may still work for historical reasons, it may not. The domain should be fully specified with .nlm included, aka ftp.ncbi.nlm.nih.gov

Regards,

Tao Tao, PhD NCBI User Services

I dove into the centrifuge-download script to see if I could manually update the web address that the script points to. There was only one place where the address was listed without the '.nlm' in it, and that was line 194. I added the '.nlm' to the address on that line, saved, re-compiled, and re-ran, but I got the same error. I didn't see any references to http or https in the centrifuge-download source code.

Also, where does one manually retrieve the names.dmp and nodes.dmp files from NCBI? Weren't those files phased out when they updated to the new format without GI numbers?
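For reference, the address change described in the NCBI reply (HTTPS plus the full `.nlm` domain) can be sketched as a small rewrite function. `normalize_ncbi_url` is a hypothetical helper, not something in centrifuge-download, and the example path assumes the taxonomy dump archive `pub/taxonomy/taxdump.tar.gz`, which to my knowledge is where NCBI publishes names.dmp and nodes.dmp.

```shell
#!/usr/bin/env bash
# Rewrite an old-style NCBI address to the fully specified HTTPS form.
# normalize_ncbi_url is a hypothetical helper for illustration only.
normalize_ncbi_url() {
    printf '%s\n' "$1" | sed -E \
        -e 's#^(ftp|http)://#https://#' \
        -e 's#^https://ftp\.ncbi\.nih\.gov#https://ftp.ncbi.nlm.nih.gov#'
}

# Prints the rewritten address of the taxonomy dump (assumed path).
normalize_ncbi_url "ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz"

# A manual download could then look like this (needs network access):
# curl -fsSLO "$(normalize_ncbi_url "ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz")"
# tar -xzf taxdump.tar.gz names.dmp nodes.dmp
```

This only helps tools that speak HTTPS, such as curl or wget; rsync uses its own protocol, so a firewall that blocks it would still need a different fix.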

Any help ironing out these problems would be much appreciated.

Thank you.

fbreitwieser commented 7 years ago

It seems there is currently no complete Zika genome in RefSeq - I found that very surprising, too.

Look at https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi and https://www.ncbi.nlm.nih.gov/assembly/?term=txid64320[Organism:noexp] . Since we only take the latest complete genome, it didn't find its way into the database. I think it is a mistake that the assembly is flagged as a 'Scaffold'-level assembly - there is only one scaffold, and it replaced an assembly that was flagged as complete.

I will look into the latter issue of downloading the RefSeq data. That won't fix the missing Zika genome, though - RefSeq has to be updated for that. In the meantime, you could add the Zika virus reference genome to your sequences, and add one entry (NC_012532<tab>64320) to the map file provided to centrifuge-build via the --conversion-table argument.
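As a rough sketch of that workaround: the entry is a single tab-separated line of sequence ID and taxonomy ID. `add_mapping` is a hypothetical helper, and the file names (`seqid2taxid.map`, `sequences.fna`, `custom_index`) are placeholders. The `--taxonomy-tree` and `--name-table` options in the commented command are quoted from memory of the Centrifuge manual, so check them against your installed version.

```shell
#!/usr/bin/env bash
# Append one "sequence ID <tab> taxonomy ID" line to a conversion table.
# add_mapping is a hypothetical helper; it skips lines that already exist.
add_mapping() {
    local map_file="$1" seq_id="$2" tax_id="$3"
    local entry
    entry=$(printf '%s\t%s' "$seq_id" "$tax_id")
    # -F: fixed string, -x: match the whole line, -q: only set the exit status
    if ! grep -Fxq -- "$entry" "$map_file" 2>/dev/null; then
        printf '%s\n' "$entry" >> "$map_file"
    fi
}

# Example (file names are assumptions):
# add_mapping seqid2taxid.map NC_012532 64320
# centrifuge-build -p 4 --conversion-table seqid2taxid.map \
#     --taxonomy-tree taxonomy/nodes.dmp --name-table taxonomy/names.dmp \
#     sequences.fna custom_index
```

The sequence ID in the map file has to match the ID in the FASTA header of the genome you add; if the header carries a version suffix, the entry may need it too.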

Also I'll work on providing a Makefile target for a database that includes viral strains from the NCBI viral genome resource.

fbreitwieser commented 7 years ago

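Fixed now. Couple of points:

- Consider installing rsync for faster downloads. The downloads failed because the script falls back to curl/wget when rsync is not installed, and those did not have the address updated from ftp to https.
- I added several more database targets to the Makefile, including one with only viruses (v) and prokaryotes (p) or the combination (p+v). Try

make THREADS=10 v

etc.

I'll re-build the standard database next week with all viral genomes.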

waywardsyintist commented 7 years ago

Hello,

Re-installed centrifuge, and installed rsync.

When trying to make the p+v index, I got the following error...

jrussellmac:indices jrussell$ make THREADS=4 p+v DONT_DUSTMASK=1
Making: p+v: p+v
/Library/Developer/CommandLineTools/usr/bin/make -f Makefile IDX_NAME=p+v
[[ -d tmp_p+v ]] && rm -rf tmp_p+v; mkdir -p tmp_p+v
Downloading and dust-masking archaea
centrifuge-download -o tmp_p+v -d "archaea" -P 4 refseq > \
    tmp_p+v/all-archaea.map
Downloading ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/archaea/assembly_summary.txt ...
rsync: failed to connect to ftp.ncbi.nlm.nih.gov: No route to host (65)
rsync error: error in socket IO (code 10) at clientserver.c(122) [Receiver=3.0.7]
rsync Download failed! Have a look at valid domains at ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq .
make[1]: [reference-sequences/all-archaea.fna] Error 1
make: [p+v] Error 2

Also tried 'make THREADS=4 v'. Error is below...

jrussellmac:indices jrussell$ make THREADS=4 v DONT_DUSTMASK=1
Making: v: v
/Library/Developer/CommandLineTools/usr/bin/make -f Makefile IDX_NAME=v
[[ -d tmp_v ]] && rm -rf tmp_v; mkdir -p tmp_v
Downloading and dust-masking viral-any_level
centrifuge-download -o tmp_v -d "viral-any_level" -P 4 refseq > \
    tmp_v/all-viral-any_level.map
viral-any_level is not a valid domain - use one of the following:
archaea bacteria fungi invertebrate plant protozoa unknown vertebrate_mammalian vertebrate_other viral
make[1]: [reference-sequences/all-viral-any_level.fna] Error 1
make: [v] Error 2

It seems like NCBI doesn't like the way things are named in the Makefile? I tried changing names a bit but got nowhere.

Any insight much appreciated.

Thanks.

waywardsyintist commented 7 years ago

Hello,

Thank you for the updates. Do the new p+v indices include Zika?

What if I have my own custom reference FASTA, but not the other files? Is there a way to generate the other files needed (conversion table, taxonomy tree, name table) from a custom reference FASTA using NCBI software or samtools?

I'm still running into the same downloading error when trying to 'make p+v'. I.e., NCBI doesn't like the link. I double checked that I do have rsync installed.

Thanks, Joe


Joe Russell, Ph.D. www.waywardscientist.com

On Sat, Feb 11, 2017 at 6:32 PM, Florian Breitwieser <notifications@github.com> wrote:

Fixed now. Couple of points:

- Consider installing rsync for faster downloads. The downloads failed because the script falls back to curl/wget when rsync is not installed, and those did not have the address updated from ftp to https.
- I added several more database targets to the Makefile, including one with only viruses (v) and prokaryotes (p) or the combination (p+v). Try

make THREADS=10 v

etc.

I'll re-build the standard database next week with all viral genomes.
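Since the download path depends on which tool is installed, a quick check before running make can show which one would be picked up. This is a sketch only: `pick_downloader` is a hypothetical helper, and the order rsync, curl, wget is an assumption based on the fallback described above, not a reading of the centrifuge-download source.

```shell
#!/usr/bin/env bash
# Print the first available download tool, trying rsync, then curl, then wget.
# pick_downloader is a hypothetical helper; the order is an assumption.
pick_downloader() {
    local tool
    for tool in rsync curl wget; do
        if command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
            return 0
        fi
    done
    printf 'none\n'
    return 1
}

pick_downloader || echo "Install rsync, curl, or wget before running make" >&2
```

Finding rsync on the PATH does not mean it can reach NCBI; a "Connection refused" or "No route to host" error from rsync usually points at the network or a firewall rather than at a missing tool.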
