DerrickWood / kraken2

The second version of the Kraken taxonomic sequence classification system
MIT License
727 stars 273 forks source link

lookup unmapped accession numbers through NCBI eutils #22

Open ghbore opened 6 years ago

ghbore commented 6 years ago

Hi. I have encountered hundreds of unmapped accessions, for examples NZ_LS483329, when building database from virus, bacteria, archaea, fungi, protozoa and human reference genomes, with est, gb, gss and wgs accession2taxid. In fact these accessions exists. Is it possible to implement extra accession mapping through NCBI eutils in lookup_accession_numbers.pl? I have tried that, and it works.

thbtmntgn commented 6 years ago

Hi,

I have the same problem, I think the solution is to include the dead_nucl and dead_wgs accession2taxid files from ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/ when taxonomy is downloaded.

When I'm doing that, it increase the time of lookup_accession_numbers.pl script. @ghbore do you have the same problem? (see #14

ghbore commented 6 years ago

Hi, @thbtmntgn. I just checked the example accession number NZ_LS483329 in dead_nucl and dead_wgs, but did not find it there. I have no idea why this accession exists in NCBI website and EUtils feedback, while absent from archive accession2taxid.

I didn't include dead_* yet, but I can imagine the linear increase on the lookup time as the subject accession db increased.

DerrickWood commented 6 years ago

Hi @ghbore - I might be willing to add the EUtils stuff, but I've always been a little wary of it due to the high number of requests that apps like Kraken would need to make in the general case, and NCBI's (somewhat understandable) banning of IPs that hammer their servers. If you can supply me with a bit of example code to do the lookup, I'll work with it, and run it by some people at NCBI to see if they can be sure it's a safe thing to run as part of the Kraken code distribution.

ghbore commented 6 years ago

Thank you for your time. The following is my code, accepting unmapped.txt as input.

#! /usr/bin/env perl
use strict;
use warnings;
use XML::Simple;
use LWP::Simple;
use Data::Dumper;

use constant EUTILS => 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils';
use constant ESEARCH => EUTILS . '/esearch.fcgi';
use constant ESUMMARY => EUTILS . '/esummary.fcgi';

sub esearch ($$){
  my $rst;
  1 while !defined($rst = get(sprintf("%s?db=%s&term=%s", ESEARCH, $_[0], $_[1])));
  $rst;
}
sub esummary ($$){
  my $rst;
  1 while !defined($rst = get(sprintf("%s?db=%s&id=%s", ESUMMARY, $_[0], $_[1])));
  $rst;
}

sub getGiByAcc ($){
  my $rst = esearch('nuccore', $_[0]);
  die "no record for $_[0]\n" unless defined $rst;
  my $xml = XMLin($rst, ForceArray=>['Id']);
  if ($xml->{Count} == 0){ return; } 
  else {return @{$xml->{IdList}{Id}}; } 
}

sub getInfoById ($){
  my $rst = esummary('nuccore', $_[0]);
  die "no record for $_[0]\n" unless defined $rst;
  my $xml = XMLin($rst, KeyAttr => {Item=>'Name'});
  my $item = $xml->{DocSum}{Item};
  (accession => $item->{Caption}{content},
    version => $item->{AccessionVersion}{content},
    taxid => $item->{TaxId}{content},
    gi => $xml->{DocSum}{Id});
}

while (<>){
  chomp;
  my %h;
  for my $gi (getGiByAcc $_){
    %h = getInfoById $gi;
    print join("\t", @h{qw/accession version taxid gi/}), "\n";
  }
}
thbtmntgn commented 6 years ago

Hi @ghbore

You're right, I tried to add dead*.accession2taxid files but I still have unmapped accession numbers. I'm looking forward to test your fix!

AzizHN commented 1 year ago

@ghbore @thbtmntgn @DerrickWood Did you manage it ? I am facing the same problem here, so many unmapped reads, I tried with fix_unmapped.py and I tried with the manual download of dead*.accession2taxid but no difference.

unholyparrot commented 11 months ago

I have the same problem, I think the solution is to include the dead_nucl and dead_wgs accession2taxid files from ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/ when taxonomy is downloaded.

That worked even 5 years later, many thx!

ksavhughes commented 11 months ago

I've used @ghbore's fix before and added the dead_nucl and dead_wgs accession2taxid files to the taxonomy folder. Usually a combination of the two works, but this time around it didn't. I've never had much luck with fix_unmapped.py. I ended up using the efetch Entrez-Direct command to get all my unmapped accessions fixed.

efetch -db nuccore -input unmapped.txt -format docsum | xtract -pattern DocumentSummary -element Caption AccessionVersion TaxId Gi >> added.accession2taxid