bioperl / bioperl-db

BioPerl BioSQL ORM
http://bioperl.org
Other
9 stars 12 forks source link

crashes on downloading NCBI records #3

Open cjfields opened 9 years ago

cjfields commented 9 years ago

Author Name: Dmitry Samborskiy (Dmitry Samborskiy) Original Redmine Issue: 2213, https://redmine.open-bio.org/issues/2213 Original Date: 2007-02-20 Original Assignee: Bioperl Guts


Hi All,

I tried to use modern BioPerl-DB code for storing NCBI virus genome records in MySQL database (I used to use BioPerl-1.0 before).

I prepare a fresh BioSQL database with loaded NCBI taxonomy database. Then I try to store some genomes.

To my surprise it crashes with following diagnostic:

Could not store AY593801: ——————- EXCEPTION ——————- MSG: The supplied lineage does not start near ‘Foot-and-mouth disease virus - type A’ (I was supplied ‘Foot-and-mouth disease virus | Aphthovirus | Picornaviridae’) STACK Bio::Species::classification /tmp/perl/lib/perl5/site_perl/5.8.6/Bio/Species.pm:174 STACK Bio::DB::Persistent::PersistentObject::AUTOLOAD /tmp/perl/lib/perl5/site_perl/5.8.6/Bio/DB/Persistent/PersistentObject.pm:552 STACK Bio::DB::BioSQL::SpeciesAdaptor::populate_from_row /tmp/perl/lib/perl5/site_perl/5.8.6/Bio/DB/BioSQL/SpeciesAdaptor.pm:281 STACK Bio::DB::BioSQL::BasePersistenceAdaptor::_build_object /tmp/perl/lib/perl5/site_perl/5.8.6/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:1305 STACK Bio::DB::BioSQL::BasePersistenceAdaptor::_find_by_unique_key /tmp/perl/lib/perl5/site_perl/5.8.6/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:973 STACK Bio::DB::BioSQL::BasePersistenceAdaptor::find_by_unique_key /tmp/perl/lib/perl5/site_perl/5.8.6/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:852 STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /tmp/perl/lib/perl5/site_perl/5.8.6/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:182 STACK Bio::DB::Persistent::PersistentObject::create /tmp/perl/lib/perl5/site_perl/5.8.6/Bio/DB/Persistent/PersistentObject.pm:244 STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /tmp/perl/lib/perl5/site_perl/5.8.6/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:169 STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store /tmp/perl/lib/perl5/site_perl/5.8.6/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:251 STACK Bio::DB::Persistent::PersistentObject::store /tmp/perl/lib/perl5/site_perl/5.8.6/Bio/DB/Persistent/PersistentObject.pm:271 STACK (eval) ./load_seqdatabase.pl:620 STACK toplevel ./load_seqdatabase.pl:602


The AY593801 genome was downloaded recently from NCBI site. In my script I used only standard scripts from BioPerl-DB/BioSQL packages that makes all the test steps (db creation/taxonomy loading/sequence loading /…). See the attachment archive.

It seems that the problem originates from the fact that AY593801’s taxonomy (taxon:12111) has a special form:

mysql> SELECT name.name, node.node_rank FROM taxon node, taxon taxon, taxon_name name WHERE name.taxon_id = node.taxon_id AND taxon.left_value BETWEEN node.left_value AND node.right_value AND taxon.taxon_id = 12111 AND name.name_class = ‘scientific name’ ORDER BY node.leftvalue; ——————————————————————-_—————-+ | name | noderank | ——————————————————————-—————-+ | root | no rank | | Viruses | no rank | | ssRNA positive-strand viruses, no DNA stage | no rank | | Picornaviridae | family | | Aphthovirus | genus | | Foot-and-mouth disease virus | species | | Foot-and-mouth disease virus - type A | no rank | ——————————————————————-_—————-+ 7 rows in set (0.30 sec)

One can see that “Foot-and-mouth disease virus” is a prefix of “Foot-and-mouth disease virus - type A”. I found that in all these cases BioPerl-DB crashes.

Apparently, it’s unexpected consequence of the “massaging” trick in the SpeciesAdaptor::populate_from_row method. Indeed, the problem has gone when I commented out the trick code and commented accordingly the species correction in the mysql::SpeciesAdaptorDriver::insert_object method. See my patch code:

—- Bio/DB/BioSQL/SpeciesAdaptor.pm.ORIG 2006-12-06 19:46:51.000000000 +0 300 *_+ Bio/DB/BioSQL/SpeciesAdaptor.pm 2007-02-20 23:08:44.000000000 +0300 @ -256,6 +256,8@ while($clf->[0]>[1] && ($clf>[0]->[1] eq “no rank”)) { shift(`$clf); } +# skip this corrections (see also mysql/SpeciesAdaptorDriver::insert_object): +=pod

in the species object we store the species element without the

        # genus, and similarly for the sub-species and variant
        for(my $i = scalar(`$clf)–2; $i >= 0; $i—) {

@ -278,6 +280,7@ $rank = $clf->[scalar(`$clf)-1]->[1]; }

done massaging, store away

+=cut $obj->classification([reverse(map { $->[0]; }`$clf)], “FORCE”); } if($rows->[4] && (! $obj->classification)) { —- Bio/DB/BioSQL/SpeciesAdaptorDriver.pm.ORIG 2006-12-06 19:46:51.000000000 +0 300 *+ Bio/DB/BioSQL/SpeciesAdaptorDriver.pm 2007-02-20 23:10:00.000000000 +0 300 @ -312,7 +312,8@ $clf[0]->[1] = “species”; $clf[1]->[1] = “genus”;

also, convention is species to equal the binomial, not just first name

Another one crash case occures with my old NC_001846 record which was downloaded few years ago (see NC_001846.seq file in the archive). In this case ORGANISM line doesn’t match NCBI taxonomy for the supplied taxon:11138 (it did correspond several years ago, though).

Perhaps, BioPerl-DB code is not intelligent enough to find and fix this discrepancy. And my patch does NOT help there.

Many thanks in advance.

Best regards, Dmitry Samborskiy

cjfields commented 9 years ago

Original Redmine Comment Author Name: Dmitry Samborskiy Original Date: 2007-02-20T16:04:15Z


Created an attachment (id=569) the archive with test.sh script and all needed data + proposed patch

cjfields commented 9 years ago

Original Redmine Comment Author Name: Chris Fields Original Date: 2007-02-20T16:40:27Z


This is related to bug 2092; we may mark this as a duplicate at a later point.

Have you tried using the CVS version of bioperl or just Bio::Species to see if it fixes this bug? The error actually comes from Bio::Species in bioperl core.

cjfields commented 9 years ago

Original Redmine Comment Author Name: Dmitry Samborskiy Original Date: 2007-02-21T09:01:09Z


(In reply to comment #2)

This is related to bug 2092; we may mark this as a duplicate at a later point.

I agree if it helps to solve the problem. Though, my bug report contains simple and reproducable test script.

Have you tried using the CVS version of bioperl or just Bio::Species to see if it fixes this bug?

I checked that the current CVS tree stores the same *Adaptor modules.

The error actually comes from Bio::Species in bioperl core.

I don’t understand it. In my opinion the problem originates from strange postprocessing tricks that are performed at storing/retrieval steps (in Bio/DB/BioSQL/SpeciesAdaptor.pm and Bio/DB/BioSQL/SpeciesAdaptorDriver.pm modules). These tricks create taxonomy records that are incompatible with stored NCBI taxonomy database. After applying my patch all the NCBI up-to-date genome records have been stored without crashes.

Another problem is crashe with really old NCBI records that use have ORGANISM lineage that contradicts to their NCBI /db_xref=taxon:####.

I don’t know how it could happen, but I have a lot of such records (NC_001846.seq is one of them - see the attachment).

May be it’s not difficult to make the BioPerl-DB code be compatible with these old-fashioned records as well (e.g. by fixing its taxonomy implicitly ?).

Many thanks in advance.

Regards, Dmitry

cjfields commented 9 years ago

Original Redmine Comment Author Name: Chris Fields Original Date: 2007-02-21T11:24:57Z


The postprocessing in SpeciesAdaptor does mess things up in some cases. The issue is directly related to recent changes in Bio::Species and and could be taken care of by simply not running any postprocessing and foregoing the lineage checking altogether in Bio::Species::classification(), where the exception occurs.

However, I believe doing so may break functionality with older bioperl-db/BioSQL installations since data is stored based on the older Bio::Species system (single-name genus and species). Maybe Hilmar can comment?

cjfields commented 9 years ago

Original Redmine Comment Author Name: Dmitry Samborskiy Original Date: 2007-02-21T13:43:38Z


To summarize your point, for now there is no consensus in understanding where to fix the problem?

By the way, can anybody explain why the Bio::DB::BioSQL::mysql::SpeciesAdaptorDriver::insert_object has the substitution (commented by my patch):

$clf[0]>[0] = $obj>binomial();

As I understood it is leading to the term duplication which is then cured by “massaging” code at Bio::DB::BioSQL::SpeciesAdaptor::populate_from_row (which I also had to comment) ?

Sorry if I misunderstood code structure — I used to work with BioPerl-DB v.1.0 (which was much more simple…).

cjfields commented 9 years ago

Original Redmine Comment Author Name: Chris Fields Original Date: 2007-02-22T10:24:15Z


(In reply to comment #5)

To summarize your point, for now there is no consensus in understanding where to fix the problem?

By the way, can anybody explain why the Bio::DB::BioSQL::mysql::SpeciesAdaptorDriver::insert_object has the substitution (commented by my patch):

$clf[0]>[0] = $obj>binomial();

As I understood it is leading to the term duplication which is then cured by “massaging” code at Bio::DB::BioSQL::SpeciesAdaptor::populate_from_row (which I also had to comment) ?

Sorry if I misunderstood code structure — I used to work with BioPerl-DB v.1.0 (which was much more simple…).

First, you must remember that rel. 1.5.2 is still considered a developer release. bioperl-db is also still considered under active development, though it should be stable enough for most purposes thanks to Hilmar.

The bug stems from significant changes re: Bio::Species in the latest bioperl release, outlined in the latest BioPerl release notes (sorry if the text wraps here):

http://www.bioperl.org/wiki/Core\_1.5.2\_new\_features\#Species.2C\_taxonomy\_overhaul

The issue raised by Sendu in bug 2092 is that the newer, simpler behavior of Bio::Species (outlined in the link above) often clashes with bioperl-db, which still expects the old behavior (single name genus/species designations) via SpeciesAdaptor. SpeciesAdaptor parses them from the lineage pulled from preloaded taxonomy (if present) and attempts to reset the Bio::Species::classification() array, which triggers the exception since it now expects to find the species name in the lineage; this single exception is the source of all the problems.

Several modifications have been made by Sendu and I to Bio::Species recently in CVS to allow for more flexibility and also uses warn() instead of throw() (so it should work if you update Bio::Species from CVS):

http://cvs.bioperl.org/cgi-bin/viewcvs/viewcvs.cgi/bioperl-live/Bio/Species.pm?cvsroot=bioperl

My inclination is to remove the Bio::Species exception altogether to maintain compliance with bioperl-db, and design an optional bioperl-db TaxonAdaptor (based on SpeciesAdaptor) which exhibits the newer behavior. How to go about adding this functioality in is another matter as I’m not a bioperl-db wiz…

cjfields commented 9 years ago

Original Redmine Comment Author Name: Dmitry Samborskiy Original Date: 2007-02-23T03:54:08Z


(In reply to comment #6)

First, you must remember that rel. 1.5.2 is still considered a developer release. bioperl-db is also still considered under active development, though it should be stable enough for most purposes thanks to Hilmar.

You are right. Indeed I assumed 1.5.2 to be a stable release, the “Bioperl 1.5.2 has been released!” message was so inviting (http://www.bioperl.org/wiki/Main\_Page) that I’ve overlooked “developer release” warning. It was also confusing that the last version marked as really stable was released almost 4 years ago. So, I decided that 1.5.2 is a milestone release which is ready to use…

Thank you for your explanation, I’ll try to read thoroughly the code and the threads you mentioned. I think for now I’d better to use my correction until the problem will be solved by BioPerl community. Then I could fix taxonomy records in our database if the final solution will differ from my solution.

Many thanks for your attention!

cjfields commented 9 years ago

Original Redmine Comment Author Name: Sendu Bala Original Date: 2007-02-23T04:01:18Z


(In reply to comment #7)

(In reply to comment #6)

First, you must remember that rel. 1.5.2 is still considered a developer release. bioperl-db is also still considered under active development, though it should be stable enough for most purposes thanks to Hilmar.

You are right. Indeed I assumed 1.5.2 to be a stable release, the “Bioperl 1.5.2 has been released!” message was so inviting (http://www.bioperl.org/wiki/Main\_Page) that I’ve overlooked “developer release” warning. It was also confusing that the last version marked as really stable was released almost 4 years ago. So, I decided that 1.5.2 is a milestone release which is ready to use…

It is ready to use. ‘Developer release’ in this context means only that the API of certain modules may change on short notice. It is the most stable release in terms of bugs. So, getting it right /before/ the next ‘stable’ release is very important, incase some API or behaviour change is needed to solve it well. Thank you for taking the time to report this bug.

cjfields commented 9 years ago

Original Redmine Comment Author Name: Chris Fields Original Date: 2007-09-03T21:46:14Z


I have retried this using an update bioperl-live and bioperl-db and can’t reproduce the error. Can you try an update?

cjfields commented 9 years ago

Original Redmine Comment Author Name: Dmitry Samborskiy Original Date: 2007-09-04T18:02:53Z


(In reply to comment #9)

I have retried this using an update bioperl-live and bioperl-db and can’t reproduce the error. Can you try an update?

I see. I’ve checked that it works due to changes in Bio::Species::classification() procedure. See lines 166-167 where the if() condition was changed (became less strict: ‘ne’ was replaced by ‘!~’).

Also $self->throw("The supplied lineage does not start near ‘$name … was turned into $self->warn("The supplied lineage does not start near ’$name … at line 175.

But I am not sure that this bug fix was correct (not just a casual consequence of if() condition improvements).

Also I see that the second test case (download of old NC_001846 record) still issues the same message (but now it’s a warning - not a crash, thanks to the ‘throw’ => ‘warn’ substitution made in line 175).

As I wrote before, in my opinion, the origin of the problem lies somewhere deeper: in the modern DB storage scheme used for taxonomy data (apparently, this scheme design led to necessity of “massaging” tricks in the SpeciesAdaptor::populate_from_row). I’m afraid that some logic fault could still trigger some problems later.

@ -163,7 +163,8@

make sure the lineage contains us as first or second element

(lineage may have subspecies, species, genus …)

my $name = $self->node_name;

cjfields commented 9 years ago

Original Redmine Comment Author Name: Dmitry Samborskiy Original Date: 2007-09-04T18:05:03Z


Sorry for the patch-like garbage at the end of my last post.

cjfields commented 9 years ago

Original Redmine Comment Author Name: Chris Fields Original Date: 2007-09-04T18:46:21Z


(In reply to comment #10)

(In reply to comment #9)

I have retried this using an update bioperl-live and bioperl-db and can’t reproduce the error. Can you try an update?

I see. I’ve checked that it works due to changes in Bio::Species::classification() procedure. See lines 166-167 where the if() condition was changed (became less strict: ‘ne’ was replaced by ‘!~’).

Also $self->throw(“The supplied lineage does not start near ’$name … was turned into $self->warn(”The supplied lineage does not start near ‘$name … at line 175.

But I am not sure that this bug fix was correct (not just a casual consequence of if() condition improvements).

Also I see that the second test case (download of old NC_001846 record) still issues the same message (but now it’s a warning - not a crash, thanks to the ’throw’ => ‘warn’ substitution made in line 175).

As I wrote before, in my opinion, the origin of the problem lies somewhere deeper: in the modern DB storage scheme used for taxonomy data (apparently, this scheme design led to necessity of “massaging” tricks in the SpeciesAdaptor::populate_from_row). I’m afraid that some logic fault could still trigger some problems later.

Right, I fully agree, particularly with the last bit. I still haven’t heard from any of the other devs on this (particularly Hilmar).

I may check into the other Bio* projects to see what they are doing re: BioSQL.

cjfields commented 9 years ago

Original Redmine Comment Author Name: Chris Fields Original Date: 2008-03-02T17:42:02Z


BioSQL is proceeding towards a 1.0 release; we can probably work on getting the bioperl-db bugs fixed after we clean core up.

cjfields commented 9 years ago

Original Redmine Comment Author Name: Chris Fields Original Date: 2008-11-29T15:34:38Z


Pushing to 1.6 bioperl-db point release.

cjfields commented 9 years ago

Original Redmine Comment Author Name: Chris Fields Original Date: 2008-11-29T15:37:33Z


This no longer crashes loading but still produces warnings.