VEuPathDB / ApiCommonData

Apache License 2.0
3 stars 3 forks source link

Species reconciliation - bug in common ancestor code #48

Closed bobular closed 4 months ago

bobular commented 6 months ago

Where B and C both have common ancestor A, the algorithm in ./Load/bin/reconcilePopBioSpecies.pl returns A.

But when you ask for the common ancestor of A and B (in that same hypothetical tree) it returns the parent of A, let's call it Z. It should be A.

Note - when this is fixed we should do a reload of all studies with multiple species assays per sample. This is not a simple EDA query to figure out (it would require a 'count_of_children' derived variable). "Pyretophorus" is one of the offending reconciliation outputs, it is present in VBP0000004 and VBP0000005. Aha... these reconciliation outputs have species qualifier = ambiguous - filtering on that gives us the studies we should reload!

2007-UCDavis-UCLA            VBP0000005    UC Davis/UCLA population dataset
2009-Costantini-Burkina-Faso VBP0000006    Living at the edge: biogeographic patterns of habitat segregation conform to speciation by niche expansion in Anopheles gambiae
2012-Donnelly-M-S-IR         VBP0000004    Genotyping and permethrin-resistance phenotyping of Anopheles gambiae M and S form mosquitoes from Cameroon, Ghana, Uganda and Guinea-Bissau
2012-conn-peru               VBP0000835    Molecular Taxonomy of Anopheles (Nyssorhynchus) benarrochi (Diptera: Culicidae) and Malaria Epidemiology in Southern Amazonian Peru
2016-indian-icemr            VBP0000162    Increasing the potential for malaria elimination by targeting zoophilic vectors
Ag1000G_AR3                  VBP0000163    Natural diversity of the malaria vector Anopheles gambiae
bobular commented 5 months ago

I went to diagnose the Perl script and thought it would be wise to add a new test to the crude test suite that is part of the script. But when I ran the existing tests, the first "common ancestor" test seemed to get stuck in an infinite loop. I then ran the offending SQL in sqldeveloper and, not surprisingly, it hung again (and I have to ask Ana to kill it for me, which is annoying).

https://github.com/VEuPathDB/ApiCommonData/blob/5ab42d0f01ee478ed3aa5ec90eee83ad938746d5/Load/bin/reconcilePopBioSpecies.pl#L390-L405

with r1(subject_term_id, object_term_id, query_term_id, lvl, external_database_release_id) as (
  select subject_term_id, object_term_id, subject_term_id as query_term_id, 1 as lvl, external_database_release_id
  from sres.ontologyrelationship r
  where subject_term_id in (4362, 4288) and external_database_release_id = 1394
  union all
  select r2.subject_term_id, r2.object_term_id, query_term_id, lvl+1, r2.external_database_release_id
  from sres.ontologyrelationship r2, r1
  where r2.subject_term_id = r1.object_term_id
    and r2.subject_term_id != r2.object_term_id -- prevents circular issues
    and r2.external_database_release_id = r1.external_database_release_id
)
select object_term_id
from r1
group by object_term_id
having count(distinct query_term_id) = 2
order by avg(lvl);

Here are some more details if you would like to play around yourselves: gus.config has dbiDsn=dbi:Oracle:eda-inc

The commandline to run the tests (don't worry, it won't write anything to the db):

./Load/bin/reconcilePopBioSpecies.pl --test --extDbRlsSpec "ISATab_VBP0000005_RSRC|2023-08-11" --veupathOntologySpec "OntologyTerm_popbio_taxonomy_RSRC|dontcare" 
bobular commented 5 months ago

My first thought was that there could be circular paths (not just the subject=object condition already tested for) so I adapted it to avoid visiting any term ID more than once

WARNING THIS WILL HANG TOO

with r1(subject_term_id, object_term_id, query_term_id, path, path_length, external_database_release_id) as (
  select subject_term_id, object_term_id, subject_term_id as query_term_id, ',' || subject_term_id || ',' as path, 1 as path_length, external_database_release_id
  from sres.ontologyrelationship r
  where subject_term_id in (4362, 4288) and external_database_release_id = 1394
  union all
  select r2.subject_term_id, r2.object_term_id, query_term_id, r1.path || r2.subject_term_id || ',', r1.path_length + 1, r2.external_database_release_id
  from sres.ontologyrelationship r2, r1
  where r2.subject_term_id = r1.object_term_id
    and r2.external_database_release_id = r1.external_database_release_id
    and instr(r1.path, ',' || r2.subject_term_id || ',') = 0 -- make sure no nodes are visited twice
)
select object_term_id
from r1
group by object_term_id
having count(distinct query_term_id) = 2
order by avg(path_length);

But interestingly, running it without the order by is enlightening. This won't hang if you run it in sqldeveloper. It will let you page through the results.

with r1(subject_term_id, object_term_id, query_term_id, path, path_length, external_database_release_id) as (
  select subject_term_id, object_term_id, subject_term_id as query_term_id, ',' || subject_term_id || ',' as path, 1 as path_length, external_database_release_id
  from sres.ontologyrelationship r
  where subject_term_id in (4362, 4288) and external_database_release_id = 1394
  union all
  select r2.subject_term_id, r2.object_term_id, query_term_id, r1.path || r2.subject_term_id || ',', r1.path_length + 1, r2.external_database_release_id
  from sres.ontologyrelationship r2, r1
  where r2.subject_term_id = r1.object_term_id
    and r2.external_database_release_id = r1.external_database_release_id
    and instr(r1.path, ',' || r2.subject_term_id || ',') = 0 -- make sure no nodes are visited twice
)
select object_term_id, query_term_id, path_length, path
from r1;
bobular commented 5 months ago

To start with, there are five duplicates of every relationship with path length 1

object_term_id, query_term_id, path_length, path
--
4250    4288    1   ,4288,
4250    4288    1   ,4288,
4250    4288    1   ,4288,
4250    4288    1   ,4288,
4250    4288    1   ,4288,
4092    4362    1   ,4362,
4092    4362    1   ,4362,
4092    4362    1   ,4362,
4092    4362    1   ,4362,
4092    4362    1   ,4362,

Then there are 25 rows of each of the path length 2 paths

4249    4288    2   ,4288,4250,
4249    4288    2   ,4288,4250,
4249    4288    2   ,4288,4250,
4249    4288    2   ,4288,4250,
4249    4288    2   ,4288,4250,
4249    4288    2   ,4288,4250,
4249    4288    2   ,4288,4250,
4249    4288    2   ,4288,4250,
4249    4288    2   ,4288,4250,
4249    4288    2   ,4288,4250,
4249    4288    2   ,4288,4250,
4249    4288    2   ,4288,4250,
4249    4288    2   ,4288,4250,
4249    4288    2   ,4288,4250,
4249    4288    2   ,4288,4250,
4249    4288    2   ,4288,4250,
4249    4288    2   ,4288,4250,
4249    4288    2   ,4288,4250,
4249    4288    2   ,4288,4250,
4249    4288    2   ,4288,4250,
4249    4288    2   ,4288,4250,
4249    4288    2   ,4288,4250,
4249    4288    2   ,4288,4250,
4249    4288    2   ,4288,4250,

And then 125 of the path length 3 paths, 625 of path length 4 and so on

So it's probably not getting into a loop but it's ballooning and getting out of hand (and doesn't complete in 30 minutes at least).

I suspect that the ontologyrelationship table has been populated 5 times. When I wrote the code, there was only one copy and it worked fine. If we have to reload the PopBio data or load a new dataset with multiple species assays per sample, the reconcile species script will hang and the workflow will stop.

While writing this I figured the solution is to make a temporary copy of sres.ontologyrelationship filtered for external_database_release_id and made distinct for all the columns we are using.

I'm not a SQL wizard (and definitely not Oracle) - please could I have some help coding this?