Closed bobular closed 4 months ago
I went to diagnose the Perl script and thought it would be wise to add a new test to the crude test suite that is part of the script. But when I ran the existing tests, the first "common ancestor" test seemed to get stuck in an infinite loop. I then ran the offending SQL in sqldeveloper and, not surprisingly, it hung again (and I have to ask Ana to kill it for me, which is annoying).
with r1(subject_term_id, object_term_id, query_term_id, lvl, external_database_release_id) as (
select subject_term_id, object_term_id, subject_term_id as query_term_id, 1 as lvl, external_database_release_id
from sres.ontologyrelationship r
where subject_term_id in (4362, 4288) and external_database_release_id = 1394
union all
select r2.subject_term_id, r2.object_term_id, query_term_id, lvl+1, r2.external_database_release_id
from sres.ontologyrelationship r2, r1
where r2.subject_term_id = r1.object_term_id
and r2.subject_term_id != r2.object_term_id -- prevents circular issues
and r2.external_database_release_id = r1.external_database_release_id
)
select object_term_id
from r1
group by object_term_id
having count(distinct query_term_id) = 2
order by avg(lvl);
Here are some more details if you would like to play around yourselves:
gus.config has dbiDsn=dbi:Oracle:eda-inc
The commandline to run the tests (don't worry, it won't write anything to the db):
./Load/bin/reconcilePopBioSpecies.pl --test --extDbRlsSpec "ISATab_VBP0000005_RSRC|2023-08-11" --veupathOntologySpec "OntologyTerm_popbio_taxonomy_RSRC|dontcare"
My first thought was that there could be circular paths (not just the subject=object condition already tested for) so I adapted it to avoid visiting any term ID more than once
WARNING THIS WILL HANG TOO
with r1(subject_term_id, object_term_id, query_term_id, path, path_length, external_database_release_id) as (
select subject_term_id, object_term_id, subject_term_id as query_term_id, ',' || subject_term_id || ',' as path, 1 as path_length, external_database_release_id
from sres.ontologyrelationship r
where subject_term_id in (4362, 4288) and external_database_release_id = 1394
union all
select r2.subject_term_id, r2.object_term_id, query_term_id, r1.path || r2.subject_term_id || ',', r1.path_length + 1, r2.external_database_release_id
from sres.ontologyrelationship r2, r1
where r2.subject_term_id = r1.object_term_id
and r2.external_database_release_id = r1.external_database_release_id
and instr(r1.path, ',' || r2.subject_term_id || ',') = 0 -- make sure no nodes are visited twice
)
select object_term_id
from r1
group by object_term_id
having count(distinct query_term_id) = 2
order by avg(path_length);
But interestingly, running it without the order by
is enlightening. This won't hang if you run it in sqldeveloper. It will let you page through the results.
with r1(subject_term_id, object_term_id, query_term_id, path, path_length, external_database_release_id) as (
select subject_term_id, object_term_id, subject_term_id as query_term_id, ',' || subject_term_id || ',' as path, 1 as path_length, external_database_release_id
from sres.ontologyrelationship r
where subject_term_id in (4362, 4288) and external_database_release_id = 1394
union all
select r2.subject_term_id, r2.object_term_id, query_term_id, r1.path || r2.subject_term_id || ',', r1.path_length + 1, r2.external_database_release_id
from sres.ontologyrelationship r2, r1
where r2.subject_term_id = r1.object_term_id
and r2.external_database_release_id = r1.external_database_release_id
and instr(r1.path, ',' || r2.subject_term_id || ',') = 0 -- make sure no nodes are visited twice
)
select object_term_id, query_term_id, path_length, path
from r1;
To start with, there are five duplicates of every relationship with path length 1
object_term_id, query_term_id, path_length, path
--
4250 4288 1 ,4288,
4250 4288 1 ,4288,
4250 4288 1 ,4288,
4250 4288 1 ,4288,
4250 4288 1 ,4288,
4092 4362 1 ,4362,
4092 4362 1 ,4362,
4092 4362 1 ,4362,
4092 4362 1 ,4362,
4092 4362 1 ,4362,
Then there are 25 rows of each of the path length 2 paths
4249 4288 2 ,4288,4250,
4249 4288 2 ,4288,4250,
4249 4288 2 ,4288,4250,
4249 4288 2 ,4288,4250,
4249 4288 2 ,4288,4250,
4249 4288 2 ,4288,4250,
4249 4288 2 ,4288,4250,
4249 4288 2 ,4288,4250,
4249 4288 2 ,4288,4250,
4249 4288 2 ,4288,4250,
4249 4288 2 ,4288,4250,
4249 4288 2 ,4288,4250,
4249 4288 2 ,4288,4250,
4249 4288 2 ,4288,4250,
4249 4288 2 ,4288,4250,
4249 4288 2 ,4288,4250,
4249 4288 2 ,4288,4250,
4249 4288 2 ,4288,4250,
4249 4288 2 ,4288,4250,
4249 4288 2 ,4288,4250,
4249 4288 2 ,4288,4250,
4249 4288 2 ,4288,4250,
4249 4288 2 ,4288,4250,
4249 4288 2 ,4288,4250,
And then 125 of the path length 3 paths, 625 of path length 4 and so on
So it's probably not getting into a loop but it's ballooning and getting out of hand (and doesn't complete in 30 minutes at least).
I suspect that the ontologyrelationship
table has been populated 5 times. When I wrote the code, there was only one copy and it worked fine. If we have to reload the PopBio data or load a new dataset with multiple species assays per sample, the reconcile species script will hang and the workflow will stop.
While writing this I figured the solution is to make a temporary copy of sres.ontologyrelationship
filtered for external_database_release_id
and made distinct
for all the columns we are using.
I'm not a SQL wizard (and definitely not Oracle) - please could I have some help coding this?
Where B and C both have common ancestor A, the algorithm in
./Load/bin/reconcilePopBioSpecies.pl
returns A.But when you ask for the common ancestor of A and B (in that same hypothetical tree) it returns the parent of A, let's call it Z. It should be A.
Note - when this is fixed we should do a reload of all studies with multiple species assays per sample. This is not a simple EDA query to figure out (it would require a 'count_of_children' derived variable). "Pyretophorus" is one of the offending reconciliation outputs, it is present in VBP0000004 and VBP0000005. Aha... these reconciliation outputs have
species qualifier = ambiguous
- filtering on that gives us the studies we should reload!