EBISPOT / efo

Github repo for the Experimental Factor Ontology (EFO)
https://www.ebi.ac.uk/efo/
54 stars 14 forks source link

Cyclic efo.obo part_of relationship #1320

Closed zoependlington closed 2 years ago

zoependlington commented 2 years ago

Parsing seems to get into an infinite loop when expanding on the "part_of" relation with 3.35.0 obo version, although problem may have existed previously as last confirmed working version using the Merck pipeline was 3.28.0.

More information available with efo-users email.

schelhorn commented 2 years ago

Hi Zoë,

I am repeating my original message from the EFO mailing list here for reference and have answered your question below:

Dear EFO User Group and Support,

we at Merck are using the EFO to annotate our NGS sequencing metadata; in the past, we could programmatically work with the EFO graph by using the EFO OBO edition and the ontologyIndex R package like this:

require(ontologyIndex)

efo.path  = base::file.path('efo_3.28.0_20210318.obo')
propagate = c('is_a', 'part_of', 'located_in', 'derives_from', 'derived_from')
efo.dag   = ontologyIndex::get_ontology(efo.path,
                                        propagate_relationships=propagate,
                                        extract_tags='everything')

(command completes within 20s)

However, since at least EFO 3.35, this does not work anymore. Instead, the parsing seems to get into an infinite loop when expanding on the "part_of" relation (all other relationships used above still work, so I am omitting them here):

efo.path  = base::file.path('efo_3.35.0_20210915.obo')
propagate = c('part_of')
efo.dag   = ontologyIndex::get_ontology(efo.path,
                                        propagate_relationships=propagate,
                                        extract_tags='everything')

(command never completes)

I am attaching a copy of the two EFO OBO files, which were downloaded from the EBI EFO website, to this email.

Did anything change in EFO 3.35 concerning the "part_of" relationship, or could perhaps the OBO build of EFO 3.35 contain a cycle when expanding on the "part_of" relationship? Thanks for any insights you may be able to provide.

R Session Info for reproducibility:

sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Big Sur 10.16

Matrix products: default
BLAS/LAPACK: /Users/XXXXXX/Development/Frameworks/miniconda3/envs/env-r/lib/libopenblasp-r0.3.12.dylib

locale:
[1] C/UTF-8/C/C/C/C

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base

loaded via a namespace (and not attached):
[1] compiler_4.0.3    backports_1.2.0   here_0.1          ontologyIndex_2.7
[5] rprojroot_1.3-2   tools_4.0.3       renv_0.12.2

You were also asking:

To confirm, the last working version of efo.obo for you that was non-cyclic was 3.28.0? This will help us pinpoint any changes that may have caused this as there are a number of large changes to the ontology that have happened since March 2021 including sa rework of our neoplasm/cancer branch and an improvement to our axioms, including potentially part_of.

Yes, I can confirm that. However, I have not tested any versions between 3.28 and 3.35, so I do not know when the issue arose. Also, I cannot positively confirm that there is a cycle, but that is my current working hypothesis for why the R function runs forever.

I am happy to help debugging this in case you need further support.

schelhorn commented 2 years ago

Hi Zoë,

is there anything that I could do to help solving this issue? Please let me know if there is, I’d be glad to assist.

zoependlington commented 2 years ago

Hi @schelhorn,

Apologies for the delay in this ticket, it is still under investigation.

It would be incredibly helpful to know which version the issue you are seeing first appeared so we can understand where the potential cyclic relationship was created if you are able to test the versions between 3.28 and 3.35.

Thank you.

schelhorn commented 2 years ago

Hi @zoependlington,

the (suspected) infinite loop occurs since (and including) EFO OBO 3.29.

I have executed a parsing test suite over all published EFO OBO editions, with a 60s timeout. Only EFO EBO 3.28 passes, with a parsing time of about 15s:

require(R.utils)
require(ontologyIndex)

for (efo.path in base::list.files('dat/')) {

  base::cat('Processing currently:', efo.path, '\n')
  efo.dag = base::tryCatch({
              R.utils::withTimeout({
                t = base::proc.time()
                ontologyIndex::get_ontology(base::file.path('dat', efo.path),
                                            propagate_relationships='part_of',
                                            extract_tags='everything')
                base::cat('Success parsing\n')
                base::print(base::proc.time()-t)
              }, timeout=60)},
              error=function(e) {base::cat('Failed parsing\n')})
}

and received:

Processing currently: efo_3.28.0.obo
Success parsing
   user  system elapsed 
 15.149   0.076  15.233 
Processing currently: efo_3.29.0.obo 
Failed parsing
Processing currently: efo_3.29.1.obo 
Failed parsing
Processing currently: efo_3.30.0.obo 
Failed parsing
Processing currently: efo_3.31.0.obo 
Failed parsing
Processing currently: efo_3.32.0.obo 
Failed parsing
Processing currently: efo_3.33.0.obo 
Failed parsing
Processing currently: efo_3.34.0.obo 
Failed parsing
Processing currently: efo_3.35.0.obo 
Failed parsing

Also, I used strace to attach to the running R process during one of the failed parsing attempts to see what is happening; strace does not produce any output (i.e., no syscalls), which is compatible with a tight internal (probably infinite) loop.

zoependlington commented 2 years ago

This is very useful, thank you! I will investigate and try to fix the efo.obo file before our December (15th) release.

matentzn commented 2 years ago

I dont know yet the whole answer to the problem, but @balhoff shared some insanely cool cycle detector system with me, which at least narrowed down our search:

EFO:1001869
EFO:0001369
EFO:0000792
EFO:0000989
EFO:0000991
EFO:0000998
EFO:0001073
EFO:0004340
EFO:0003786
EFO:0003787
EFO:0003768
EFO:1001463
EFO:1000697
MONDO:0000432
MONDO:0000916
Orphanet:399839
Orphanet:1941
Orphanet:275534

Some of them are false positives, as I am looking for every possible cycle, and some are totally valid (A part of B, B has part A). But Look at https://www.ebi.ac.uk/ols/ontologies/efo/terms?short_form=EFO_0003768 as an example: it predisposes to itself.

matentzn commented 2 years ago

Maybe you can use this list here to try and find the circles. What I did is a nasty trick: I materialised all part of relations as is a and then used ROBOT:

robot remove -i data/efo.owl --axioms disjoint query --update sparql/direct-links.ru reason --equivalent-classes-allowed none
MONDO:0016556 = Orphanet:226316 = MONDO:0015792
Orphanet:654 = MONDO:0003321 = MONDO:0019004
Orphanet:330197 = MONDO:0043007
EFO:0000218 = MONDO:0007573
EFO:0007589 = BTO:0000332
MONDO:0020587 = Orphanet:329
EFO:0002901 = UO:0000064
Orphanet:319465 = EFO:0000222
Orphanet:90280 = MONDO:0018827
MONDO:0020067 = MONDO:0019956
MONDO:0100237 = Orphanet:209
MONDO:0000700 = Orphanet:569
MONDO:0043003 = EFO:1000660
Orphanet:275777 = Orphanet:422
EFO:0000678 = Orphanet:2989
Orphanet:621 = MONDO:0001117
MONDO:0020704 = Orphanet:97238
Orphanet:99856 = Orphanet:370034
MONDO:0000044 = Orphanet:437
BTO:0001967 = EFO:0005218
Orphanet:91498 = Orphanet:98686
Orphanet:225154 = Orphanet:1576
Orphanet:823 = EFO:0003105
MONDO:0019050 = Orphanet:68364
MONDO:0017264 = Orphanet:461
EFO:0009043 = Orphanet:101330
MONDO:0021034 = Orphanet:79364
Orphanet:250923 = Orphanet:77
MONDO:0000995 = MONDO:0016122
MONDO:0010122 = MONDO:0018896
PO:0009006 = EFO:0000992
Orphanet:99810 = Orphanet:2940
MONDO:0018829 = Orphanet:799
MONDO:0017609 = Orphanet:97369
MONDO:0019142 = Orphanet:738

let me know how it goes.

matentzn commented 2 years ago

These are the part of cycles:

Cycles involving part of:

<http://purl.obolibrary.org/obo/EFO_0000792> --[<http://purl.obolibrary.org/obo/BFO_0000050>]--> <http://purl.obolibrary.org/obo/EFO_0000792>
<http://purl.obolibrary.org/obo/EFO_0000989> --[<http://purl.obolibrary.org/obo/BFO_0000050>]--> <http://purl.obolibrary.org/obo/EFO_0000989>
<http://purl.obolibrary.org/obo/EFO_0000991> --[<http://purl.obolibrary.org/obo/BFO_0000050>]--> <http://purl.obolibrary.org/obo/EFO_0000991>
<http://purl.obolibrary.org/obo/EFO_0000998> --[<http://purl.obolibrary.org/obo/BFO_0000050>]--> <http://purl.obolibrary.org/obo/EFO_0000998>
<http://purl.obolibrary.org/obo/EFO_0001369> --[<http://purl.obolibrary.org/obo/BFO_0000050>]--> <http://purl.obolibrary.org/obo/EFO_0001369>
zoependlington commented 2 years ago

This is great, thanks for your help @matentzn!

@schelhorn, I have removed the last group of part of cycles Nico highlighted and generated a temporary obo file attached to this comment. If possible, could you test to see if this version works? If it does I'll implement it in our official release next week.

efotemp.obo.zip

schelhorn commented 2 years ago

Indeed, the new obo file works! Thanks, @matentzn and @zoependlington!

# Same test suite mentioned above
Processing currently: efotemp.obo 
Success parsing
   user  system elapsed 
 13.530   0.163  13.711 
zoependlington commented 2 years ago

These changes are now in EFO ready for the release on the 15th, so the upcoming release should work with your pipeline. Please let us know if that isn't the case. I'll move this ticket to done now.