geneontology / neo

noctua entity ontology

Unknown reduction in entities in NEO build #118

Closed kltm closed 4 months ago

kltm commented 5 months ago

Recent builds of NEO are failing on a sanity check on a reduction of entities: [ 3800000 -gt 3677609 ].

Note that this is a doc count; the actual number of entities found is on the order of ~1838000. We would be expecting something like 1900000 entities instead of the 1838000 we're getting, which is a reduction of roughly 62000 entities (out of 1900000).
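For readers following the arithmetic, here is a minimal sketch of the kind of check that is failing, using only the numbers quoted above. The function name `check_min_docs` is hypothetical and is not taken from the NEO pipeline; the "roughly two docs per entity" ratio is an inference from the 3677609 docs versus ~1838000 entities reported here, not a documented property of the index.

```shell
# check_min_docs MIN_DOCS FOUND_DOCS
# Hypothetical helper (not from the NEO pipeline): fail when the number of
# documents found is below the configured minimum, and report the shortfall.
check_min_docs() {
  if [ "$1" -gt "$2" ]; then
    echo "FAIL: $2 docs, $(($1 - $2)) below the minimum of $1"
    return 1
  fi
  echo "OK: $2 docs"
}
```

Calling `check_min_docs 3800000 3677609` reports a shortfall of 122391 docs. At roughly two docs per entity, that is about 61000 entities, in line with the ~62000 estimate.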

kltm commented 5 months ago

Tagging @vanaukenk @pgaudet

kltm commented 5 months ago

As we have access to the index, I think the easiest approach would be for me to stand up another AmiGO instance and compare it to the current NEO load, trying to isolate exactly where the drop is occurring (unless somebody happens to know off the top of their head).

kltm commented 5 months ago

Okay, I can now probe directly.

2045614 vs 1843165, so we're looking for an easy way to find the ~202449 missing entities.

Buuut, we have a problem. Looking in AmiGO, only 284394 entities even have namespaces for filtering. The largest change in that set is a reduction of about 1200 in CHEBI, out of over 100k, meaning roughly a 1% shift. It's similar for source, which doesn't give us much to explore.

kltm commented 5 months ago

I filtered out all of the above and sampled the 1558771 (of 1843165) entities w/o a namespace (this must be a bug). After some filtering, I believe I can see a ~200k reduction in RNAcentral:

RNAcentral 737750 / 528750

Does anybody have any thoughts about that? Source ftp://ftp.ebi.ac.uk/pub/databases/RNAcentral/current_release/.gpi/rnacentral.gpi.gz
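As a quick check on the size of these drops, a small sketch that computes the absolute and relative reduction from a pair of counts. `count_drop` is a hypothetical helper written for illustration; the inputs are the figures quoted in this thread.

```shell
# count_drop OLD NEW
# Hypothetical helper: print the absolute and relative reduction from OLD to NEW.
count_drop() {
  awk -v old="$1" -v new="$2" 'BEGIN {
    printf "drop=%d (%.1f%%)\n", old - new, 100 * (old - new) / old
  }'
}
```

With the RNAcentral counts, `count_drop 737750 528750` gives a drop of 209000 (28.3%); with the overall entity counts, `count_drop 2045614 1843165` gives 202449 (9.9%). The RNAcentral change alone is therefore about the size of the whole gap.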

cmungall commented 5 months ago

@alexsign is your best bet for the RNA gpi

alexsign commented 5 months ago

@cmungall @kltm I'm getting the same file and it has 36585897 lines. Source: ftp://ftp.ebi.ac.uk/pub/databases/RNAcentral/current_release/gpi/rnacentral.gpi.gz

kltm commented 5 months ago

Okay, digging in a little more, the source of our data seems to be https://ftp.ebi.ac.uk/pub/databases/RNAcentral/current_release/gpi/rnacentral.gpi.gz, which seems to give the same 36585897 lines, so I'm guessing same file. Given that, we're talking about ~37m entities there, and that's clearly nothing to do with the ~2m in the index. It turns out that there is a script we have for rnacentral that filters for

10090
10116
3702
39947
44689
4530
4896
559292
6239
7227
7955
9606

This gets us closer to my expectations:

cat /tmp/rnacentral.gpi | rnacgpi2obo.pl > /tmp/rnac-filtered.txt && cat /tmp/rnac-filtered.txt | grep -c 'id: R'
1052851

but still over the 737750 ish total for "RNAcentral" in the index. Moreover, assuming the releases/ directory is right, the file we're getting has been stable since November--months before this problem started.
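To illustrate the taxon restriction, here is a rough sketch that keeps only identifiers whose suffix matches one of the taxa listed above. It relies on the `URS..._<taxon>` identifier pattern that appears in this thread. It is not a reimplementation of `rnacgpi2obo.pl`, whose actual logic is not shown here; `filter_by_taxon` and `NEO_TAXA` are hypothetical names, and the input is assumed to be a plain list with one identifier per line (extracting identifiers from the GPI file is left out).

```shell
# Taxon IDs copied from the list in the comment above.
NEO_TAXA="10090 10116 3702 39947 44689 4530 4896 559292 6239 7227 7955 9606"

# filter_by_taxon: read identifiers on stdin (one per line) and print those
# whose suffix after the last "_" is one of the taxa in NEO_TAXA.
# Hypothetical helper; not the logic of rnacgpi2obo.pl.
filter_by_taxon() {
  awk -v taxa="$NEO_TAXA" '
    BEGIN {
      n = split(taxa, t, " ")
      for (i = 1; i <= n; i++) keep[t[i]] = 1
    }
    {
      k = split($1, parts, "_")
      if (k > 1 && (parts[k] in keep)) print $1
    }'
}
```

For example, `filter_by_taxon < ids.txt | wc -l` (with `ids.txt` a hypothetical file of identifiers) would give a count comparable to the `grep -c` figure above, provided the real script filters on the same criterion.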

kltm commented 5 months ago

Noting that the issue became apparent between Feb 7th and Feb 14th.

kltm commented 4 months ago

@vanaukenk Okay, I've dug in a little more and have some numbers. I'm leaning towards just using the "reduced" load in the future, as it seems like this reflects an actual change in the source.

Playing around with the indexes, we can see some things.

The changes almost all seem to be in RNAcentral, with records of the form:

RNAcentral:URS000234A496_9606 URS000234A496_9606 Hsap

They all conform to this pattern and exist in this identifier space.

Grabbing all RNAcentral identifiers, we have 737750 in our current NEO product and 528750 in the NEO attempt.

Diffing these, there are 213721 identifiers unique to the current NEO product and 4721 to the NEO attempt. Of the 4721 unique to the NEO attempt, 4669 are in a "continuous block" that is numerically after all previous identifiers.
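The diff described above can be sketched with `sort` and `comm`. This is one possible way to do it, not necessarily how the numbers in this comment were produced; `diff_ids` and the output file names are hypothetical, and each input file is assumed to hold one identifier per line.

```shell
# diff_ids OLD_FILE NEW_FILE OUT_DIR
# Hypothetical helper: split two identifier lists into IDs found only in
# OLD_FILE, only in NEW_FILE, and in both, then print the three counts.
diff_ids() {
  old_sorted=$(mktemp)
  new_sorted=$(mktemp)
  # comm requires sorted input; use the same locale for sort and comm.
  LC_ALL=C sort -u "$1" > "$old_sorted"
  LC_ALL=C sort -u "$2" > "$new_sorted"
  LC_ALL=C comm -23 "$old_sorted" "$new_sorted" > "$3/only_old.txt"
  LC_ALL=C comm -13 "$old_sorted" "$new_sorted" > "$3/only_new.txt"
  LC_ALL=C comm -12 "$old_sorted" "$new_sorted" > "$3/both.txt"
  rm -f "$old_sorted" "$new_sorted"
  echo "only_old=$(wc -l < "$3/only_old.txt" | tr -d ' ')" \
       "only_new=$(wc -l < "$3/only_new.txt" | tr -d ' ')" \
       "both=$(wc -l < "$3/both.txt" | tr -d ' ')"
}
```

The reported figures are internally consistent: 737750 - 213721 and 528750 - 4721 both come to 524029, which is the number of identifiers the two products share.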

Looking at a random identifier that is in the current NEO product, but not in the NEO attempt product:

https://rnacentral.org/rna/URS0000006BE3/9606

It looks like it has been removed.

While I can't exactly trace everything, as we don't save original files for this, it looks like a bit of cleanup was done at their end, which has had the effect of lowering our overall count. What do you think of bumping down the safety numbers and trying this out?

vanaukenk commented 4 months ago

@kltm I'm inclined to say let's go forward with the NEO update in Noctua so that groups who currently curate there can have their latest set of entity identifiers. Have you or anyone else in GO contacted RNAcentral directly about this?

cmungall commented 4 months ago

As @kltm says, it looks like these were intentionally removed

E.g https://rnacentral.org/rna/URS0000006BE3/9606

has no genome locations and all the disease annotations say "removed from the database"

I think the MOU for providing entity IDs to GOC should include a commitment to track entity IDs, including obsolete IDs

kltm commented 4 months ago

Cheers--there is a little bit about that above. What would the timeline be for the MOU? Meeting or long term? It may be a larger change depending on the resource, but losing remote identifiers will make things hard as time goes on...

pgaudet commented 4 months ago

Hi @blakesweeney

We have lost a large number of RNAcentral identifiers in the current release we are loading (note that we only load RNAcentral IDs for the taxa listed above): 200,000 fewer than what we previously had, which is about a 20-25% drop. We wanted to confirm with you that this was expected. The issue became apparent between Feb 7th and Feb 14th.

Thanks in advance for your help,

Pascale

blakesweeney commented 4 months ago

Hi, thanks for bringing this to my attention. We are just finishing a release now and I can look into the changes after. I think the change will have happened in release 23, as our most recent release was in November.

For reference, RNAcentral never deletes or reuses any identifiers. We always keep the old ones available, as above. Ones like URS0000006BE3_9606, where all databases which provided that sequence have stopped, are marked as inactive and no longer exported to GPI files, but the IDs still exist. If helpful, I can look at adding a GPI file for inactive accessions; we actually do this for sequences as well.

kltm commented 4 months ago

It looks like we are accepting this state.