geneontology / neo

noctua entity ontology

MGI miRNA identifiers unavailable in NEO? #70

Closed vanaukenk closed 3 years ago

vanaukenk commented 3 years ago

Messaging with @hdrabkin

It appears that MGI miRNA identifiers are not available in Noctua.

I've checked on noctua-amigo and can't find them there either.

Here's an example:

MGI MGI:3711324 Mir291a microRNA 291a mmu-mir-291a|Mirn291a gene taxon:10090

kltm commented 3 years ago

Is this actually found in the GPIs as of Friday last week? The NEO build setup does not currently allow plumbing the history of the build.
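One way to answer that question against a local copy of the file is sketched below. This is a minimal sketch, not part of the NEO tooling: it assumes tab-separated GPI rows laid out like the example above (database in column 1, object identifier in column 2) and comment lines starting with `!`. The function names and the sample header line are illustrative.

```python
import gzip

def find_in_gpi(lines, object_id):
    """Return the split rows whose second column equals object_id."""
    hits = []
    for line in lines:
        if line.startswith("!"):  # skip header/comment lines
            continue
        cols = line.rstrip("\n").split("\t")
        # Assumed layout: DB in column 1, object identifier in column 2.
        if len(cols) >= 2 and cols[1] == object_id:
            hits.append(cols)
    return hits

def find_in_gpi_file(path, object_id):
    """Same check, reading a gzipped GPI file from disk."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return find_in_gpi(handle, object_id)

# Illustrative sample built from the row quoted in this issue.
sample = [
    "!gpi-version: 1.2\n",
    "MGI\tMGI:3711324\tMir291a\tmicroRNA 291a\t"
    "mmu-mir-291a|Mirn291a\tgene\ttaxon:10090\n",
]

matches = find_in_gpi(sample, "MGI:3711324")
print(len(matches), matches[0][2])
```

Pointing `find_in_gpi_file` at a downloaded, gzipped MGI GPI would show whether a given identifier is present in that day's file, independent of what the NEO build later does with it.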

vanaukenk commented 3 years ago

Asking @hdrabkin now

vanaukenk commented 3 years ago

MGI doesn't archive their daily gpi files, but there's otherwise no indication that the ncRNAs weren't there.
@hdrabkin is checking annotation stats on the gpad files.

vanaukenk commented 3 years ago

Stats from the GPAD don't shed any additional light on this, unfortunately.

kltm commented 3 years ago

If it's there now, we'd normally wait until Friday and see what happens. Is this something where we want to run sooner?

vanaukenk commented 3 years ago

If possible, yes. @hdrabkin and his summer student were trying to annotate these in Noctua and weren't able to.

ukemi commented 3 years ago

They are in the production GPI at MGI and they are in the MGI GPI on snapshot. From snapshot:

MGI MGI:3619057 Mir100 microRNA 100 mir 100|Mirn100|mmu-mir-100 gene taxon:10090
MGI MGI:1920394 Mir100hg Mir100 Mirlet7a-2 Mir125b-1 cluster host gene 2610203C20Rik|3110039I08Rik|D230004N17Rik protein taxon:10090
MGI MGI:2676803 Mir101a microRNA 101a Mirn101|Mirn101a|mmu-mir-101a gene taxon:10090
MGI MGI:3618696 Mir101b microRNA 101b mir-101b|Mirn101b|mmu-mir-101b gene taxon:10090
MGI MGI:3619058 Mir103-1 microRNA 103-1 mir 103-1|Mirn103-1|mmu-mir-103-1 gene taxon:10090
MGI MGI:3619059 Mir103-2 microRNA 103-2 mir 103-2|mir-103-2|Mirn103-2|mmu-mir-103-2 gene taxon:10090
MGI MGI:3718453 Mir105 microRNA 105 Mirn105|mmu-mir-105 gene taxon:10090
MGI MGI:3619120 Mir106a microRNA 106a mir 106a|Mirn106a|mmu-mir-106a gene taxon:10090
MGI MGI:3619060 Mir106b microRNA 106b mir 106b|Mirn106b|mmu-mir-106b gene taxon:10090
MGI MGI:3619063 Mir107 microRNA 107 mir 107|Mirn107|mmu-mir-107 gene taxon:10090
MGI MGI:3619064 Mir10a microRNA 10a MicroRNA-10a|mir 10a|miR-10a|Mirn10a|mmu-mir-10a gene taxon:10090
MGI MGI:2676804 Mir10b microRNA 10b mir-10b|Mirn10b|mmu-mir-10b gene taxon:10090
MGI MGI:3783360 Mir1186 microRNA 1186 Mirn1186|mmu-mir-1186 gene taxon:10090
MGI MGI:3783365 Mir1192 microRNA 1192 Mirn1192|mmu-mir-1192 gene taxon:10090
MGI MGI:3783368 Mir1195 microRNA 1195 Mirn1195|mmu-mir-1195 gene taxon:10090
MGI MGI:3783369 Mir1196 microRNA 1196 Mirn1196|mmu-mir-1196 gene taxon:10090
MGI MGI:3783371 Mir1198 microRNA 1198 Mirn1198|mmu-mir-1198 gene taxon:10090
MGI MGI:2676805 Mir122 microRNA 122 Mir122a|Mir122b|Mirn122a|Mirn122b|mmu-mir-122|mmu-mir-122a gene taxon:10090
MGI MGI:3764925 Mir1224 microRNA 1224 Mirn1224|mmu-mir-1224 gene taxon:10090
MGI MGI:1917691 Mir124-2hg Mir124-2 host gene (non-protein coding) 2610100L16Rik|Gm2612 gene taxon:10090
MGI MGI:4834215 Mir1247 microRNA 1247 mmu-mir-1247 gene taxon:10090
MGI MGI:2676807 Mir124a-1 microRNA 124a-1 Mirn124a|Mirn124a-1|mmu-mir-124-1|mmu-mir-124a-1 gene taxon:10090
MGI MGI:2442197 Mir124a-1hg Mir124-1 host gene (non-protein coding) A930011O12Rik|Rncr3 gene taxon:10090
MGI MGI:3618700 Mir124a-2 microRNA 124a-2 mir-124a-2|Mirn124a-2|mmu-mir-124-2|mmu-mir-124a-2 gene taxon:10090
MGI MGI:3618704 Mir124a-3 microRNA 124a-3 mir-124a-3|Mirn124a-3|mmu-mir-124-3|mmu-mir-124a-3 gene taxon:10090
MGI MGI:2676809 Mir125a microRNA 125a Mirn125a|mmu-mir-125a gene taxon:10090
MGI MGI:2676810 Mir125b-1 microRNA 125b-1 Mirn125b|Mirn125b-1|mmu-mir-125b-1 gene taxon:10090
MGI MGI:3618706 Mir125b-2 microRNA 125b-2 mir-125b-2|Mirn125b-2|mmu-mir-125b-2 gene taxon:10090
MGI MGI:2676811 Mir126a microRNA 126a Mirn126|mmu-mir-126|mmu-mir-126a gene taxon:10090
MGI MGI:5562755 Mir126b microRNA 126b mmu-mir-126b gene taxon:10090
MGI MGI:2676812 Mir127 microRNA 127 Mirn127|mmu-mir-127 gene taxon:10090
MGI MGI:2676813 Mir128-1 microRNA 128-1 Mirn128|Mirn128-1|Mirn128a|mmu-mir-128-1|mmu-mir-128a gene taxon:10090
MGI MGI:3618709 Mir128-2 microRNA 128-2 mir 128b|mir-128b|Mirn128-2|Mirn128b|mmu-mir-128-2|mmu-mir-128b gene taxon:10090
MGI MGI:2676814 Mir129 microRNA 129 Mirn129 gene taxon:10090
MGI MGI:2676815 Mir129-1 microRNA 129-1 Mirn129-1|Mirn129b|mmu-mir-129-1 gene taxon:10090
MGI MGI:3618711 Mir129-2 microRNA 129-2 Mirn129-2|mmu-mir-129-2 gene taxon:10090
MGI MGI:2676816 Mir130a microRNA 130a Mirn130|Mirn130a|mmu-mir-130a gene taxon:10090
MGI MGI:3618716 Mir130b microRNA 130b mir 130b|Mirn130b|mmu-mir-130b gene taxon:10090
MGI MGI:2676817 Mir132 microRNA 132 Mirn132|mmu-mir-132 gene taxon:10090

ukemi commented 3 years ago

The plot thickens. If I use @vanaukenk's trick of using the identifier (MGI:3619059), I can choose it, but it doesn't resolve to the gene symbol.

ukemi commented 3 years ago

It autocompletes to an IRI.

kltm commented 3 years ago

That's what I'm seeing over here too. It seems that these are maybe getting transformed in some strange way? Is there something different about these? microRNA?

kltm commented 3 years ago

Hm, I'm wondering if it's https://github.com/geneontology/neo/blob/master/bin/fix-obo-uris.pl, although I don't understand why it would affect only a subset. Would you have any thoughts about this, @balhoff?

Also looking at https://github.com/geneontology/neo/commit/bc93eef7fc23b8050801d25235a53bd7b69fd4e7

ukemi commented 3 years ago

Here is our output GPI 2.0 file. GO seems to be interpreting it OK for the snapshot pipeline.

MGI:MGI:3619057 Mir100 microRNA 100 mmu-mir-100|mir 100|Mirn100 SO:0001263 NCBITaxon:10090
MGI:MGI:1920394 Mir100hg Mir100 Mirlet7a-2 Mir125b-1 cluster host gene 2610203C20Rik|D230004N17Rik|3110039I08Rik SO:0001263 NCBITaxon:10090
MGI:MGI:2676803 Mir101a microRNA 101a Mirn101|mmu-mir-101a|Mirn101a SO:0001263 NCBITaxon:10090
MGI:MGI:3618696 Mir101b microRNA 101b mmu-mir-101b|mir-101b|Mirn101b SO:0001263 NCBITaxon:10090
MGI:MGI:4950030 Mir101c microRNA 101c mmu-mir-101c SO:0001263 NCBITaxon:10090
MGI:MGI:3619058 Mir103-1 microRNA 103-1 mmu-mir-103-1|mir 103-1|Mirn103-1 SO:0001263 NCBITaxon:10090
MGI:MGI:3619059 Mir103-2 microRNA 103-2 mmu-mir-103-2|mir 103-2|mir-103-2|Mirn103-2 SO:0001263 NCBITaxon:10090
MGI:MGI:3718453 Mir105 microRNA 105 mmu-mir-105|Mirn105 SO:0001263 NCBITaxon:10090
MGI:MGI:3619120 Mir106a microRNA 106a mmu-mir-106a|mir 106a|Mirn106a SO:0001263 NCBITaxon:10090
MGI:MGI:3619060 Mir106b microRNA 106b mmu-mir-106b|mir 106b|Mirn106b SO:0001263 NCBITaxon:10090
MGI:MGI:3619063 Mir107 microRNA 107 mmu-mir-107|mir 107|Mirn107 SO:0001263 NCBITaxon:10090

SO:0001263 = ncRNA gene

kltm commented 3 years ago

@ukemi IIRC, the snapshot pipeline should have no contact with the GPI files, so those would likely be coming in through GAF. As a side note, I did remember how to get at run information for the NEO build, so we can dive in a little more at the completion of this run (if it doesn't fix things).

hdrabkin commented 3 years ago

When I attempted to put it in an Add individual, it would not enter; clicking on it resulted in no individual being added.

If used in a has_input statement, that annotation did NOT display in the annotation preview (no extension with has_input).

ukemi commented 3 years ago

Checked this morning. They still autocomplete as identifiers.org URIs, but now I can no longer select them and have them added to a model. I was able to do this successfully for one yesterday.

So, they are still unavailable for annotation.

kltm commented 3 years ago

@ukemi Hm. That's odd: we have made no updates since yesterday. That said, I'll be reloading the NEO load in just a few minutes. [five minutes into the future] Now refreshed.

kltm commented 3 years ago

Okay, using MGI:3619059 as my test case, I've found that:

SKIPPING line 58541: see https://github.com/geneontology/go-site/issues/595
EXPECTED 10 COLS: 6 :  MGI:MGI:3619059  Mir103-2        microRNA 103-2  mmu-mir-103-2|mir 103-2|mir-103-2|Mirn103-2     SO:0001263      NCBITaxon:10090 

(The proximate command is gzip -dc mirror/mgi.gpi.gz | ./gpi2obo.pl -s Mmus -n mgi > target/neo-mgi.obo.tmp && mv target/neo-mgi.obo.tmp target/neo-mgi.obo)

Looking at these lines in mgi.gpi, they seem to have 10 columns, but the script is looking for at least 7 (?) to be populated (https://github.com/geneontology/neo/blob/03f40dd08551edb4babb8ae4c3c7e3e42c389458/gpi2obo.pl#L43).

Looking around there and at the lines that do make it through, I'm pretty sure that there is a bug in there with how split is being used with LIMIT. Fixing that, it looks like target/neo-mgi.obo likely has the right contents. https://github.com/geneontology/neo/pull/71
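For readers unfamiliar with the pitfall, here is a hedged illustration of one way a 10-column row can be counted as 6. It does not reproduce the actual code in gpi2obo.pl or the change in the pull request, which are not quoted here. It emulates, in Python, the documented Perl behavior that `split` with no LIMIT (or a LIMIT of zero) drops trailing empty fields, while a negative LIMIT keeps them. The helper names and the sample row (modeled on the skipped line, with four assumed empty trailing columns) are illustrative.

```python
def split_like_perl_default(line, sep="\t"):
    """Emulate Perl split with no LIMIT: trailing empty fields are dropped."""
    fields = line.split(sep)
    while fields and fields[-1] == "":
        fields.pop()
    return fields

def split_keep_trailing(line, sep="\t"):
    """Emulate Perl split with a negative LIMIT: trailing empty fields are kept."""
    return line.split(sep)

# Hypothetical 10-column row: 6 populated fields, then 4 empty ones (assumed).
row = "\t".join([
    "MGI:MGI:3619059", "Mir103-2", "microRNA 103-2",
    "mmu-mir-103-2|mir 103-2|mir-103-2|Mirn103-2",
    "SO:0001263", "NCBITaxon:10090", "", "", "", "",
])

print(len(split_like_perl_default(row)))  # 6
print(len(split_keep_trailing(row)))      # 10
```

Under these assumptions, a column-count check written against the first form would reject any row whose last populated column comes before the final one, which is consistent with only a subset of entities going missing.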

kltm commented 3 years ago

We've taken the fix. I'll report here for retesting once the NEO pipeline has been run again.

vanaukenk commented 3 years ago

Thank you @kltm!

kltm commented 3 years ago

@ukemi @vanaukenk Reloaded. I'm now having pretty good luck finding the wayward entities. (There also seems to be a ~10% increase in entities overall.)

hdrabkin commented 3 years ago

OK, as of 4:09 I can input any of the miRNAs we were trying, just based on symbols or gene IDs, which we could not do a few days ago. Thanks @kltm!

vanaukenk commented 3 years ago

@kltm @hdrabkin @ukemi

Thanks for the fix, Seth. I'm closing this ticket; if anything else comes up wrt neo, we can open a new one.