geneontology / go-site

A collection of metadata, tools, and files associated with the Gene Ontology public web presence.
http://geneontology.org
BSD 3-Clause "New" or "Revised" License

Consider not merging/injecting PAINT files from some groups #597

Open cmungall opened 6 years ago

cmungall commented 6 years ago

Currently we merge PAINT files during the GOC release pipeline. This has the advantage that we do not need to wait for all groups to incorporate into their source database before the public sees them. However, it behooves us to get the PAINT annotations right:

From @ValWood in https://github.com/geneontology/helpdesk/issues/112#issuecomment-380044529

I would prefer if you did not include the pombe data in AmiGO until the issues reported here: https://github.com/geneontology/go-annotation/issues?q=is%3Aissue+is%3Aopen+label%3A%22PAINT+annotation%22 are fixed.

This is why we did not import them into PomBase yet. When we import into PomBase, we will export with the correct identifiers for downstream software.

We are using GO term finder/mapper for analyses (as do our community), and currently the PAINT annotations are 'contaminating' the analyses.

We would not lose very much by excluding because most of the approved annotations are redundant. See also geneontology/go-annotation#1879

cmungall commented 6 years ago

It is possible for us to hold off the PAINT inject on a group-by-group basis (although this adds complexity to our overall dataflow, which is already complex, involving lots of peer-to-peer communication...).

However, we should first explore whether @ValWood's issues can be addressed in a timely fashion. @thomaspd and @huaiyumi, comments?
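For illustration, a group-by-group hold-off of the PAINT inject could be sketched roughly as below. This is a minimal sketch, not the pipeline's actual code: the function name `merge_paint`, the `held_back` set, and the group key `"pombase"` are all hypothetical.

```python
from typing import Iterable, List, Set


def merge_paint(
    base_lines: Iterable[str],
    paint_lines: Iterable[str],
    group: str,
    held_back: Set[str],
) -> List[str]:
    """Append PAINT annotation lines to a group's own lines,
    unless that group has asked for the inject to be held off.

    All names here are hypothetical; this is not the go-site pipeline code.
    """
    merged = list(base_lines)
    if group not in held_back:
        merged.extend(paint_lines)
    return merged


# Hypothetical usage: hold off the inject for one group only.
held_back = {"pombase"}
pombase_out = merge_paint(["base-1"], ["paint-1"], "pombase", held_back)
other_out = merge_paint(["base-2"], ["paint-2"], "some_other_group", held_back)
```

The extra `held_back` state is exactly the added dataflow complexity mentioned above: someone has to maintain that list and decide when a group comes off it.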

ValWood commented 6 years ago

So, I could theoretically, in the next couple of weeks, import PAINT but filter the ones with IDs mentioned in the tickets: https://github.com/geneontology/go-annotation/issues?q=is%3Aissue+is%3Aopen+label%3A%22PAINT+annotation%22
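A rough sketch of that kind of filter, assuming the PAINT data arrives as GAF lines (tab-separated, with the DB Object ID in column 2 and the GO ID in column 5) and that the IDs collected from the tickets may be either gene identifiers or GO terms. The function name and the way flagged IDs are gathered are hypothetical, not something PomBase or the GO pipeline is known to use.

```python
from typing import Iterable, List, Set


def drop_flagged_annotations(gaf_lines: Iterable[str], flagged_ids: Set[str]) -> List[str]:
    """Return GAF lines, omitting annotations whose gene ID or GO ID is flagged.

    Assumption: `flagged_ids` was collected by hand from the open tickets.
    """
    kept = []
    for line in gaf_lines:
        # Header and comment lines in GAF start with "!"; keep them untouched.
        if line.startswith("!"):
            kept.append(line)
            continue
        cols = line.rstrip("\n").split("\t")
        # GAF column 2 is the DB Object ID, column 5 is the GO ID.
        if len(cols) >= 5 and (cols[1] in flagged_ids or cols[4] in flagged_ids):
            continue
        kept.append(line)
    return kept
```

Filtering this way only hides the reported annotations; anything not yet ticketed would still be imported.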

However, there are 71 PAINT tickets open, and it would be better to make sure that these are resolved (for everyone's benefit). I also hadn't finished checking (I only did spot checks on "unknown" and "Matrix" outliers), but I was hoping the bulk of the issues reported would be fixed before we imported. I'm sure that once we import there will be more queries....

I now see that most of these tickets have no assignee, so I am not sure if anyone monitors this tracker? I will assign to Pascale @pgaudet to distribute accordingly....

I'm sorry for being so fussy. I think we are the last people to import which is why the issue cropped up, but I spent a lot of time (over a decade!) getting the annotations pristine so I can't bear to muck it up ;(

ValWood commented 6 years ago

Or maybe it is 2 decades. It's quite sad that I know all the genes but I have no idea what year it is....

cmungall commented 6 years ago

@ValWood we discussed this on the go-managers call, and I brought up your concerns. I believe the resolution is that @pgaudet will work through the outstanding pombase tickets on the tracker. Do you think this will resolve the issue you have? You said that these represent only your spot-checks. Is there additional QC we could do?

Pascale, correct me if I have this wrong.

ValWood commented 6 years ago

Well I doubt Pascale will be able to get to these quickly?

So in the meantime we have many erroneous annotations for PomBase genes in GO, and different annotations in GO and in PomBase? There are lots of errors. Some appear to be to do with PAINT families (a minor number). Most are due to the transfer of problem annotations. So the problems won't go away until all the errors in the originating annotations are fixed (many as yet unreported, I did not check the entire matrix). In addition, at PomBase we don't annotate to causally upstream, but lots of DBs would make these annotations, and they get transferred (with no indication that they are causally upstream). Our users don't expect to see "causally upstream" genes assigned to specific processes (they can use phenotype annotations for this). Also, the taxon checks are not working. Finally, we remove annotations redundant with existing manual annotation. So there are lots of different problems, and we would like to deal with those before we submit to the GOC.
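As one illustration of the last point, removing PAINT annotations that are redundant with existing manual annotation might look roughly like this. It is a simplified sketch under stated assumptions, not PomBase's actual filter: annotations are reduced to (gene ID, GO term ID) pairs, and `ancestors` is a hypothetical precomputed mapping from each term to its ancestor terms in the ontology.

```python
from typing import Dict, Iterable, List, Set, Tuple

Annotation = Tuple[str, str]  # (gene ID, GO term ID)


def remove_redundant(
    paint: Iterable[Annotation],
    manual: Iterable[Annotation],
    ancestors: Dict[str, Set[str]],
) -> List[Annotation]:
    """Drop PAINT annotations already implied by a manual annotation.

    A PAINT annotation counts as redundant here if the same gene has a manual
    annotation to the same term or to a more specific (descendant) term.
    """
    covered: Dict[str, Set[str]] = {}
    for gene, term in manual:
        terms = covered.setdefault(gene, set())
        terms.add(term)
        # A manual annotation to a term also covers all of that term's ancestors.
        terms.update(ancestors.get(term, set()))
    return [(gene, term) for gene, term in paint if term not in covered.get(gene, set())]
```

A real filter would also need to consider qualifiers, evidence codes, and annotation extensions, which this sketch ignores.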

It would be better not to merge the PAINT files into the GO release pipeline. I was waiting to do a test import (and at that point I will evaluate the rest of the annotations, as it will be easier once redundant annotations from the pipeline are filtered). However, there is no point in me generating more annotation tickets when so many are open....I will do this when the current batch are fixed.

These annotation issues have quite a big effect on our data (and hence on our own analysis and that of our users). You can see the number of spurious annotations by comparing the 2 matrix figures in this ticket: there are lots of differences, and every one I looked at in an intersect was an annotation error. https://github.com/geneontology/helpdesk/issues/112#issuecomment-378838559

So I can't imagine the problem will be solved for us while the annotations are put directly into the GOC pipeline. If the file did not go into GOC this would remove the pressure from Pascale to do the tickets soon. I'm sure she won't want to be doing this before the GO meeting....

pgaudet commented 6 years ago

Marc and I will look into this this week, I'll know more about how fast we can fix all by Thursday. Will keep you posted.

Pascale

pgaudet commented 6 years ago

That being said, I completely agree we should do some QC before injecting the PAINT data into the files.

@cmungall I can send a list of things we could check (@ValWood any input is welcome!)

Thanks, Pascale

ValWood commented 6 years ago

Yes, lots will probably come out of the tickets.

I'd still like to have control of the file ;), not because I am a control freak, but up until this file existed the data in PomBase always matched closely the data in AmiGO....so when people click on the AmiGO link they would get the same gene lists. This is no longer the case.... which will be confusing for our users (in addition to all of the above). But there will always be a time lag to fix problems....

pgaudet commented 1 month ago

I think this is out of date?

@kltm are we not injecting PAINT annotations for all the files we produce?

kltm commented 1 month ago

@pgaudet I'm not sure it's out-of-date, but it seems to not be a current topic/issue? The request seems to be that we not inject data; we are currently injecting for the groups listed here: https://github.com/geneontology/go-site/blob/master/metadata/datasets/paint.yaml .