geneontology / go-site

A collection of metadata, tools, and files associated with the Gene Ontology public web presence.
http://geneontology.org
BSD 3-Clause "New" or "Revised" License

Downloads table shows incorrect annotation counts #1999

Open suzialeksander opened 1 year ago

suzialeksander commented 1 year ago

Reported by a user: the annotation counts at http://current.geneontology.org/products/pages/downloads.html are apparently calculated BEFORE IBAs are appended, so the counts are incorrect.

Example:

Homo sapiens EBI Gene Ontology Annotation Database (goa) protein 569544 goa_human.gaf (gzip)

sjcarbon@moiraine:/tmp$:) zcat goa_human.gaf.gz | grep -v '^!' | grep -v [[:space:]]IBA[[:space:]] | wc -l
569544
sjcarbon@moiraine:/tmp$:) zcat goa_human.gaf.gz | grep -v '^!' | wc -l
630741
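The two counts above differ only in whether IBA rows are excluded. A minimal Python sketch of the same comparison follows. It assumes the standard GAF layout, where the evidence code is the seventh tab-separated column; the function names `evidence_counts` and `summarize` are made up for illustration and are not part of go-site. Unlike the `grep` pipeline, it matches the evidence column exactly and skips lines with fewer than seven columns, so results could differ slightly on unusual files.

```python
import gzip
from collections import Counter


def evidence_counts(path):
    """Count annotation lines per evidence code in a gzipped GAF file."""
    counts = Counter()
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.startswith("!"):
                continue  # header/comment line, not an annotation
            columns = line.rstrip("\n").split("\t")
            if len(columns) < 7:
                continue  # blank or malformed line (assumption: skip it)
            counts[columns[6]] += 1  # column 7 holds the evidence code
    return counts


def summarize(path):
    """Return the total annotation count and the count without IBA rows."""
    counts = evidence_counts(path)
    total = sum(counts.values())
    return {"total": total, "without_iba": total - counts["IBA"]}
```

Calling `summarize("goa_human.gaf.gz")` on a local copy of the file would give both numbers in one pass.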

kltm commented 1 year ago

@suzialeksander @dustine32 I believe I have a fix in place. Could you please check it out the next (next) time a snapshot goes through?

suzialeksander commented 1 year ago

Still incorrect. Checking the chicken protein and human protein files, we now are overcounting: the human protein file has 630741 lines, count on snapshot is 634098.

kltm commented 1 year ago

The produced page lists as: 634098. Line count:

sjcarbon@moiraine:/tmp$:) zcat goa_human.gaf.gz | wc -l
634139

Comment count:

sjcarbon@moiraine:/tmp$:) zcat goa_human.gaf.gz | grep ^! | wc -l
41

634139 - 41 = 634098

I think that you are looking at the last release file? Remember that the downloads.html you are looking at is a hard-coded product for release. You need to pull the file from snapshot.

suzialeksander commented 1 year ago

Oh, the good news is you're correct and the counts seem accurate. The bad news is, from a user standpoint, the files available at snapshot.geneontology.org/ anything are expected to actually come from snapshot. So, this ticket's issue seems to be solved but there's another documentation problem here.

kltm commented 1 year ago

Yes, I can see that. The downloads.html file has always been produced as a "product" that is hard-wired for the use case of creating the page that links from http://geneontology.org/docs/download-go-annotations/ . We could file a "bug" to have that page locked to the pipeline instead of hard-coded, so that it would work from snapshots as well, but that is separate from this issue.

kltm commented 1 year ago

This still seems incorrect on release:

339422 vs

bbop@wok:/home/skyhook/release/annotations$ zcat mgi.gaf.gz | wc -l
520798

Regression?

kltm commented 1 year ago

Source code change should be at https://github.com/geneontology/go-site/commit/8c5aae43c30912c16d0fb18d0c35276a1d3827aa

suzialeksander commented 1 year ago

as a point of interest, some counts are correct, the SGD gaf is actually 121657:

Saccharomyces Genome Database (sgd) n/a 121657 sgd.gaf (gzip)

kltm commented 1 year ago

Huh. Something else besides PAINT causing the undercount then? Hm.

kltm commented 1 year ago

Ugh, I think I know the answer: this is the same problem we had before with PAINT, but with a Noctua flavor that we have no way around trivially.

Essentially, I'm 99% sure that the undercount is due to the numbers in the JSON we're reading not including the Noctua annotation count. Up until recently, the Noctua files and counts have been ignored, and they are apparently not included here because they are optional (most groups do not include them) and exist only as GPADs, so they bypass some code points. I believe that only MGI and ZFIN would be affected by this undercount.

Okay, what do we do? The two things that come to mind are:

  1. Read the final GAF length after the fact and report that
  2. Add the Noctua file length and optionality to the metadata and conditionally add that

Ideally, we'd have all the Noctua data available in the JSON and then conditionally add it (the latter likely being the more annoying bit).
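Option 2 above could be sketched roughly as follows. This is only an illustration: the record layout and the field names `annotation_count` and `noctua_annotation_count` are assumptions, not the real schema of the JSON the pipeline reads, and the numbers are toy values.

```python
import json


def reported_count(entry):
    """Return the count to show for one downloads-table row.

    `entry` is a hypothetical metadata record; both field names used
    here are assumptions made for this sketch.
    """
    base = entry["annotation_count"]
    noctua = entry.get("noctua_annotation_count")  # optional per group
    return base if noctua is None else base + noctua


# Toy records: one group with Noctua annotations, one without.
example = json.loads("""[
  {"id": "mgi", "annotation_count": 100, "noctua_annotation_count": 25},
  {"id": "sgd", "annotation_count": 40}
]""")

for entry in example:
    print(entry["id"], reported_count(entry))
```

The conditional part is small; the harder part, as noted, is getting the Noctua counts into the metadata in the first place.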

kltm commented 1 year ago

@pgaudet @suzialeksander I think the solution is likely not too hard, but will take some fiddling. While the issue should only affect a handful of the downloads, it is rather confusing. If we need this done "soon", I might recommend bringing somebody else in to help out, maybe @mugitty (who is already working with the combined.reports.json for the assigned_by reports). If we can wait a little for some current obligations I have to expire, I can get back on this.

pgaudet commented 1 year ago

I would be in favor of solution 1,

Read the final GAF length after the fact and report that number

It would be nice to have the numbers right, or to remove the numbers from the webpage until they are fixed. Is this 'easier'?

Thanks, Pascale

kltm commented 1 year ago

@pgaudet Then there would be some fiddling with filenames, having to move files around in the pipeline, etc. "1" can be done, but it is likely more work than "2". (see https://github.com/geneontology/go-site/issues/1999#issuecomment-1561800457)

For work order, the choices are: remove the number (out next release, but easy); have somebody work on it now (we'd have to see who's available; next release); or have me work on it (next release when I'm available, this one or the next).

kltm commented 1 year ago

@pgaudet I'll strike my comment above: we are doing a lot of weird stuff and likely to do more. Manual line count is probably the best way--I'm with you on "1".
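A manual line count along the lines of option "1" might look like the sketch below. It assumes the final files sit in one directory and end in `.gaf.gz`; the function names are hypothetical, and where such a step would hook into the pipeline is not addressed here.

```python
import gzip
from pathlib import Path


def annotation_line_count(path):
    """Count the lines of a gzipped GAF that do not start with '!'."""
    count = 0
    with gzip.open(path, "rb") as handle:
        for line in handle:
            if not line.startswith(b"!"):
                count += 1
    return count


def counts_for_directory(directory):
    """Map each *.gaf.gz file name in `directory` to its annotation count."""
    return {
        path.name: annotation_line_count(path)
        for path in sorted(Path(directory).glob("*.gaf.gz"))
    }
```

This is roughly what `zcat FILE | grep -v '^!' | wc -l` does per file, so the reported number reflects whatever ends up in the final GAF regardless of how it was assembled.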

suzialeksander commented 1 year ago

I just checked a few files and the counts seem correct for now; not sure if anyone actually addressed it or the counts are correct by coincidence.

kltm commented 1 year ago

@suzialeksander I suspect more of the latter--nothing has been done specifically, but other mechanisms elsewhere may be bringing in the numbers that were undercounted. That said, I would expect MGI and ZFIN to still be affected: https://github.com/geneontology/go-site/issues/1999#issuecomment-1560307480

pgaudet commented 1 year ago

Does anyone object to removing the counts? They seem unnecessary now that we have stats on the front page of the GO website.

addiehl commented 1 year ago

I really like having the counts on the Download annotation page, as I screenshot this page every year when I teach about GO; it seems very impressive and allows me to discuss a bit about the depth of GO annotation and the value of model organism databases. The graph on the statistics page "Number of annotations by evidence" is also useful, but is not set up for direct comparisons of annotations per species. And it is nice to see actual numbers, not just bars. I realize there is concern about the absolute accuracy of the numbers on the Downloads page, but for the most part the numbers are quite close to actual values, if I understand the preceding discussion. (In fact, I immediately saved a copy of these numbers in case they go away.)

pfey03 commented 1 year ago

Our Dicty numbers are a bit off. QuickGO, which has the last release numbers, shows a total of 78,853 annotations, probably because of the many obsoletions and because IEAs went down. I can download as GPAD or as GAF, but where to submit?

pgaudet commented 1 year ago

Hi @addiehl

Have you looked at this page? https://geneontology.org/stats.html You can filter by species to get the evolution of the annotations:

[image]