Open suzialeksander opened 1 year ago
@suzialeksander @dustine32 I believe I have a fix in place. Could you please check it out the next (next) time a snapshot
goes through?
Still incorrect. Checking the chicken protein and human protein files, we now are overcounting: the human protein file has 630741 lines, count on snapshot
is 634098.
The produced page lists as: 634098.
Line count:
sjcarbon@moiraine:/tmp$:) zcat goa_human.gaf.gz | wc -l
634139
Comment count:
sjcarbon@moiraine:/tmp$:) zcat goa_human.gaf.gz | grep ^! | wc -l
41
634139 - 41 = 634098
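The same arithmetic (total lines minus `!` comment lines) can be wrapped in a small shell function. This is only a sketch: `count_gaf_annotations` is a hypothetical helper name, not something in go-site, and it assumes the input is a gzipped GAF whose comment lines start with `!`.

```shell
# Hypothetical helper (not part of go-site): count annotation lines in a
# gzipped GAF by subtracting '!' comment lines from the total line count.
count_gaf_annotations() {
    # Total number of lines in the decompressed file.
    total=$(gzip -dc "$1" | wc -l)
    # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
    comments=$(gzip -dc "$1" | grep -c '^!' || true)
    echo $((total - comments))
}
```

Calling `count_gaf_annotations goa_human.gaf.gz` on the snapshot file would be expected to reproduce the 634139 - 41 subtraction above.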
I think that you are looking at the last release file? Remember that the downloads.html you are looking at is a hard-coded product for release. You need to pull the file from snapshot.
Oh, the good news is you're correct and the counts seem accurate. The bad news is, from a user standpoint, the files available at snapshot.geneontology.org/ anything are expected to actually come from snapshot. So, this ticket's issue seems to be solved but there's another documentation problem here.
Yes, I can see that.
The downloads.html file has always been produced as a "product" that is hard-wired for the use case of creating the page that links from http://geneontology.org/docs/download-go-annotations/ .
We could add a "bug" saying that we want that to be tied to the pipeline instead of hard-coded, so it would work from snapshots as well, but that is separate from this issue.
This still seems incorrect on release: 339422 vs
bbop@wok:/home/skyhook/release/annotations$ zcat mgi.gaf.gz | wc -l
520798
Regression?
Source code change should be at https://github.com/geneontology/go-site/commit/8c5aae43c30912c16d0fb18d0c35276a1d3827aa
Huh. Something else besides PAINT causing the undercount then? Hm.
Ugh, I think I know the answer: this is the same problem we had before with PAINT, but with a Noctua flavor that we have no way around trivially.
Essentially, I'm 99% sure that the undercount is due to the numbers in the JSON we're reading not including the Noctua annotation count. Up until recently, the Noctua files and counts have been ignored, and they are apparently not included here: they are optional, most groups do not include them, and they exist only as GPADs, so they bypass some code points. I believe that only MGI and ZFIN would be affected by this undercount.
Okay, what do we do? The two things that come to mind are:
Ideally, we'd either have all the Noctua data available in the JSON and then conditionally add it (the latter likely being the more annoying bit).
@pgaudet @suzialeksander I think the solution is likely not too hard, but will take some fiddling. While the issue should only affect a handful of the downloads, it is rather confusing. If we need this done "soon", I might recommend bringing somebody else in to help out, maybe @mugitty (who is already working with the combined.reports.json for the assigned_by reports). If we can wait a little for some current obligations I have to expire, I can get back on this.
I would be in favor of solution 1,
Read the final GAF length after the fact and report that number
It would be nice to have the numbers right, or to remove the numbers from the webpage until they are fixed. Is this 'easier'?
Thanks, Pascale
@pgaudet Then there would be some fiddling with filenames, having to move files around in the pipeline, etc. "1" can be done, but it is likely more work than "2". (see https://github.com/geneontology/go-site/issues/1999#issuecomment-1561800457)
For work order, the choices are: removing the number (out next release but easy), having somebody work on it now (we'd have to see who's available and next release), or having me work on it (next release when I'm available, this or the next).
@pgaudet I'll strike my comment above: we are doing a lot of weird stuff and likely to do more. Manual line count is probably the best way--I'm with you on "1".
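For reference, option "1" (reading the final GAF length after the fact) might look roughly like the sketch below. `report_gaf_counts` is a hypothetical name, the directory layout is an assumption, and every non-`!` line is counted as an annotation, blank lines included if a file has any.

```shell
# Hypothetical sketch: print "<file><TAB><annotation count>" for each
# gzipped GAF found directly inside the directory given as $1.
report_gaf_counts() {
    for f in "$1"/*.gaf.gz; do
        # Skip the literal pattern when the directory has no matching files.
        [ -e "$f" ] || continue
        # Count lines that do NOT start with '!' (i.e. non-comment lines).
        n=$(gzip -dc "$f" | grep -vc '^!' || true)
        printf '%s\t%s\n' "$(basename "$f")" "$n"
    done
}
```

Because this reads the finished files rather than the intermediate JSON, it should not matter whether PAINT or Noctua annotations were merged in earlier or later in the pipeline.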
I just checked a few files and the counts seem correct for now; not sure if anyone actually addressed it or the counts are correct by coincidence.
@suzialeksander I suspect more of the latter--nothing has been done specifically, but other mechanisms elsewhere may be bringing in the numbers for the undercount. That said, I would expect MGI and ZFIN to still be affected: https://github.com/geneontology/go-site/issues/1999#issuecomment-1560307480
Does anyone object to removing the counts? That seems unnecessary, now that we have stats on the front page of the GO website.
I really like having the counts on the Download annotation page, as I screen shot this page every year when I teach about GO; it seems very impressive and allows me to discuss a bit about the depth of GO annotation and the value of model organism databases. The graph on the statistics page "Number of annotations by evidence" is also useful, but is not set up for direct comparisons of annotations per species. And it is nice to see actual numbers, not just bars. I realize there is concern about the absolute accuracy of the numbers on the Downloads page, but for the most part the numbers are quite close to actual values, if I understand the preceding discussion. (In fact, I immediately saved a copy of these numbers in case they go away.)
Our Dicty numbers are a bit off. In QuickGO, which has the last release numbers, we have a total of 78,853 annotations, probably because of the many obsoletions and because IEAs went down. I can download as GPAD or as GAF, but where to submit?
Hi @addiehl
Have you looked at this page? https://geneontology.org/stats.html You can filter by species to get the evolution of the annotations:
Reported by a user: http://current.geneontology.org/products/pages/downloads.html annotation counts are apparently showing counts calculated BEFORE IBAs are appended, so the counts are incorrect.
Example:
sjcarbon@moiraine:/tmp$:) zcat goa_human.gaf.gz | grep -v '^!' | grep -v [[:space:]]IBA[[:space:]] | wc -l
569544
sjcarbon@moiraine:/tmp$:) zcat goa_human.gaf.gz | grep -v '^!' | wc -l
630741
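The two counts above differ by 630741 - 569544 = 61197 lines, which would be the IBA annotations. A slightly stricter variant of the same check is sketched below: instead of matching `IBA` anywhere on the line, it tests the evidence code field, which is column 7 in GAF. `count_by_iba` is a hypothetical helper name, and tab-separated GAF input is assumed.

```shell
# Hypothetical helper: report total, IBA, and non-IBA annotation counts
# for a gzipped GAF, using the evidence code in tab-separated column 7.
count_by_iba() {
    gzip -dc "$1" | awk -F '\t' '
        /^!/        { next }      # skip header/comment lines
        $7 == "IBA" { iba++ }     # evidence code column
                    { total++ }   # every remaining line is an annotation
        END { printf "total=%d iba=%d non_iba=%d\n", total, iba, total - iba }
    '
}
```

If the page count matches `non_iba` rather than `total`, that would support the report that the counts are calculated before IBAs are appended.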