lpalbou closed 3 years ago
modify the GO site menu to provide a clear access to this archive. Note that if we do merge it to release.geneontology.org, it will also contain the newer releases. @pgaudet suggestions ?
Do you want suggestions about the GO site menu ? I suppose you mean geneontology.org ? We already have a menu, 'Archived data', that seems appropriate ?
Note that if we do merge it to release.geneontology.org, it will also contain the newer releases.
How will this look like in practice ? If this is all the data we can merge 'Data' and 'Archive'-
I hope I understood your question.
Thanks, Pascale
Whenever I mention GO site, I mean geneontology.org yes. At the GO meeting, Paul presented another menu design to access the archive, but ok to keep it that way. The whole page has to be rewritten but I can do that.
How will this look like in practice ?
For 2018-02 (built from archive):
For 2018-03 (same content as what we have on release.geneontology.org, just new design - quick mockup):
If we move on with this plan of release.geneontology.org containing both the archive and "newer" releases (2018+), I think this should become the general quick access to GO download; currently the GO site is IMO too complicated to get to your file. In the initial mockup we wanted a "DOWNLOAD" button on the main page, something nice and easy, maybe that's the occasion.
Note: If we want consistency and simplify file access for the 2018+ releases, I could possibly filter out the folders we don't want for these releases and provide a parameter (eg URL parameter) to show them for @kltm usages.
I thought improving the download page was out of scope for this project, and that we'd do it at a future iteration, together with
Otherwise we have to make a change now and another change later.
Am I misremembering ?
Thanks, Pascale
What you are referring to is a substantial refactoring of the archive/release file naming and file content, which I think is great but for a later phase indeed. What I was referring to is different:
by merging release.geneontology.org and this archive now, we are mixing both the archive and DOI releases (2018+); newer releases have more files/folders - how do we explain/handle that for users ? Just a mention in the doc ? A simple filtering system that by default would only show "public" folders (annotations, ontology, products) ?
we call that "archive" but it will contain everything, including the current release; the refactoring of the downloads is the way to go, but realistically I don't see that happening before a year. In the meantime, I was suggesting to create a simple "download" button on the main page of the GO site to help users get a simple direct access to a release files via that system. I don't think adding a link is rescoping the project and I genuinely think it would help users, but up to you
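The default filtering idea raised above could look roughly like the sketch below. It is only an illustration: the `show_all` URL parameter and the non-public folder names (`bin/`, `metadata/`, `reports/`) are hypothetical, and the actual explorer page is JavaScript, so this shows the logic rather than the implementation.

```python
from urllib.parse import urlparse, parse_qs

# Folders named in the discussion as the "public" ones.
PUBLIC_FOLDERS = ("annotations", "ontology", "products")

def visible_folders(folders, page_url):
    """Return the folders to list for a release page.

    The hypothetical '?show_all=1' URL parameter disables the filter.
    """
    query = parse_qs(urlparse(page_url).query)
    if query.get("show_all") == ["1"]:
        return list(folders)
    return [f for f in folders if f.strip("/") in PUBLIC_FOLDERS]

# Hypothetical folder listing for a 2018+ release.
release = ["annotations/", "bin/", "metadata/", "ontology/", "products/", "reports/"]
base = "http://release.geneontology.org/2018-03-02/index.html"
default_view = visible_folders(release, base)
full_view = visible_folders(release, base + "?show_all=1")
```

With this approach regular users see only the three public folders, while pipeline users can still reach everything by adding the parameter.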
@lpalbou As part of the proposed steps above, I'm not seeing what the incremental instructions are? On every release, we'll need to have the index injected into a specific path in the release bucket, as well as current.
@lpalbou Possibly related, shall I parameterize https://github.com/lpalbou/aws-js-s3-explorer/blob/master/s3-add-index.py so that we can access different buckets (release, current, snapshot, and experimental), or is that something that you'd want to take care of?
@kltm I didn't include a step for next releases, but for now, if you just re-run the s3-add-index script at every release (once the S3 upload is done), it would work; not super efficient as it's re-uploading the index to every folder, but that would do the trick for now and it only takes a few minutes.
Possibly related, shall I parameterize
not exactly related indeed but sure we can parametrize that - I'll do an update today but that shouldn't block you in your test right now ?
Ok, I made it more generic in case you need it: https://github.com/lpalbou/aws-js-s3-explorer/blob/master/s3-add-file.py
Usage example:
python s3-add-file.py -i index.html -o geneontology-test
-i is input file (local or absolute path) -o is your s3 bucket name without the s3://
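For readers wondering what "adding the index to every folder" involves, here is a minimal sketch of the key computation only. It is not the code of s3-add-file.py; the function names are hypothetical, and the actual listing and upload calls to S3 are left out so the example runs without AWS credentials.

```python
def folder_prefixes(keys):
    """Return every "folder" prefix implied by a list of S3 object keys.

    S3 has no real folders, so prefixes are derived from the '/' separators.
    The bucket root is represented by the empty string.
    """
    prefixes = {""}
    for key in keys:
        parts = key.split("/")[:-1]  # drop the file name
        for depth in range(1, len(parts) + 1):
            prefixes.add("/".join(parts[:depth]) + "/")
    return sorted(prefixes)

def index_targets(keys, index_name="index.html"):
    """Return the object keys where a copy of the index file would be written."""
    return [prefix + index_name for prefix in folder_prefixes(keys)]

keys = [
    "2018-03-02/annotations/sgd.gaf.gz",
    "2018-03-02/ontology/go.obo",
    "notes.txt",
]
targets = index_targets(keys)
```

Each entry in `targets` is a key the same index.html would be uploaded to, which is why re-running the script after every release touches every folder.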
@lpalbou Cheers, I'm working on testing the copying now. So, things are working so far in mock:
aws s3 ls s3://geneontology-test/2018-03-02/
aws s3 ls s3://go-data-testing-sandbox
aws s3 sync s3://geneontology-test/2018-03-02 s3://go-data-testing-sandbox
Gives: https://go-data-testing-sandbox.s3.amazonaws.com/index.html So it seems to be working well. I've run into two quirks. The first is that some SVN artifacts seem to be in there (something that maybe can be cleaned up later):
copy: s3://geneontology-test/2018-03-02/annotations/gp2protein/.svn/pristine/af/afe4a0d4b4fca7e65bbb189151fa5c27ff2f08a8.svn-base to s3://go-data-testing-sandbox/annotations/gp2protein/.svn/pristine/af/afe4a0d4b4fca7e65bbb189151fa5c27ff2f08a8.svn-base
the second is that, so far, the permissions of the copied objects seem to be more restrictive than the bucket--even though the bucket is public, the objects are not. I'm obviously looking at awsclient here; do you have any recommendations for command line options or a different client? The command as above does not seem to work for bulk copy as I'd expect as far as permissions go.
For the third item above in your list at the top, you recommend using an EC2 instance? Is the sync command not just going bucket to bucket?
(For later, but it looks like the Content-Type is pretty much compressed to binary/octet-stream instead of text/obo (obo) or application/rdf+xml (owl). Perhaps a once-over uplift in the future.)
I'm having better luck with the permissions with:
aws s3 sync --acl public-read s3://geneontology-test/2018-03-02 s3://go-data-testing-sandbox
Looking at
https://go-data-testing-sandbox.s3.amazonaws.com/index.html
and
https://go-data-testing-sandbox.s3.amazonaws.com/ontology/index.html
the former has no index.html listed and the latter does. A small thing and maybe that changes with the giant indexing that comes later.
SVN artifacts
I thought I had removed all of them; I will double-check that
so far, the permissions of the copied objects seem to be more restrictive than the bucket
Are you speaking of the copied index.html or the whole archive copy ? I think it's more a question of how your target bucket is configured / handles new objects. Example of bucket policy:
{
"Version": "2012-10-17",
"Id": "Policy1547524097405",
"Statement": [
{
"Sid": "Stmt1547524091089",
"Effect": "Allow",
"Principal": "*",
"Action": "s3:GetObject",
"Resource": "arn:aws:s3:::geneontology-test/*"
}
]
}
Be sure to also have the Bucket ACL: http://acs.amazonaws.com/groups/global/AllUsers = Objects List; Bucket ACL = Read
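Since the same policy has to be applied to several buckets, a small helper can generate it per bucket. This is a sketch under assumptions: the function name and the `Sid` value are made up for illustration, and the resulting JSON would still need to be applied through the console or the AWS CLI.

```python
import json

def public_read_policy(bucket):
    """Build a bucket policy that lets anyone read objects in `bucket`."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "PublicReadGetObject",  # arbitrary label, hypothetical
                "Effect": "Allow",
                "Principal": "*",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
            }
        ],
    }

policy_json = json.dumps(public_read_policy("go-data-testing-sandbox"), indent=2)
```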
I am also gonna try that at the same time on a newly created bucket to see the behavior
you recommend using an EC2 instance
It was if you had to copy files from a local drive, then to speed up data transfer, I would use EC2; if you are copying S3 to S3 from command line, it should be fine
Content-Type
It's a little complicated to explain in writing. Long story short, S3 supports storing compressed files and uncompressing them on the fly, so I always store compressed files, as this is seamless for end users (the file is automatically uncompressed client side without them being aware of it, and your files don't even need a compressed extension like .gz or .zip etc); it dramatically saves space and increases download speed, so that's a feature I would recommend using everywhere
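One way to reconcile compressed storage with meaningful media types is to set both headers at upload time: `Content-Encoding: gzip` tells the client to decompress, while `Content-Type` can still describe the uncompressed payload. The sketch below only computes the metadata; the mapping and function name are assumptions, and the dictionary keys happen to match the `ContentType` / `ContentEncoding` parameter names of boto3's `put_object`.

```python
import gzip
import posixpath

# Media types mentioned in the thread; the mapping itself is an assumption.
CONTENT_TYPES = {".obo": "text/obo", ".owl": "application/rdf+xml"}

def upload_metadata(key):
    """Suggest metadata for an object stored gzip-compressed under its plain name."""
    suffix = posixpath.splitext(key)[1]
    return {
        "ContentType": CONTENT_TYPES.get(suffix, "binary/octet-stream"),
        "ContentEncoding": "gzip",
    }

# The body is compressed before upload; clients that honour Content-Encoding
# decompress it transparently.
body = gzip.compress(b"format-version: 1.2\n")
meta = upload_metadata("ontology/go.obo")
```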
@lpalbou Okay, progress coming along here. I'm now set up in a way that should work pretty well for doing the final work.
Limited testing with:
aws s3 sync --exclude "*" --include "2004-03-01/*" --acl public-read s3://geneontology-test s3://go-data-testing-sandbox
Seems to give good results.
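If the same copy has to be repeated release by release, the command above can be assembled programmatically. This is a sketch only: the helper name is hypothetical, and it builds the argument list without running anything (pass it to `subprocess.run` to execute).

```python
import shlex

def sync_command(source_bucket, target_bucket, release, acl="public-read"):
    """Build the `aws s3 sync` argument list that copies a single release folder."""
    return [
        "aws", "s3", "sync",
        "--exclude", "*",              # start by excluding everything
        "--include", f"{release}/*",   # then re-include the one release
        "--acl", acl,                  # make the copied objects publicly readable
        f"s3://{source_bucket}", f"s3://{target_bucket}",
    ]

cmd = sync_command("geneontology-test", "go-data-testing-sandbox", "2004-03-01")
printable = shlex.join(cmd)  # shell-quoted form, handy for logging
```

Keeping the arguments as a list avoids shell-quoting mistakes around the `*` patterns.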
For the Content-Type, we might want to revisit exactly what's going on there, as some (semantic) web applications/ontology tools use that as a hint to do the "right" thing. I'm not sure we'll want that overridden, but we can punt for now and come back to that later on.
I'm going to move on to trying the initial one (2004-03-01) in the release bucket and see what happens with CF, etc.
@lpalbou Okay, I may have run into an issue when combining this with the CDN? As a test of our release setup, we have the experimental bucket fronted by the experimental CDN:
aws s3 sync --exclude "*" --include "2004-03-01/*" --acl public-read s3://geneontology-test s3://go-data-product-experimental
should be exposed at:
http://experimental.geneontology.io/2004-03-01/index.html
However, it seems to actively be searching for a bucket? I'm guessing that it was grabbing the bucket from the URL then. Is this something that the final indexer would be taking care of?
From off github: the bucket name is determined from the URL; if using an alias/cname, then the bucket name can not be inferred, so it has to be encoded when calling the python script.
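The limitation described above can be illustrated with a short sketch. It assumes the simple `<bucket>.s3.amazonaws.com` host form seen in this thread (regional endpoints are not handled), and the function names are hypothetical; the real index page does this in JavaScript.

```python
from urllib.parse import urlparse

S3_SUFFIX = ".s3.amazonaws.com"

def bucket_from_url(url):
    """Infer the bucket name from a virtual-hosted S3 URL, or return None."""
    host = urlparse(url).hostname or ""
    if host.endswith(S3_SUFFIX):
        return host[: -len(S3_SUFFIX)]
    return None  # alias/CNAME: the bucket cannot be inferred from the host

def resolve_bucket(url, configured_bucket=None):
    """Prefer a bucket name written into the index; fall back to the URL."""
    return configured_bucket or bucket_from_url(url)

direct = bucket_from_url("https://go-data-testing-sandbox.s3.amazonaws.com/index.html")
aliased = bucket_from_url("http://experimental.geneontology.io/2004-03-01/index.html")
fixed = resolve_bucket(
    "http://experimental.geneontology.io/2004-03-01/index.html",
    configured_bucket="go-data-product-experimental",
)
```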
@kltm you can still proceed with the copy, as we can update the index.html afterwards. I will try to get a fix out for tomorrow and will also test possible side effects.
Ok, @kltm I created a similar S3/CF/Route53 archi on release.geneontology.xyz with just two releases to test and it should now work with your archi. See the URL below served by Route53 -> CF -> S3:
Notes:
s3-add-file was renamed to s3_add_file to follow Python conventions and allow for the import in a new script, s3_add_index:
python s3_add_index.py -o geneontology-test
where -o is your target bucket name; you don't need to specify the index.html anymore as the script is writing the bucket name inside the index before uploading it to all S3 "folders" of your bucket.
Let me know if you encounter any other issue.
From conversation with @lpalbou need the following on the CDN upstream bucket:
[
{
"AllowedHeaders": [
"*"
],
"AllowedMethods": [
"HEAD",
"GET"
],
"AllowedOrigins": [
"*"
],
"ExposeHeaders": []
}
]
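To reason about whether a given browser request would pass a rule like the one above, a rough checker can help. This is a simplified approximation, not S3's actual matching algorithm: it uses shell-style wildcards via `fnmatchcase`, and the function name is hypothetical.

```python
from fnmatch import fnmatchcase

def rule_allows(rule, origin, method, request_headers=()):
    """Approximate check of one CORS rule against a cross-origin request."""
    origin_ok = any(
        fnmatchcase(origin, pattern) for pattern in rule.get("AllowedOrigins", [])
    )
    method_ok = method in rule.get("AllowedMethods", [])
    # Every requested header must match at least one allowed pattern.
    headers_ok = all(
        any(fnmatchcase(h.lower(), p.lower()) for p in rule.get("AllowedHeaders", []))
        for h in request_headers
    )
    return origin_ok and method_ok and headers_ok

rule = {
    "AllowedHeaders": ["*"],
    "AllowedMethods": ["HEAD", "GET"],
    "AllowedOrigins": ["*"],
    "ExposeHeaders": [],
}
get_ok = rule_allows(rule, "http://experimental.geneontology.io", "GET", ["Authorization"])
put_ok = rule_allows(rule, "http://experimental.geneontology.io", "PUT")
no_headers_rule = dict(rule, AllowedHeaders=[])
blocked = rule_allows(no_headers_rule, "http://experimental.geneontology.io", "GET", ["Authorization"])
```

The last case shows how a rule with no allowed headers rejects a request that sends one, even when origin and method match.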
Okay, making progress, but have run into another hiccup. I'm guessing with some of the settings? On http://experimental.geneontology.io/ I'm now getting:
Error accessing S3 bucket go-data-product-experimental. Error: NetworkingError: Network Failure
on root and index.html pages, after running python3 s3_add_index.py -o go-data-product-experimental (underlying bucket).
CORS settings are as above; access "public"; CDN contents invalidated... Content seems available and public (e.g. http://experimental.geneontology.io/notes.txt, http://experimental.geneontology.io/2004-03-01/annotations/sgd.gaf.gz). I'll dig more into this tomorrow and see if there are any leads.
Looking at the error log:
It seems you haven't given CORS access to the bucket go-data-product-experimental - see https://github.com/geneontology/archive-reconstruction/issues/9#issuecomment-742932145
Also from the bucket itself, it's working: https://go-data-product-experimental.s3.amazonaws.com/index.html
For completeness, here are the configs of the quick bucket I created yesterday to fit your Route53 -> CF -> S3 config:
If you do have the CORS, then I guess you may have forgotten to also allow the ACL public list/read as discussed yesterday ?
Okay, after reworking and tweaking things for way too long, it turns out that the CORS settings we have for what's needed for CF are not sufficient for bucket access (naturally, in retrospect) and I kept overlooking them (ugh). Specifically, part of AllowedHeaders.
http://experimental.geneontology.io/ now looking pretty good to me.
Quick question as I convert other buckets over: is Everyone | Bucket ACL | Read necessary for something, or just a quirk of your setup? It seems to work fine so far without, but I have not fully explored yet.
CORS: yep, it's one of the most common issues on the web I guess.
now looking pretty good to me
I gave it a quick look too and both the browsing and downloading were working.
Everyone | Bucket ACL | Read
It was actually for another piece of code, so the only mandatory one here is indeed Everyone | Objects | List, which is used by the page to list the "folders" and files of the bucket.
Let me know when everything is pushed to release.go.org so that I update the GO site menu/archive page.
Just one thing though, I would highly suggest that release.geneontology.org be deployed as https; from a CF point of view, this should be quite easy as you can use a certificate generated by AWS since geneontology.org is managed by route53 (e.g. that's what I did in my example above with https://release.geneontology.xyz/index.html).
We really need to make progress on that and we need to start somewhere, otherwise it will keep on creating issues (e.g. I initially started the gocam api because the golr endpoint is http only) and in the meantime, external sites can not use or link our resources as most of them are https (it's actually even a requirement for publication now)
release.go.org now has the historical files contained, but not migrated and still using the old capping.
Before I plow the rest through, there was one oddity in the way paths seem to be working. If I go to, for example, http://release.geneontology.org/2004-03-01/ , it gives me the root directory, with all links from that point on shifted (and not working). This seems to have also been true in experimental.geneontology.io, but I did not catch it.
The UI was designed to work on full URL, e.g. http://release.geneontology.org/2004-03-01/index.html, not http://release.geneontology.org/2004-03-01/. From our discussion, you were supposed to handle those redirects with AWS.
If this is a new requirement on the UI side, I can look into it but this does not affect external users as they will be redirected to the correct URL from the GO site. We can easily iterate on this later rather than delaying the deployment.
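For reference, the rewrite needed for directory-style URLs is small. The sketch below shows only the path logic, with a hypothetical function name; where it would actually run (a CDN-side function, an S3 website redirect rule, or the page's own JavaScript) is left open, as it was in the discussion.

```python
def with_index(path, index_name="index.html"):
    """Append the index file name to directory-style paths; leave files alone."""
    if not path:
        path = "/"
    return path + index_name if path.endswith("/") else path

folder = with_index("/2004-03-01/")
already_full = with_index("/2004-03-01/index.html")
data_file = with_index("/2004-03-01/annotations/sgd.gaf.gz")
root = with_index("")
```

Note that a path such as `/2004-03-01` without a trailing slash is left untouched here, since it cannot be told apart from a file name without asking the bucket.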
In addition and as already discussed, it would be useful to share some AWS resources (see https://docs.aws.amazon.com/IAM/latest/UserGuide/tutorial_cross-account-with-roles.html), otherwise I have to guess parts of your architecture and configurations. This would avoid these last-minute oddities.
@lpalbou Yeah, we can iterate on this under https://github.com/geneontology/pipeline/issues/203 or a new issue for remaining items--definitely something I want included, but not directly tied to getting this into production.
The "final" indexing is now running (just gunna do the whole thing again), which should make it visible pretty soon. The top-level and future releases will reset to the current template (what's in the pipeline now), but that's over at https://github.com/geneontology/pipeline/issues/203 .
Checking off items above.
@kltm not yet visible on http://release.geneontology.org/index.html . Did you remove your path filter for the tests ? When it's up, I will update the link to the archive and documentation.
For the oddity, I'll try to fix that after this and the GO sparql endpoints are online, but that will be a guess since I don't have access to your config.
@lpalbou Nah, was just hoping that the CDN would cache out. I just gave it a manual poke--it looks like it's switched now. For the rest, we can refresh in the new year.
I put a temporary doc page up to describe and link the archive: http://geneontology.org/docs/go-archives/
If that sounds good, I will update and replace the old http://geneontology.org/docs/archives/
@thomaspd @cmungall @pgaudet
Since http://release.geneontology.org contains both the archive and the doi releases, and it's now much more user friendly, I would recommend to create a "Download" button on the main page so that users get a direct access to files.
@kltm it seems http://current.geneontology.org is a different S3, so the index.html would need to be updated there too.
I put a temporary doc page up to describe and link the archive: http://geneontology.org/docs/go-archives/
Some comments:
' The Gene Ontology consortium has released in December 2020 a comprehensive archive covering the past 15 years of GO data with a consistent format and file structure for both the ontology and annotations.
could be changed to
Comprehensive GO archive of the ontology and annotations from 2004. (If you want we can add "Note that this replaces the former CVS, SVN and product archives." - although I don't know what 'product archives' are)
Archive content and consistency The GO archive contains the monthly releases built from 2004 to Feb 2018 with the deprecated GO CVS, SVN and product archives. The archive also contains all the GO DOI monthly releases (start in March 2018). Each monthly release was built using the same folder hierarchy as our current GO DOI releases:
and the two screenshots. A single screenshot for the current content would be fine.
(comments on the description of contents coming in the next comment)
Suggestions for folders descriptions:
annotations/ : GO annotations as GAF files (2004-current), and additionally GPAD and GPI files from March, 2018 [with the GO DOI releases -> what does that mean?]
annotations/gp2protein/ (*) : mappings of contributing group IDs (usually MOD ids) for protein gene products to UniProtKB accession numbers, from 2004 to Feb 2018
annotations/gp2rna/ (*): mappings of contributing group IDs (usually MOD ids) for non-coding RNAs products to RNACentral IDs, from 2004 to Feb 2018 [really?]
ontology/ : GO ontology as .obo and .owl files. More information on the various ontology files can be found here: http://geneontology.org/docs/download-ontology/
[- users are recommended to use ontology/go.obo if they don’t need to go back further than March 2009 and ontology/gene_ontology.obo (old obo format) if they need to go back to the beginning of the archive] -> We need to figure out the dates when the obo format was changed. Both March and April 2009 have obo 1.0 in the header. The current version is 1.4. When did it change ? Is it possible to parse all headers ?
ontology/extensions/ (**): contains the various ontologies imported or produced by GO from May 2015.
ontology/external2go/ : mapping of GO terms to different resources, including InterPro, Rhea, KEGG, and Reactome.
ontology/subsets/ (***) : contains the GO subsets (also known as slims) used to simplify the ontology for specific purposes (e.g. goslim_synapse) or organisms (e.g. goslim_pombe) - we recommend to use .obo files rather than old deprecated .go files -> do we always have both formats ?
mysql_dumps/ (**) : contains the MySQL dumps of GO (e.g. -assocdb , -termdb), from May 2015.
products/annotations : GO annotations files provided by the contributing groups. Those files are kept for transparency but users are recommended to use the GO annotations in the annotations/ folder, as they can differ due to different version of the ontology, as well as various filtering and checks performed by the GO consortium to ensure quality.
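On the question of parsing all obo headers to find when the format changed: the header check itself is straightforward, as sketched below. The function names are hypothetical, fetching each release's file is left out, and the release labels and versions in the example are placeholders, not real archive data.

```python
def obo_format_version(text):
    """Return the `format-version` value from an OBO header, or None if absent."""
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("["):  # first stanza: the header is over
            break
        if line.startswith("format-version:"):
            return line.split(":", 1)[1].strip()
    return None

def version_changes(versions_by_release):
    """List (release, version) pairs where the version differs from the previous release."""
    changes, previous = [], None
    for release in sorted(versions_by_release):
        version = versions_by_release[release]
        if version != previous:
            changes.append((release, version))
            previous = version
    return changes

sample = "format-version: 1.0\n\n[Term]\nid: GO:0000001\n"
version = obo_format_version(sample)

# Placeholder labels and values, for illustration only.
changes = version_changes({"release-1": "1.0", "release-2": "1.0", "release-3": "1.2"})
```

Running `obo_format_version` over the ontology file of every release and feeding the results to `version_changes` would give the transition dates directly.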
Awesome that we're almost there :)
Thanks, Pascale
@pgaudet I will include your changes.
It would be a good opportunity to rename some folders - especially /extensions and /products/annotations that are extremely confusing.
Those folders are currently used by the GO DOI releases, if we change them, we have to change them for both the archive and DOI releases. You know I find it unintuitive to have input annotations in a products/ folder, but this now seems out of scope for a project at the end. Quick fix: just hide the /products/ folder from every releases. Otherwise, this could be done when we refactor the downloads by species.
@kltm I simulated what I believe is your config of the GO AWS release.geneontology.org to test the issue when index.html is not automatically added to the URL: https://github.com/lpalbou/aws-js-s3-explorer/pull/1 . It also updates the font-style to match the one of the GO site.
To update the index with those fixes for the full archive, it's like the last time, you just have to run with your AWS auth:
python3 s3_add_index.py -o <s3_repo_name>
FYI, it's working on both my S3 and CF:
Thanks @lpalbou !
Those folders are currently used by the GO DOI releases, if we change them, we have to change them for both the archive and DOI releases. You know I find it unintuitive to have input annotations in a products/ folder, but this now seems out of scope for a project at the end.
OK, sounds good. I am aware it is out of scope, but it is always going to be painful to change it. We can add this as a task for the refactoring of the downloads by species (but we can also argue that it's out of scope ;)
Quick fix: just hide the /products/ folder from every releases.
Do you mean the whole /products folder ? Or products/annotations ? I would be in favor of this. Do you have any idea if anyone is using this ? @kltm @thomaspd @cmungall what do you think ?
Thanks, Pascale
With the last release, I've re-indexed release.geneontology.org--it's looking good to me!
/products is currently used for pickups by various groups for noctua-only and prediction products.
I released the archive taking into account the above comments: https://github.com/geneontology/geneontology.github.io/pull/279
It is now accessible from the go website with the menu downloads/GO archive. If there are other issues, please create another ticket as this one was to put the archive in production.
Regarding obo files in the archive and newer releases:
Regarding the presentation of the archive:
With the upcoming publication of the GO NAR database article, we need to put this in production sooner rather than later.
I double-checked the archive again and everything looks good.
Proposed steps:
Notes: