Archive in production - Githubissues

lpalbou commented 3 years ago

With the upcoming publication of the GO NAR database article, we need to put this in production sooner rather than later.

I double-checked the archive again and everything looks good.

Proposed steps:

[x] on this archive, the last release is 2018-03-02; on release.geneontology.org, there are already two releases (2018-03-06 and 2018-03-10) for that month, so possibly @kltm you want to skip the last release of my archive
[x] @kltm copy 1 release and double check everything is good with the new index.html, ping me if issues
[x] @kltm copy the content of s3://geneontology-test/ to the official GO S3 bucket - I advise doing the transfer from an EC2 as with the mysql dumps, it's about 1TB
[x] @kltm use the script here to add the index.html to all releases, including the newer releases still using the old index.html. As a reminder, you have to set the bucket name here https://github.com/lpalbou/aws-js-s3-explorer/blob/master/s3-add-index.py#L5 as I didn't create a command line parameter
[x] @lpalbou create a doc page on the go site to describe the content of the archive and some underlying details (eg CVS up to 2012, SVN afterwards), etc (temporary URL for test: https://geneontology.github.io/docs/go-archives/)
[x] @lpalbou wait for validation of page and modify the GO site menu to provide a clear access to this archive. Note that if we do merge it to release.geneontology.org, it will also contain the newer releases. @pgaudet suggestions ?

Notes:

the GO archive without mysql dumps is about 170 GB, and about 1 TB with the mysql dumps. If people start downloading those mysql dumps massively, this could have a significant AWS cost; I have billing alarms for our USC account, and if we don't have that for the GO account, I would suggest creating one, just to avoid a bad surprise
the merging of this archive with release.geneontology.org could be a little unsettling as there are more folders in the current releases (eg bin, meta, etc), so something to mention in the doc page

pgaudet commented 3 years ago

modify the GO site menu to provide a clear access to this archive. Note that if we do merge it to release.geneontology.org, it will also contain the newer releases. @pgaudet suggestions ?

Do you want suggestions about the GO site menu ? I suppose you mean geneontology.org ? We already have a menu, 'Archived data', that seems appropriate ?

Note that if we do merge it to release.geneontology.org, it will also contain the newer releases.

How will this look like in practice ? If this is all the data we can merge 'Data' and 'Archive'.

I hope I understood your question.

Thanks, Pascale

lpalbou commented 3 years ago

Whenever I mention GO site, I mean geneontology.org yes. At the GO meeting, Paul presented another menu design to access the archive, but ok to keep it that way. The whole page has to be rewritten but I can do that.

How will this look like in practice ?

For 2018-02 (built from archive):

For 2018-03 (same content as what we have on release.geneontology.org, just new design - quick mockup):

If we move on with this plan of release.geneontology.org containing both the archive and "newer" releases (2018+), I think this should become the general quick access to GO download; currently the GO site is IMO too complicated to get to your file. In the initial mockup we wanted a "DOWNLOAD" button on the main page, something nice and easy, maybe that's the occasion.

Note: If we want consistency and simplify file access for the 2018+ releases, I could possibly filter out the folders we don't want for these releases and provide a parameter (eg URL parameter) to show them for @kltm usages.

pgaudet commented 3 years ago

I thought improving the download page was out of scope for this project, and that we'd do it at a future iteration, together with

single file per species
go reference species

Otherwise we have to make a change now and another change later.

Am I misremembering ?

Thanks, Pascale

lpalbou commented 3 years ago

What you are referring to is a substantial refactoring of the archive/release file naming and file content, which I think is great but for a later phase indeed. What I was referring to is different:

by merging release.geneontology.org and this archive now, we are mixing both the archive and DOI releases (2018+); newer releases have more files/folders - how do we explain/handle that for users ? Just a mention in the doc ? A simple filtering system that by default would only show "public" folders (annotations, ontology, products) ?
we call that "archive" but it will contain everything, including the current release; the refactoring of the downloads is the way to go, but realistically I don't see that happening before a year. In the meantime, I was suggesting to create a simple "download" button on the main page of the GO site to give users simple, direct access to a release's files via that system. I don't think adding a link is rescoping the project and I genuinely think it would help users, but up to you

kltm commented 3 years ago

@lpalbou As part of the proposed steps above, I'm not seeing what the incremental instructions are? On every release, we'll need to have the index injected into a specific path in the release bucket, as well as current.

kltm commented 3 years ago

@lpalbou Possibly related, shall I parameterize https://github.com/lpalbou/aws-js-s3-explorer/blob/master/s3-add-index.py so that we can access different buckets (release, current, snapshot, and experimental), or is that something that you'd want to take care of?

lpalbou commented 3 years ago

@kltm I didn't include a step for next releases, but for now, if you just re-run the s3-add-index script at every release (once the S3 upload is done), it would work; not super efficient as it's re-uploading the index to every folder, but that would do the trick for now and it only takes a few minutes.

Possibly related, shall I parameterize

not exactly related indeed but sure we can parametrize that - I'll do an update today but that shouldn't block you in your test right now ?
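As a rough illustration of what re-running the index script at every release involves, here is a minimal sketch that derives every "folder" prefix from a flat list of object keys and lists the index.html keys that would be (re)uploaded. The function names and the sample keys are assumptions made for illustration; this is not the code of s3-add-index.py.

```python
from typing import Iterable, List, Set


def folder_prefixes(keys: Iterable[str]) -> List[str]:
    """Return every "folder" prefix implied by a flat list of S3 object keys.

    S3 has no real directories, so each parent path of each key is treated
    as a folder. The empty string stands for the bucket root.
    """
    prefixes: Set[str] = {""}
    for key in keys:
        parts = key.split("/")[:-1]  # drop the file name
        for depth in range(1, len(parts) + 1):
            prefixes.add("/".join(parts[:depth]) + "/")
    return sorted(prefixes)


def index_keys(keys: Iterable[str], index_name: str = "index.html") -> List[str]:
    """Keys under which a copy of the index page would be uploaded."""
    return [prefix + index_name for prefix in folder_prefixes(keys)]


if __name__ == "__main__":
    # Hypothetical sample keys, only for demonstration.
    sample = [
        "2018-03-02/annotations/sgd.gaf.gz",
        "2018-03-02/ontology/go.obo",
    ]
    for key in index_keys(sample):
        print(key)
```

Because every prefix gets a fresh copy on each run, the cost grows with the number of folders rather than with the number of new releases, which matches the "not super efficient" remark above.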

lpalbou commented 3 years ago

Ok, I made it more generic in case you need it: https://github.com/lpalbou/aws-js-s3-explorer/blob/master/s3-add-file.py

Usage example:

python s3-add-file.py -i index.html -o geneontology-test

-i is the input file (relative or absolute path)
-o is your S3 bucket name, without the s3:// prefix
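For readers who want to see how such a command line could be wired up, here is a minimal sketch of an argument parser accepting the same -i and -o options. The long option names, the help strings and the tolerance for a leading s3:// are assumptions for illustration, not taken from s3-add-file.py.

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse -i (input file) and -o (target bucket), as in the usage example."""
    parser = argparse.ArgumentParser(
        description="Upload one file to every folder of an S3 bucket (sketch)."
    )
    # Long option names below are hypothetical.
    parser.add_argument("-i", "--input", required=True,
                        help="path of the file to upload")
    parser.add_argument("-o", "--output", required=True,
                        help="target bucket name, without s3://")
    args = parser.parse_args(argv)
    # Be lenient (an assumption): accept "s3://name/" as well as "name".
    if args.output.startswith("s3://"):
        args.output = args.output[len("s3://"):]
    args.output = args.output.rstrip("/")
    return args


if __name__ == "__main__":
    demo = parse_args(["-i", "index.html", "-o", "geneontology-test"])
    print(demo.input, demo.output)
```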

kltm commented 3 years ago

@lpalbou Cheers, I'm working on testing the copying now. So, things are working so far in mock:

aws s3 ls s3://geneontology-test/2018-03-02/
aws s3 ls s3://go-data-testing-sandbox
aws s3 sync s3://geneontology-test/2018-03-02 s3://go-data-testing-sandbox

Gives: https://go-data-testing-sandbox.s3.amazonaws.com/index.html

So it seems to be working well. I've run into two quirks. The first is that some SVN artifacts seem to be in there (something that maybe can be cleaned up later):

copy: s3://geneontology-test/2018-03-02/annotations/gp2protein/.svn/pristine/af/afe4a0d4b4fca7e65bbb189151fa5c27ff2f08a8.svn-base to s3://go-data-testing-sandbox/annotations/gp2protein/.svn/pristine/af/afe4a0d4b4fca7e65bbb189151fa5c27ff2f08a8.svn-base

The second is that, so far, the permissions of the copied objects seem to be more restrictive than the bucket--even though the bucket is public, the objects are not. I'm obviously looking at awsclient here; do you have any recommendations for command line options or a different client? The command as above does not seem to work for bulk copy as I'd expect as far as permissions go.

For the third item above in your list at the top, you recommend using an EC2 instance? Is the sync command not just going bucket to bucket?

kltm commented 3 years ago

(For later, but it looks like the Content-Type is pretty much compressed to binary/octet-stream instead of text/obo (obo) or application/rdf+xml (owl). Perhaps a once-over uplift in the future.)

kltm commented 3 years ago

I'm having better luck with the permissions with:

aws s3 sync --acl public-read s3://geneontology-test/2018-03-02 s3://go-data-testing-sandbox

Looking at https://go-data-testing-sandbox.s3.amazonaws.com/index.html and https://go-data-testing-sandbox.s3.amazonaws.com/ontology/index.html the former has no index.html listed and the latter does. A small thing and maybe that changes with the giant indexing that comes later.
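For anyone scripting the copy instead of using the CLI, here is a minimal sketch of a bucket-to-bucket copy that sets a public-read ACL on each object. The helper name is hypothetical, and the client is passed in as a parameter; it is assumed to behave like a boto3 S3 client (for example boto3.client("s3")), whose copy_object call accepts the keyword arguments used here.

```python
from typing import Any, Iterable


def copy_public(client: Any, src_bucket: str, dst_bucket: str,
                keys: Iterable[str]) -> int:
    """Copy objects bucket to bucket and mark each copy as publicly readable.

    `client` is assumed to expose copy_object like a boto3 S3 client.
    Returns the number of objects copied.
    """
    copied = 0
    for key in keys:
        client.copy_object(
            Bucket=dst_bucket,
            Key=key,
            CopySource={"Bucket": src_bucket, "Key": key},
            ACL="public-read",  # per-object ACL, in the spirit of --acl public-read
        )
        copied += 1
    return copied
```

Unlike the sync command above, this sketch keeps the same key in the destination bucket and does no change detection, so it is only meant to show where the ACL is applied.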

lpalbou commented 3 years ago

SVN artifacts

I thought I had removed all of them; I will double-check that

so far, the permissions of the copied objects seem to be more restrictive than the bucket

Are you speaking of the copied index.html or the whole archive copy ? I think it's more a question of how your target bucket is configured / handles new objects. Example of bucket policy:

{
    "Version": "2012-10-17",
    "Id": "Policy1547524097405",
    "Statement": [
        {
            "Sid": "Stmt1547524091089",
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::geneontology-test/*"
        }
    ]
}

Be sure to also have the Bucket ACL: http://acs.amazonaws.com/groups/global/AllUsers = Objects List; Bucket ACL = Read
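Since the policy above hard-codes the bucket name, a small helper can generate the same statement for any target bucket. This is a minimal sketch; the function name and the Sid value are assumptions, while the rest mirrors the example policy.

```python
import json


def public_read_policy(bucket: str) -> dict:
    """Build a bucket policy allowing anonymous GetObject on every object."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "PublicReadGetObject",  # arbitrary label (assumption)
                "Effect": "Allow",
                "Principal": "*",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(public_read_policy("go-data-testing-sandbox"), indent=4))
```

The printed JSON can be pasted into the bucket policy editor; note that it only covers reading objects, not the listing permission mentioned above.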

I am also gonna try that at the same time on a newly created bucket to see the behavior

you recommend using an EC2 instance

It was if you had to copy files from a local drive, then to speed up data transfer, I would use EC2; if you are copying S3 to S3 from command line, it should be fine

Content-Type

It's a little complicated to explain in writing. Long story short, S3 supports storing compressed files and having them decompressed on the fly, so I always store files compressed, as this is seamless for end users (the file is automatically decompressed client-side without them being aware of it, and the files don't even have a compression extension like .gz or .zip); it dramatically saves space and increases download speed, so that's a feature I would recommend using everywhere
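One common way to get this behavior is to upload gzip-compressed bytes together with a Content-Encoding: gzip header, so that browsers decompress transparently. The sketch below builds the upload parameters that way; whether the archive was uploaded exactly like this is an assumption, as are the helper name and the extension table (the two media types are the ones mentioned earlier in this thread). The keyword names match those accepted by a boto3 S3 client's put_object.

```python
import gzip
import os

# Media types suggested earlier in the thread; other extensions fall back
# to a generic binary type (assumption).
CONTENT_TYPES = {".obo": "text/obo", ".owl": "application/rdf+xml"}


def upload_kwargs(bucket: str, key: str, raw: bytes) -> dict:
    """Build put_object keyword arguments for a transparently compressed object."""
    ext = os.path.splitext(key)[1].lower()
    return {
        "Bucket": bucket,
        "Key": key,  # note: no .gz suffix is added to the key
        "Body": gzip.compress(raw),
        "ContentEncoding": "gzip",  # tells HTTP clients to decompress
        "ContentType": CONTENT_TYPES.get(ext, "binary/octet-stream"),
    }
```

Keeping Content-Type tied to the real format while Content-Encoding carries the compression is one way to keep both the space savings and the media-type hints that some tools rely on.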

kltm commented 3 years ago

@lpalbou Okay, progress coming along here. I'm set up now in a way that should be pretty good for doing the final work.

Limited testing with:

aws s3 sync --exclude "*" --include "2004-03-01/*" --acl public-read s3://geneontology-test s3://go-data-testing-sandbox

Seems to give good results.
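The --exclude "*" --include "2004-03-01/*" pair works because, in the AWS CLI, everything is included by default and filters that appear later on the command line take precedence over earlier ones. Here is a minimal sketch of that rule using Python's fnmatch; the function name is hypothetical and this is an approximation of the CLI behavior, not its implementation.

```python
from fnmatch import fnmatchcase
from typing import Iterable, Tuple

Filter = Tuple[str, str]  # ("include" or "exclude", glob pattern)


def is_selected(key: str, filters: Iterable[Filter]) -> bool:
    """Apply include/exclude filters in order; the last matching filter wins."""
    selected = True  # with no filters, every key is included
    for action, pattern in filters:
        if fnmatchcase(key, pattern):
            selected = (action == "include")
    return selected


if __name__ == "__main__":
    filters = [("exclude", "*"), ("include", "2004-03-01/*")]
    for key in ["2004-03-01/ontology/go.obo", "2004-04-01/ontology/go.obo"]:
        print(key, is_selected(key, filters))
```

Swapping the two filters would exclude everything, since the trailing exclude would then override the include.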

For the Content-Type, we might want to revisit exactly what's going on there, as some (semantic) web applications/ontology tools use that as a hint to do the "right" thing. I'm not sure we'll want that overridden, but we can punt for now and come back to that later on.

I'm going to move on to trying the initial one (2004-03-01) in the release bucket and see what happens with CF, etc.

kltm commented 3 years ago

@lpalbou Okay, I may have run into an issue when combining this with the CDN? As a test of our release setup, we have the experimental bucket fronted by the experimental CDN:

aws s3 sync --exclude "*" --include "2004-03-01/*" --acl public-read s3://geneontology-test s3://go-data-product-experimental

should be exposed at:

http://experimental.geneontology.io/2004-03-01/index.html

However, it seems to actively be searching for a bucket? I'm guessing that it was grabbing the bucket from the URL then. Is this something that the final indexer would be taking care of?

lpalbou commented 3 years ago

From off github: the bucket name is determined from the URL; if using an alias/cname, then the bucket name can not be inferred, so it has to be encoded when calling the python script.

@kltm still to proceed with the copy as we can update the index.html after. I will try to get a fix out for tomorrow and will test also possible side effects.
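To make the alias problem concrete, here is a minimal sketch of inferring a bucket name from a virtual-hosted-style S3 URL. It returns None for a CNAME or CDN hostname, which is the case where the name has to be supplied some other way. The regular expression and function name are assumptions for illustration and not the explorer's actual logic.

```python
import re
from typing import Optional
from urllib.parse import urlparse

# Matches hosts such as <bucket>.s3.amazonaws.com,
# <bucket>.s3.<region>.amazonaws.com and <bucket>.s3-<region>.amazonaws.com.
_S3_HOST = re.compile(r"^(?P<bucket>.+)\.s3(?:[.-][a-z0-9-]+)?\.amazonaws\.com$")


def bucket_from_url(url: str) -> Optional[str]:
    """Return the bucket name encoded in an S3 hostname, or None for an alias."""
    host = urlparse(url).hostname or ""
    match = _S3_HOST.match(host)
    return match.group("bucket") if match else None


if __name__ == "__main__":
    for url in [
        "https://go-data-testing-sandbox.s3.amazonaws.com/index.html",
        "http://experimental.geneontology.io/2004-03-01/index.html",
    ]:
        print(url, bucket_from_url(url))
```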

lpalbou commented 3 years ago

Ok, @kltm I created a similar S3/CF/Route53 archi on release.geneontology.xyz with just two releases to test and it should now work with your archi. See the URL below served by Route53 -> CF -> S3:

[Screenshot: Screen Shot 2020-12-10 at 12 07 36 AM]

Notes:

The s3-add-file script was renamed to s3_add_file to follow python conventions and allow for the import in a new script s3_add_index
Instead of running the generic s3_add_file, now run instead: python s3_add_index.py -o geneontology-test where -o is your target bucket name; you don't need to specify the index.html anymore as the script is writing the Bucket name inside the index before uploading it to all S3 "folders" of your bucket

Let me know if you encounter any other issue.

kltm commented 3 years ago

From conversation with @lpalbou need the following on the CDN upstream bucket:

[
    {
        "AllowedHeaders": [
            "*"
        ],
        "AllowedMethods": [
            "HEAD",
            "GET"
        ],
        "AllowedOrigins": [
            "*"
        ],
        "ExposeHeaders": []
    }
]
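The same rule can also be applied from a script. Here is a minimal sketch, assuming a client that behaves like a boto3 S3 client (whose put_bucket_cors takes a Bucket and a CORSConfiguration containing a CORSRules list); the helper names are hypothetical.

```python
from typing import Any


def cors_configuration() -> dict:
    """CORS rule equivalent to the JSON above, in the shape put_bucket_cors expects."""
    return {
        "CORSRules": [
            {
                "AllowedHeaders": ["*"],
                "AllowedMethods": ["HEAD", "GET"],
                "AllowedOrigins": ["*"],
                # The empty ExposeHeaders list from the JSON above is left out.
            }
        ]
    }


def apply_cors(client: Any, bucket: str) -> None:
    """Attach the CORS rule to `bucket` using a boto3-like S3 client."""
    client.put_bucket_cors(Bucket=bucket, CORSConfiguration=cors_configuration())
```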

kltm commented 3 years ago

Okay, making progress, but have run into another hiccup. I'm guessing with some of the settings? On http://experimental.geneontology.io/ I'm now getting:

Error accessing S3 bucket go-data-product-experimental. Error: NetworkingError: Network Failure

on root and index.html pages, after running python3 s3_add_index.py -o go-data-product-experimental (underlying bucket).

CORS settings are as above; access "public"; CDN contents invalidated... Content seems available and public (e.g. http://experimental.geneontology.io/notes.txt, http://experimental.geneontology.io/2004-03-01/annotations/sgd.gaf.gz). I'll dig more into this tomorrow and see if there are any leads.

lpalbou commented 3 years ago

Looking at the error log:

It seems you haven't given CORS access to the bucket go-data-product-experimental - see https://github.com/geneontology/archive-reconstruction/issues/9#issuecomment-742932145

Also from the bucket itself, it's working: https://go-data-product-experimental.s3.amazonaws.com/index.html

lpalbou commented 3 years ago

For completeness, here are the configs of the quick bucket I created yesterday to fit your Route53 -> CF -> S3 config:

If you do have the CORS, then I guess you may have forgotten to also allow the ACL public list/read as discussed yesterday ?

kltm commented 3 years ago

Okay, after reworking and tweaking things for way too long, it turns out that the CORS settings we have for what's needed for CF are not sufficient for bucket access (naturally, in retrospect) and I kept overlooking them (ugh). Specifically, part of AllowedHeaders.

http://experimental.geneontology.io/ now looking pretty good to me.

Quick question as I convert other buckets over: is Everyone | Bucket ACL | Read necessary for something, or just a quirk of your setup? It seems to work fine so far without, but I have not fully explored yet.

lpalbou commented 3 years ago

CORS: yep, it's one of the most common issues on the web I guess.

now looking pretty good to me

I gave it a quick look too and both the browsing and downloading was working.

Everyone | Bucket ACL | Read

It was actually for another piece of code, so the only mandatory one here is indeed the Everyone | Objects | List, which is used by the page to list the "folders" and files of the bucket.

Let me know when everything is pushed to release.go.org so that I update the GO site menu/archive page.

lpalbou commented 3 years ago

Just one thing though, I would highly suggest that release.geneontology.org be deployed as https; from a CF point of view, this should be quite easy as you can use a certificate generated by AWS since geneontology.org is managed by route53 (e.g. that's what I did in my example above with https://release.geneontology.xyz/index.html).

We really need to make progress on that and we need to start somewhere, otherwise it will keep on creating issues (e.g. I initially started the gocam api because the golr endpoint is http only) and in the meantime, external sites can not use or link our resources as most of them are https (it's actually even a requirement for publication now)

kltm commented 3 years ago

release.go.org now has the historical files contained, but not migrated and still using the old capping.

Before I plow the rest through, there was one oddity in the way paths seem to be working. If I go to, for example, http://release.geneontology.org/2004-03-01/ , it gives me the root directory, with all links from that point on shifted (and not working). This seems to have also been true in experimental.geneontology.io, but I did not catch it.

lpalbou commented 3 years ago

The UI was designed to work on full URL, e.g. http://release.geneontology.org/2004-03-01/index.html, not http://release.geneontology.org/2004-03-01/. From our discussion, you were supposed to handle those redirects with AWS.

If this is a new requirement on the UI side, I can look into it but this does not affect external users as they will be redirected to the correct URL from the GO site. We can easily iterate on this later rather than delaying the deployment.
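The redirect in question boils down to a small path rule: a request path ending in a slash is mapped to the index.html inside that "folder". Here is a minimal sketch of that rule in Python; where it would actually run (CDN, bucket website configuration, or the UI itself) is not settled in this thread, and the function name is hypothetical.

```python
def normalize_path(path: str, index_name: str = "index.html") -> str:
    """Map a folder-style request path to the index page inside that folder."""
    if not path:
        path = "/"
    if path.endswith("/"):
        return path + index_name
    # Paths without a trailing slash are left alone: they cannot be told
    # apart from regular file names without looking at the bucket.
    return path


if __name__ == "__main__":
    for p in ["/2004-03-01/", "/2004-03-01/index.html", "/"]:
        print(p, "->", normalize_path(p))
```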

In addition and as already discussed, it would be useful to share some AWS resources (see https://docs.aws.amazon.com/IAM/latest/UserGuide/tutorial_cross-account-with-roles.html), otherwise I have to guess parts of your architecture and configurations. This would avoid these last-minute oddities.

kltm commented 3 years ago

@lpalbou Yeah, we can iterate on this under https://github.com/geneontology/pipeline/issues/203 or a new issue for remaining items--definitely something I want included, but not directly tied to getting this into production.

The "final" indexing is now running (just gunna do the whole thing again), which should make it visible pretty soon. The top-level and future releases will reset to the current template (what's in the pipeline now), but that's over at https://github.com/geneontology/pipeline/issues/203 .

kltm commented 3 years ago

Checking off items above.

lpalbou commented 3 years ago

@kltm not yet visible on http://release.geneontology.org/index.html . Did you remove your path filter for the tests ? When it's up, I will update the link to the archive and documentation.

For the oddity, I'll try to fix that after this and the GO sparql endpoints are online, but that will be a guess since I don't have access to your config.

kltm commented 3 years ago

@lpalbou Nah, was just hoping that the CDN would cache out. I just gave it a manual poke--it looks like it's switched now. For the rest, we can refresh in the new year.
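If the "manual poke" was a CloudFront invalidation (an assumption; the thread does not say), it could be scripted along these lines. The sketch assumes a client that behaves like a boto3 CloudFront client, whose create_invalidation takes a DistributionId and an InvalidationBatch; the helper name, the default path and the caller reference format are hypothetical.

```python
import time
from typing import Any, Optional, Sequence


def invalidate(client: Any, distribution_id: str,
               paths: Sequence[str] = ("/*",),
               caller_reference: Optional[str] = None) -> Any:
    """Ask the CDN to drop cached copies of `paths` for one distribution."""
    items = list(paths)
    # CallerReference must be unique per request; a timestamp is used here.
    reference = caller_reference or f"reindex-{int(time.time())}"
    return client.create_invalidation(
        DistributionId=distribution_id,
        InvalidationBatch={
            "Paths": {"Quantity": len(items), "Items": items},
            "CallerReference": reference,
        },
    )
```

Restricting the paths to the re-uploaded index pages, rather than "/*", would keep the large data files cached.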

lpalbou commented 3 years ago

I put a temporary doc page up to describe and link the archive: http://geneontology.org/docs/go-archives/

If that sounds good, I will update and replace the old http://geneontology.org/docs/archives/

lpalbou commented 3 years ago

@thomaspd @cmungall @pgaudet

lpalbou commented 3 years ago

Since http://release.geneontology.org contains both the archive and the doi releases, and it's now much more user friendly, I would recommend to create a "Download" button on the main page so that users get a direct access to files.

lpalbou commented 3 years ago

@kltm it seems http://current.geneontology.org is a different S3, so the index.html would need to be updated there too.

pgaudet commented 3 years ago

I put a temporary doc page up to describe and link the archive: http://geneontology.org/docs/go-archives/

Some comments:

I don't think we should put 'historical' information, we should only describe the data we provide. So this sentence

' The Gene Ontology consortium has released in December 2020 a comprehensive archive covering the past 15 years of GO data with a consistent format and file structure for both the ontology and annotations.

could be changed to

Comprehensive GO archive of the ontology and annotations from 2004. (If you want we can add "Note that this replaces the former CVS, SVN and product archives." - although I don't know what 'product archives' are)

I am not sure we need the next section

Archive content and consistency The GO archive contains the monthly releases built from 2004 to Feb 2018 with the deprecated GO CVS, SVN and product archives. The archive also contains all the GO DOI monthly releases (start in March 2018). Each monthly release was built using the same folder hierarchy as our current GO DOI releases:

and the two screenshots. A single screenshot for the current content would be fine.

Esthetic question: do we need the trailing forward slash in the labels ? It would look less geeky if we could hide it.

(comments on the description of contents coming in the next comment)

pgaudet commented 3 years ago

Suggestions for folders descriptions:

I would remove the * and add dates as appropriate. I tried to do that here, please check that it's right. Add a note: "Note that some new files and file formats have been added over the years, so that the content of each archive has evolved over time."

annotations/ : GO annotations as GAF files (2004-current), and additionally GPAD and GPI files from March, 2018 [with the GO DOI releases -> what does that mean?]
annotations/gp2protein/ (*) : mappings of contributing group IDs (usually MOD ids) for protein gene products to UniProtKB accession numbers, from 2004 to Feb 2018
annotations/gp2rna/ (*): mappings of contributing group IDs (usually MOD ids) for non-coding RNAs products to RNACentral IDs, from 2004 to Feb 2018 [really?]
ontology/ : GO ontology as .obo and .owl files. More information on the various ontology files can be found here: http://geneontology.org/docs/download-ontology/

[- users are recommended to use ontology/go.obo if they don’t need to go back further than March 2009 and ontology/gene_ontology.obo (old obo format) if they need to go back to the beginning of the archive] -> We need to figure out the dates when the obo format was changed. Both March and April 2009 have obo 1.0 in the header. Current version is 1.4. When did it change ? Is it possible to parse all headers ?

ontology/extensions/ (**): contains the various ontologies imported or produced by GO from May 2015.
ontology/external2go/ : mapping of GO terms to different resources, including InterPro, Rhea, KEGG, and Reactome.
ontology/subsets/ (***) : contains the GO subsets (also known as slims) used to simplify the ontology for specific purposes (e.g. goslim_synapse) or organisms (e.g. goslim_pombe) - we recommend to use .obo files rather than old deprecated .go files -> do we always have both formats ?
mysql_dumps/ (**) : contains the MySQL dumps of GO (e.g. -assocdb , -termdb), from May 2015.
products/annotations : GO annotations files provided by the contributing groups. Those files are kept for transparency but users are recommended to use the GO annotations in the annotations/ folder, as they can differ due to different version of the ontology, as well as various filtering and checks performed by the GO consortium to ensure quality.

It would be a good opportunity to rename some folders - especially /extensions and /products/annotations that are extremely confusing.

Awesome that we're almost there :)

Thanks, Pascale

lpalbou commented 3 years ago

@pgaudet I will include your changes.

It would be a good opportunity to rename some folders - especially /extensions and /products/annotations that are extremely confusing.

Those folders are currently used by the GO DOI releases, if we change them, we have to change them for both the archive and DOI releases. You know I find it unintuitive to have input annotations in a products/ folder, but this now seems out of scope for a project at the end. Quick fix: just hide the /products/ folder from every release. Otherwise, this could be done when we refactor the downloads by species.

@kltm I simulated what I believe is your config of the GO AWS release.geneontology.org to test the issue when index.html is not automatically added to the URL: https://github.com/lpalbou/aws-js-s3-explorer/pull/1 . It also updates the font-style to match the one of the GO site.

To update the index with those fixes for the full archive, it's like the last time, you just have to run with your AWS auth:

python3 s3_add_index.py -o <s3_repo_name>

FYI, it's working on both my S3 and CF:

pgaudet commented 3 years ago

Thanks @lpalbou !

Those folders are currently used by the GO DOI releases, if we change them, we have to change them for both the archive and DOI releases. You know I find it unintuitive to have input annotations in a products/ folder, but this now seems out of scope for a project at the end.

OK, sounds good. I am aware it is out of scope, but it is always going to be painful to change it. We can add this as a task for the refactoring of the downloads by species (but we can also argue that it's out of scope ;)

Quick fix: just hide the /products/ folder from every release.

Do you mean the whole /products folder ? Or products/annotations ? I would be in favor of this. Do you have any idea if anyone is using this ? @kltm @thomaspd @cmungall what do you think ?

Thanks, Pascale

kltm commented 3 years ago

With the last release, I've re-indexed release.geneontology.org--it's looking good to me!

/products is currently used for pickups by various groups for noctua-only and prediction products.

lpalbou commented 3 years ago

I released the archive taking into account the above comments: https://github.com/geneontology/geneontology.github.io/pull/279

It is now accessible from the go website with the menu downloads/GO archive. If there are other issues, please create another ticket as this one was to put the archive in production.

Answers to questions above:

Regarding obo files in the archive and newer releases:

go.obo is always version 1.2 from March 2009 to now. go-basic.obo is also 1.2
gene_ontology.obo is always version 1.0 from 2004/03 to 2018/02 (file removed from newer releases)
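For anyone wanting to re-check these versions across releases, the format version is declared in the header of each OBO file with the format-version tag. Here is a minimal sketch that reads it from the header lines; the function name is hypothetical and this is not necessarily how the versions above were obtained.

```python
from typing import Iterable, Optional


def obo_format_version(lines: Iterable[str]) -> Optional[str]:
    """Return the value of the format-version header tag, or None if absent."""
    for line in lines:
        line = line.strip()
        if line.startswith("["):
            # First stanza (e.g. [Term]) reached: the header is over.
            return None
        if line.startswith("format-version:"):
            return line.split(":", 1)[1].strip()
    return None


if __name__ == "__main__":
    # Hypothetical header excerpt, only for demonstration.
    header = ["format-version: 1.2", "ontology: go", "", "[Term]", "id: GO:0000001"]
    print(obo_format_version(header))
```

Passing an open file object works as well, since only the first few lines are read before the function returns.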

Regarding the presentation of the archive:

yes, we can filter out by default the folders we don't want to display for external users
we can have either a parameter in the URL or a toggle button to show all folders (e.g. products/) when needed by people internally
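The URL-parameter variant of this idea can be sketched as follows. The parameter name show_all and the default set of visible folders are assumptions for illustration; the actual explorer page is JavaScript, so this Python version only shows the logic.

```python
from typing import Iterable, List
from urllib.parse import parse_qs, urlparse

# Folders shown to external users by default (assumption).
DEFAULT_VISIBLE = {"annotations/", "ontology/"}


def visible_folders(folders: Iterable[str], url: str) -> List[str]:
    """Filter top-level folders unless the URL asks to show everything."""
    query = parse_qs(urlparse(url).query)
    # "show_all" is a hypothetical parameter name.
    if query.get("show_all", ["0"])[0] in ("1", "true"):
        return list(folders)
    return [folder for folder in folders if folder in DEFAULT_VISIBLE]


if __name__ == "__main__":
    listing = ["annotations/", "bin/", "ontology/", "products/"]
    base = "http://release.geneontology.org/2018-03-06/index.html"
    print(visible_folders(listing, base))
    print(visible_folders(listing, base + "?show_all=1"))
```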

geneontology / archive-reconstruction

Archive in production #9