kltm opened 2 years ago
Working on the issue-265-go-cam-products branch.
@dustine32 So, much to my amazement, we seem to have something that is beginning to work...
http://skyhook.berkeleybop.org/issue-265-go-cam-products/products/api-static-files/
I'd like a little feedback and information from you, but there is a start here. I was trying to make something that would work without modifying anything upstream and without making new docker images, so there is some weirdness in there (e.g. creating new runtime scripts for the blazegraph docker environment, using sed to make runtimes and other changes on the fly, setting up maven and installing nodejs after the fact, nested working directories), but there is a working base here nonetheless. So, questions:
A lot of questions there, so if it's easier, we can touch base on a voice call.
@kltm Whoa. I'm amazed you were able to hack all of my "manual" commands into the pipeline. Great job!
blazegraph-production.jnl (in the end, after the rest of the coding/testing is worked out): if a blazegraph-production.jnl is produced in release, snapshot, and master, can we just default to grabbing whatever blazegraph-production.jnl is already produced/laying around locally in that branch/run? Or, like, wget -N http://skyhook.berkeleybop.org/$BRANCH_NAME/products/blazegraph/blazegraph-production.jnl.gz?

Remove the jetty server and api-gorest-2021 app and replace with these components:
a. blazegraph-runner cmd
b. SPARQL query files (4 of them) - stored in go-site/pipeline/sparql/reports/
c. A small script to convert (handle grouping) the blazegraph-runner output to the JSON structure expected by the GO-CAM site (web-gocam); see the sketch after this list
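For (c), a minimal sketch of what the grouping step could look like, assuming blazegraph-runner writes a two-column TSV with a header (model, GO term); the column names and the exact JSON shape web-gocam expects are placeholders here, not the real spec:

```bash
# Hypothetical grouping step: collapse a two-column TSV (gocam, goterm)
# into one JSON record per model. Field names are illustrative only;
# adjust to whatever shape web-gocam actually expects.
tail -n +2 gocam-goterms.tsv \
  | jq -R -s '
      split("\n")
      | map(select(length > 0) | split("\t") | {gocam: .[0], goterm: .[1]})
      | group_by(.gocam)
      | map({gocam: .[0].gocam, goterms: map(.goterm)})' \
  > gocam-goterms.json
```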
Point to the skyhook files and adjust until it works?

@kltm Actually, I'm playing around with blazegraph-runner locally now and I realized having 4 separate blazegraph-runner cmds means loading the journal 4 times, which is taking a while. So now I'm appreciating that "load-journal-once" jetty endpoint. I'm thinking we just keep that part?
@dustine32 Okay, comments on comments...
> @kltm Actually, I'm playing around with blazegraph-runner locally now and I realized having 4 separate blazegraph-runner cmds means loading the journal 4 times, which is taking a while. So now I'm appreciating that "load-journal-once" jetty endpoint. I'm thinking we just keep that part?
@dustine32 If I need to run queries in parallel, I would build the journal in another target and then cp the journal to a new file in each target before running. Or else just run all queries in one target (probably makes the most sense); see the sketch below.
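To make that concrete, here's a rough sketch of the "all queries in one target" approach, assuming the journal is pulled from skyhook as discussed above, that the report queries use an .rq extension, and that blazegraph-runner select takes --journal/--outformat flags (worth verifying against the version pinned in the pipeline):

```bash
# Sketch only: pull the journal once, then run all four report queries
# against the same local copy. blazegraph-runner flag names are
# assumptions; verify --journal/--outformat against the version in use.
set -e
BRANCH_NAME=issue-265-go-cam-products

wget -N http://skyhook.berkeleybop.org/$BRANCH_NAME/products/blazegraph/blazegraph-production.jnl.gz
pigz -d -k -f blazegraph-production.jnl.gz   # pigz for speed; keep the .gz so wget -N timestamping still works

for q in go-site/pipeline/sparql/reports/*.rq; do
    out=$(basename "$q" .rq).tsv
    blazegraph-runner select --journal=blazegraph-production.jnl --outformat=tsv "$q" "$out"
done
```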
@dustine32 The "full" test run on the production file only takes ten minutes on this end, which is pretty good, especially as I can see things that can easily be sped up, like using pigz
.
So, the full versions are now available (with "correct" extensions) on S3 at places like:
https://go-public.s3.amazonaws.com/files/gocam-goterms.json
https://go-public.s3.amazonaws.com/files/gocam-goterms.json.gz
and so on. They are also available on skyhook, but those might disappear during runs.
I guess this puts this back into your court with testing against the S3 products to see if they work?
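If it helps, a quick way to sanity-check the S3 copies from the command line (just confirming they resolve and parse; header names may vary):

```bash
# Check that the S3 products resolve and report sane headers.
curl -sI https://go-public.s3.amazonaws.com/files/gocam-goterms.json | head -n 10
curl -sI https://go-public.s3.amazonaws.com/files/gocam-goterms.json.gz | head -n 10

# Spot-check that the body parses as JSON (requires jq).
curl -s https://go-public.s3.amazonaws.com/files/gocam-goterms.json | jq 'length'
```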
@kltm Running a local instance of https://github.com/geneontology/web-gocam/commit/ada645bd17851e09ac561749ec7c2367939b17c5, I tested and confirmed the non-gzipped URLs work with the GO-CAM browser site.
@dustine32 Okay, great. Since they aren't too large, I'm going to go ahead and remove the gzipped versions from our new pipeline and deployment.
@dustine32 Okay, done. The next step above "GO-CAM API at new products (temporary)" could technically be a stable terminal state (even though we don't want it to be), so a little less worry for us. I think this one is probably on your plate? Would you like people to work on that with you and spread the knowledge? Also, it's probably good to update our internal documentation for this new stable state, even though it's meant to be temporary.
Talked to @dustine32 and he clarified some of my confusion: this only needs to update the GO-CAM website, not the GO-CAM API. Things to do above updated accordingly.
After group discussion, we'll wrap this after automating @dustine32. So, how should we automate you? There are two obvious ways in my mind:
I think beyond those two, we'd likely be doing a bit more work. (I'm avoiding adding them to the main pipeline products for the moment, until we know what our roadmap will be.) Do either of these make more or less sense to you?
@kltm Thanks! My vote is for option 2 since a side effect might be that we get closer to something like a standard set of GO-CAM JSON products tied to GO releases (once this is running in the main pipeline). See https://github.com/geneontology/go-site/issues/1180#issuecomment-962278856 for a bit more detail.
For changing the GO-CAM web app, I believe the steps are:
@dustine32 Okay, it looks like 1 is done and committed; I've created a PR https://github.com/geneontology/web-gocam/pull/18 . I think that's probably safe to merge, no? At worst, it might automatically update and problem solved. I've tested it locally and it appears to be going to the correct location.
For 2, this is a bit worrying: https://github.com/geneontology/web-gocam/blob/c4e4bf6cf4c190c757e40c9fbe47c3260907cfa6/deploy.sh#L2
I'm not wild about a recursive delete of production, but it otherwise seems straightforward. The local tools seem to work as advertised after doing an npm install. However, I think that you are currently the go-to person, given the .cloud credentials?
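For what it's worth, one way to avoid the bare recursive delete would be something along the lines of aws s3 sync with --delete, which only removes keys no longer present in the build output; the bucket name and build directory below are placeholders, not what deploy.sh actually uses:

```bash
# Sketch of a less destructive deploy, as an alternative to
# "delete everything, then upload". Bucket and paths are hypothetical;
# deploy.sh remains the source of truth for the real values.
npm install
npm run build   # assuming web-gocam's build script; check package.json

aws s3 sync dist/ s3://example-gocam-site-bucket/ --delete
```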
@kltm Yep, I'll try to make sure it's deployed today, tip-toeing around the recursive delete (I'll prob have to do it but I'll see what I accomplish without it first).
I suppose a Friday afternoon is probably the best time to try things like this anyways. It will probably all go fine, but if you run into any hiccups, don't hesitate to ping me (or we could do it together if you want company).
Caught up with @dustine32 and updated the TODO list above. We'll revisit after this upcoming Friday.
Talked to @dustine32 and " complete transfer (or remapping) of S3 and CF resources to USC" completed.
Expanding on https://github.com/geneontology/pipeline/issues/265#issuecomment-1036846298: with the S3 and CF transfer to USC AWS, we now have control over the GO-CAM site code that is served on geneontology.cloud and thus from where the GO-CAM site will fetch the JSON files.
So, if we ever need to change JSON filenames or location, we just have to PR the changes (an example: https://github.com/geneontology/web-gocam/pull/18) and run the deploy.sh script. Be sure to update the correct CF distribution ID in the deploy.sh create-invalidation cmd, which we can now find under the USC CF list.
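For reference, that invalidation step is presumably the standard AWS CLI call along these lines; the distribution ID below is a placeholder to be replaced with the one now listed under the USC account:

```bash
# Placeholder distribution ID; use the one from the USC CF console.
aws cloudfront create-invalidation \
    --distribution-id EXXXXXXXXXXXXX \
    --paths "/*"
```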
The purpose of this item is to automatically generate:
and push to an appropriate S3 location. This takes over for https://github.com/geneontology/api-gorest-2021/issues/2 .
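As a sketch of the "push to an appropriate S3 location" step, assuming the destination stays the go-public bucket implied by the URLs above (paths and cache settings are guesses, not decided):

```bash
# Hypothetical upload step for the generated JSON products.
# Bucket/prefix inferred from https://go-public.s3.amazonaws.com/files/;
# confirm before wiring into the pipeline.
for f in gocam-*.json; do
    aws s3 cp "$f" "s3://go-public/files/$f" \
        --content-type application/json \
        --cache-control "max-age=3600"
done
```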
A possible set of tasks could be:
From a software call, above is the cutoff for closing this item. With future pipeline refactoring, we'd want to spin out the following:
As there is a manual workaround for the time being, while annoying, I'm giving this less than an IT'S-ON-FIRE! priority. Documentation for the manual hack of file update/upload while we work things out: https://docs.google.com/document/d/18vYy9sZq-dyjYWW0mnw3XpXRJjlI7pbQWvMlSSdXdjA/edit#heading=h.tzx1g6nhmgtd
Tagging @dustine32 @kltm