geneontology / pipeline

Declarative pipeline for the Gene Ontology.
https://build.geneontology.org/job/geneontology/job/pipeline/
BSD 3-Clause "New" or "Revised" License

Add JSON product production for GO-CAM API to pipeline #265

Open · kltm opened this issue 2 years ago

kltm commented 2 years ago

The purpose of this item is to automatically generate:

and push to an appropriate S3 location. This takes over for https://github.com/geneontology/api-gorest-2021/issues/2 .

A possible set of tasks could be:

From a software call, above is the cutoff for closing this item. With future pipeline refactoring, we'd want to spin out the following:

As there is a manual workaround for the time being, while annoying, I'm giving it less than an IT'S-ON-FIRE! priority. Documentation for the manual hack of file update/upload while we work things out: https://docs.google.com/document/d/18vYy9sZq-dyjYWW0mnw3XpXRJjlI7pbQWvMlSSdXdjA/edit#heading=h.tzx1g6nhmgtd

Tagging @dustine32 @kltm

kltm commented 2 years ago

Working on issue-265-go-cam-products branch.

kltm commented 2 years ago

@dustine32 So, much to my amazement, we seem to have something that is beginning to work...

http://skyhook.berkeleybop.org/issue-265-go-cam-products/products/api-static-files/

I'd like a little feedback and information from you, but there is a start here. I was trying to make something that would work without modifying any upstream repos and without making new docker images, so there is some weirdness in there (e.g. creating new runtime scripts for the blazegraph docker environment, using sed to make runtimes and other changes on the fly, setting up maven and installing nodejs after the fact, nested working directories; there's a rough sketch of that kind of patching after the questions below), but there is a working base here nonetheless. So, questions:

  1. Is what is in there understandable? From: https://github.com/geneontology/pipeline/blob/1a81657349ac564315e526f496f218d6f85adbc7/Jenkinsfile#L357
  2. What changes could be made to the upstream repos that would a) keep them sensible and clean but b) make this pipeline a little less hacky?
  3. What are the commands to get the rest of the required JSON files?
  4. What form do the JSON files need to be in? Gzipped even though the extension is wrong?
  5. Anything else you might think of...?

A lot of questions there, so if it's easier, we can touch base on voice.
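
To give a flavor of the on-the-fly patching I mean (the file names and package list here are made up for illustration; the real steps are in the Jenkinsfile linked in 1):

```bash
# Illustrative only: generate a runtime script for the blazegraph container by
# patching a hypothetical template with sed, then bolt on build tooling that
# the base image does not ship with.
sed -e "s|@JOURNAL@|/data/blazegraph-production.jnl|" \
    blazegraph-runtime.sh.template > blazegraph-runtime.sh
chmod +x blazegraph-runtime.sh

apt-get update && apt-get install -y maven nodejs npm
```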

dustine32 commented 2 years ago

@kltm Whoa. I'm amazed you were able to hack all of my "manual" commands into the pipeline. Great job!

  1. Yep. Makes total sense. Though I am wondering about how we will be choosing the blazegraph-production.jnl (in the end, after the rest of the coding/testing is worked out). If a blazegraph-production.jnl is produced in release, snapshot, and master, can we just default to grabbing whatever blazegraph-production.jnl is already produced/laying around locally in that branch/run? Or like, wget -N http://skyhook.berkeleybop.org/$BRANCH_NAME/products/blazegraph/blazegraph-production.jnl.gz?
  2. We can probably get rid of the jetty server and api-gorest-2021 app and replace them with these components (a rough sketch is below, after this list):
     a. blazegraph-runner cmd
     b. SPARQL query files (4 of them) - stored in go-site/pipeline/sparql/reports/
     c. A small script to convert (handle grouping) the blazegraph-runner output to the JSON structure expected by the GO-CAM site
  3. I just committed the other two API cmds for the remaining files.
  4. Not yet sure if they actually need to be gzipped. We can test by having a dev instance of web-gocam point to the skyhook files and adjust until it works?
  5. Ummm... hmmm...
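
To make 2 a bit more concrete, a rough sketch of the CLI route (this assumes blazegraph-runner's select subcommand works roughly like load, i.e. a journal flag plus query and output files; the query/output names are placeholders, with the real query files living in go-site/pipeline/sparql/reports/):

```bash
# Sketch: query the already-built journal once per report, writing TSV.
# Only gocam-goterms is named elsewhere in this thread; the others are guesses.
JOURNAL=blazegraph-production.jnl

for q in gocam-goterms gocam-models gocam-gps gocam-pmids; do
    blazegraph-runner select --journal="$JOURNAL" --outformat=tsv \
        go-site/pipeline/sparql/reports/"$q".rq "$q".tsv
done
# Each TSV still needs a small grouping/conversion step to become the JSON
# the GO-CAM site expects.
```
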
dustine32 commented 2 years ago

@kltm Actually, I'm playing around with blazegraph-runner locally now and I realized having 4 separate blazegraph-runner cmds means loading the journal 4 times, which is taking a while. So now I'm appreciating that "load-journal-once" jetty endpoint. I'm thinking we just keep that part?

kltm commented 2 years ago

@dustine32 Okay, comments on comments...

  1. Correct. There are two ways forward: if (when) this gets folded into the core pipeline, it would grab the journal from inside the pipeline and use it before publication; if this remains outside of the core pipeline, it can continue to grab from "current.go.org", as that will be the latest and likely just created. The former is better as it means we can try different loads as experiments for the GO-CAM API, etc.
  2. Re: sparql/jetty vs blazegraph-runner. In the best of all worlds, and for a bunch of reasons, I'd rather have a bunch of cli commands instead of servers being spun up and down. For the main pipeline, I think that spinning up blazegraph-runner repeatedly is a small consideration compared to the simplicity of just having commands. Moreover, if it were irritating enough, we could parallelize or make blazegraph-runner handle batches or something, probably without too much trouble. As well, having just cli from repos would make it easier to bake a single-purpose and easy-to-use docker image to handle all of these things.
    My concern for the moment is how to handle the SPARQL output and how to convert it properly into the JSON that's needed (a rough sketch of one way to do the grouping is below, after this list). It may be that that's essentially locked into the JS and hard to extract or make cli conversion tools for. Or maybe not? I guess I'm wondering how much conversion is necessary and how hard it would be to extract. Ideally, there would also be a JSON schema so we knew we were doing the right thing, but I think we'd probably just want to move on quickly, as this area of the stack will see some evolution this year.
  3. Cheers! I'll re-run and see what we get. As well, I'll turn on the "full" data load for the next test.
  4. Okay, sounds good. The wrong extension thing kinda weirds me out, so if we can avoid that, all the better. For the moment, I'll also make some gzipped products with the proper extension to see if that works as well. Either way, I'd prefer file names that match content.
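
Re: the conversion in 2, a minimal sketch of what I mean by a cli conversion step, assuming the SPARQL output is a two-column TSV with a header and that the site wants rows grouped by the first column (the field names here are placeholders, not a confirmed schema):

```bash
# Group a two-column TSV (term <TAB> model) into JSON, one object per term.
# "goterm" and "gocams" are placeholder field names.
tail -n +2 gocam-goterms.tsv | jq -R -s '
  split("\n")
  | map(select(length > 0) | split("\t"))
  | group_by(.[0])
  | map({goterm: .[0][0], gocams: map(.[1])})
' > gocam-goterms.json
```
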
balhoff commented 2 years ago

> @kltm Actually, I'm playing around with blazegraph-runner locally now and I realized having 4 separate blazegraph-runner cmds means loading the journal 4 times, which is taking a while. So now I'm appreciating that "load-journal-once" jetty endpoint. I'm thinking we just keep that part?

@dustine32 if I need to run queries in parallel I would build the journal in another target and then cp the journal to a new file in each target before running. Or else just run all queries in one target (probably makes the most sense).
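
Something like this, roughly (untested, names made up):

```bash
# One private copy of the journal per query so the selects can run in parallel.
for q in gocam-goterms gocam-models gocam-gps gocam-pmids; do
    cp blazegraph-production.jnl "$q".jnl
    blazegraph-runner select --journal="$q".jnl --outformat=tsv \
        go-site/pipeline/sparql/reports/"$q".rq "$q".tsv &
done
wait   # let all queries finish before the conversion step runs
```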

kltm commented 2 years ago

@dustine32 The "full" test run on the production file only takes ten minutes on this end, which is pretty good, especially as I can see things that can easily be sped up, like using pigz. So, the full versions are now available (with "correct" extensions) on S3 at places like:

https://go-public.s3.amazonaws.com/files/gocam-goterms.json https://go-public.s3.amazonaws.com/files/gocam-goterms.json.gz

and so on. They are also available on skyhook, but those might disappear during runs.
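
For reference, the publish step is essentially just this (the bucket/prefix matches the URLs above; the exact pipeline invocation differs a bit):

```bash
# Compress with pigz, keeping the uncompressed copy so each file name matches
# its actual content, then push both to the public bucket.
pigz --keep --force gocam-goterms.json        # writes gocam-goterms.json.gz
aws s3 cp gocam-goterms.json    s3://go-public/files/gocam-goterms.json
aws s3 cp gocam-goterms.json.gz s3://go-public/files/gocam-goterms.json.gz
```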

I guess this puts this back into your court with testing against the S3 products to see if they work?

dustine32 commented 2 years ago

@kltm Running a local instance of https://github.com/geneontology/web-gocam/commit/ada645bd17851e09ac561749ec7c2367939b17c5, I tested and confirmed the non-gzipped URLs work with the GO-CAM browser site.

kltm commented 2 years ago

@dustine32 Okay, great. Since they aren't too large, I'm going to go ahead and remove the gzipped versions from our new pipeline and deployment.

kltm commented 2 years ago

@dustine32 Okay, done. The next step above, "GO-CAM API at new products (temporary)", could technically be a stable terminal state (even though we don't want it to be), so a little less worry for us. I think this one is probably on your plate? Would you like people to work on that with you and spread the knowledge? Also, it's probably good to update our internal documentation for this new stable state, even though it's meant to be temporary.

kltm commented 2 years ago

Talked to @dustine32 and he clarified some of my confusion: this only needs to update the GO-CAM website, not the GO-CAM API. Things to do above updated accordingly.

kltm commented 2 years ago

After group discussion, we'll wrap this after automating @dustine32 . So, how should we automate you? There are two obvious ways in my mind:

  1. We get the USC credentials, put them into the pipeline, and push directly
    pro: we keep what we have so there is almost no chance of side effects; con: we maintain Yet Another Data Drop Point
  2. We aim the GO-CAM web app at the newly minted S3 products (the API does not use them separately)
    pro: easy to do(?); con: maybe a higher chance of side effects (i.e. something else is consuming those files that we forgot or don't know about)

I think beyond those two, we'd likely be doing a bit more work. (I'm avoiding adding them to the main pipeline products for the moment, until we know what our roadmap will be.) Do either of these make more or less sense to you?

dustine32 commented 2 years ago

@kltm Thanks! My vote is for option 2 since a side effect might be that we get closer to something like a standard set of GO-CAM JSON products tied to GO releases (once this is running in the main pipeline). See https://github.com/geneontology/go-site/issues/1180#issuecomment-962278856 for a bit more detail.

For changing the GO-CAM web app, I believe the steps are:

  1. Update JSON endpoint URLs in web-gocam here.
  2. Deploy web-gocam changes to the S3 static site - exact details on this are murky right now, but it looks like this deploy.sh script is a good place to start (roughly sketched below).
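
Roughly, the deploy amounts to something like this (the build command, bucket name, and distribution ID are placeholders; the real values are in deploy.sh):

```bash
# Build the app, sync the output to the static-site bucket, then invalidate
# CloudFront. deploy.sh apparently does an explicit recursive delete of the
# bucket first; `aws s3 sync --delete` is the gentler equivalent shown here.
npm install
npm run build                                    # placeholder build command
aws s3 sync dist/ s3://EXAMPLE-GOCAM-SITE-BUCKET --delete
aws cloudfront create-invalidation \
    --distribution-id EXAMPLE_DISTRIBUTION_ID --paths "/*"
```
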
kltm commented 2 years ago

@dustine32 Okay, it looks like 1 is done and committed; I've created a PR https://github.com/geneontology/web-gocam/pull/18 . I think that's probably safe to merge, no? At worst, it might automatically update and problem solved. I've tested it locally and it appears to be going to the correct location.

For 2, this is a bit worrying: https://github.com/geneontology/web-gocam/blob/c4e4bf6cf4c190c757e40c9fbe47c3260907cfa6/deploy.sh#L2 I'm not wild about a recursive delete of production, but it otherwise seems straightforward. The local tools seem to work as advertised after doing an npm install. However, I think that you are currently the go-to person, given the .cloud credentials?

dustine32 commented 2 years ago

@kltm Yep, I'll try to make sure it's deployed today, tip-toeing around the recursive delete (I'll prob have to do it but I'll see what I accomplish without it first).

kltm commented 2 years ago

I suppose a Friday afternoon is probably the best time to try things like this anyways. It will probably all go fine, but if you run into any hiccups, don't hesitate to ping me (or we could do it together if you want company).

kltm commented 2 years ago

Caught up with @dustine32 and updated the TODO list above. We'll revisit after this upcoming Friday.

kltm commented 2 years ago

Talked to @dustine32 and "complete transfer (or remapping) of S3 and CF resources to USC" is completed.

dustine32 commented 2 years ago

Expanding on https://github.com/geneontology/pipeline/issues/265#issuecomment-1036846298: with the S3 and CF transfer to USC AWS, we now have control over the GO-CAM site code that is served on geneontology.cloud and thus from where the GO-CAM site will fetch the JSON files.

So, if we ever need to change JSON filenames or location, we just have to PR the changes (an example https://github.com/geneontology/web-gocam/pull/18) and run the deploy.sh script. Be sure to update the correct CF distribution ID in the deploy.sh create-invalidation cmd, which we can now find under the USC CF list.
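
In other words, the whole update flow once such a PR is merged looks roughly like this (deploy.sh usage as described above; nothing here beyond what's already in the thread):

```bash
# Get the repo with the merged endpoint change, then run the deploy script.
git clone https://github.com/geneontology/web-gocam.git
cd web-gocam
# Before running: check that the create-invalidation --distribution-id inside
# deploy.sh matches the distribution now listed under the USC CF account.
./deploy.sh
```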