Knowledge-Graph-Hub / kg-covid-19

An instance of KG Hub to produce a knowledge graph for COVID-19 response.
https://github.com/Knowledge-Graph-Hub/kg-covid-19/wiki
BSD 3-Clause "New" or "Revised" License

Problem pushing blazegraph journal to SPARQL endpoint in Jenkins #422

Closed justaddcoffee closed 3 years ago

justaddcoffee commented 3 years ago

Describe the bug

In the last "Deploy blazegraph" stage of the Jenkins pipeline, I'm getting an ssh authentication error, see here:

19:35:53  + pwd
19:35:53  + HOME=/var/lib/jenkins/workspace/dge-graph-hub_kg-covid-19_master/ansible
19:35:53  + ansible-playbook update-kg-hub-endpoint.yaml --inventory=hosts.local-rdf-endpoint --private-key=**** -e target_user=bbop --extra-vars=endpoint=internal
19:35:54  [DEPRECATION WARNING]: The TRANSFORM_INVALID_GROUP_CHARS settings is set to 
19:35:54  allow bad characters in group names by default, this will change, but still be 
19:35:54  user configurable on deprecation. This feature will be removed in version 2.10.
19:35:54   Deprecation warnings can be disabled by setting deprecation_warnings=False in 
19:35:54  ansible.cfg.
19:35:54  [WARNING]: Invalid characters were found in group names but not replaced, use
19:35:54  -vvvv to see details
19:35:54  
19:35:54  PLAY [pipeline-rdf] ************************************************************
19:35:54  
19:35:54  TASK [Gathering Facts] *********************************************************
19:35:54  [WARNING]: Unhandled error in Python interpreter discovery for host
19:35:54  pan.lbl.gov: Failed to connect to the host via ssh: Host key verification
19:35:54  failed.
19:35:54  fatal: [pan.lbl.gov]: UNREACHABLE! => {"changed": false, "msg": "Data could not be sent to remote host \"pan.lbl.gov\". Make sure this host can be reached over ssh: Host key verification failed.\r\n", "unreachable": true}
19:35:54  
19:35:54  PLAY RECAP *********************************************************************
19:35:54  pan.lbl.gov                : ok=0    changed=0    unreachable=1    failed=0    skipped=0    rescued=0    ignored=0   
19:35:54  

@kltm, can you remind me how authentication works in order for the ansible playbook to execute properly? Do we need to provide this Docker container an ssh key or something?

To Reproduce

Go here

Expected behavior

Should push blazegraph journal to our SPARQL endpoint

Version

This commit

kltm commented 3 years ago

Yup--a private key for the jenkins user (or one accessible to whatever user is running inside the container). Should be ansible-bbop-local-slave.

kltm commented 3 years ago

It might be worth testing outside of the container to make sure it works. If it does, there is likely some combination of path manipulation ("HOME=pwd"?) or docker weirdness that is causing problems.

justaddcoffee commented 3 years ago

Thanks @kltm

> It might be worth testing outside of the container to make sure it works.

Well, it has been working outside the container for a year or so, so the problem is likely to do with the container. I'm giving Jenkins the credentials here, so maybe it can't find this file when it's running in the docker container.

justaddcoffee commented 3 years ago

> If it does, there is likely some combination of path manipulation ("HOME=pwd"?)

FWIW, this isn't the issue - I've removed the HOME=pwd business and it fails in the same way:

13:57:09  pan.lbl.gov: Failed to connect to the host via ssh: Host key verification
13:57:09  failed.

kltm commented 3 years ago

Okay, I think the issue is that the file you're trying to use in this case either 1) does not exist or 2) has the wrong permissions/mode to be used for the given task. What you're working with is the "file" credential binding (https://www.jenkins.io/doc/pipeline/steps/credentials-binding/). This file (should) exist for real on the filesystem for this to work. I don't believe that Jenkins copies anything into the docker image; rather, it binds things in various cute ways through runtime variables and volume mounts. E.g.:

$ docker run -t -d -u 114:120 -w /var/lib/jenkins/workspace/vid-19_check_ansible_run_jenkins -v /var/lib/jenkins/workspace/vid-19_check_ansible_run_jenkins:/var/lib/jenkins/workspace/vid-19_check_ansible_run_jenkins:rw,z -v /var/lib/jenkins/workspace/vid-19_check_ansible_run_jenkins@tmp:/var/lib/jenkins/workspace/vid-19_check_ansible_run_jenkins@tmp:rw,z -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** justaddcoffee/ubuntu20-python-3-8-5-dev:4 cat

The credential that you need likely exists in one of the mounted volumes, with its exact runtime-bound location hidden in one of those variables.

The question here is how to probe all of this without actually exposing any secrets that we wouldn't want public. Please be conservative in using messages for debugging here--an accidental exposure would be painful.

Possibilities:

If I had to guess at this point, I'd say that playing with permissions/users elsewhere could have caused something like this to happen--the wrong perms or user would prevent an ssh key from getting used by an alien caller. OTOH, given the shell game of passing files through different levels, something getting lost doesn't strike me as too too unlikely either, even though Jenkins is supposedly designed for this.
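
A probe along these lines could check the two failure modes named above (missing file, wrong permissions) without printing the secret or even its path. This is only a rough sketch: `describe_key_file` is a made-up helper name, and it assumes a POSIX shell with `ls` and `awk` available inside the container.

```shell
# Sketch: report on a credential file without printing its contents or path.
# describe_key_file is a hypothetical helper, not part of Jenkins or ansible.
describe_key_file() {
    path="$1"
    if [ -z "$path" ]; then
        echo "variable is empty or unset"
        return 1
    fi
    if [ ! -e "$path" ]; then
        echo "path does not exist inside this container"
        return 1
    fi
    if [ ! -r "$path" ]; then
        # The container user may have no name, so fall back to the numeric uid.
        echo "file exists but is not readable by $(id -un 2>/dev/null || id -u)"
        return 1
    fi
    # Print metadata only (mode, numeric owner, numeric group, size).
    # awk drops the trailing filename field that ls would otherwise show.
    ls -ln "$path" | awk '{print "mode=" $1, "uid=" $3, "gid=" $4, "size=" $5}'
}
```

Inside the pipeline it would be called as `describe_key_file "$DEPLOY_LOCAL_IDENTITY"`, so the log shows only the mode, owner, and size of the bound file.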

justaddcoffee commented 3 years ago

Thanks very much @kltm - I have done this:

> You get the system to that point and have it take a looong nop; somebody then invades the image to try and figure out what's going on. (Lowest risk, but time consuming and annoying to coordinate.)

Here it is with an infinite loop: https://build.berkeleybop.io/job/knowledge-graph-hub/job/kg-covid-19/job/check_ansible_run_jenkins/14/console. Would you mind invading that image and having a look to see why it can't find $DEPLOY_LOCAL_IDENTITY?

kltm commented 3 years ago

Okay, yeah. What I'm seeing does not seem that great. Do you have a channel where we could chat?

justaddcoffee commented 3 years ago

BBOP slack?

justaddcoffee commented 3 years ago

Long story short, the ansible-playbook command fails because there is no entry for pan.lbl.gov in ~/.ssh/known_hosts (since this is running in Docker, which starts with an empty ~/.ssh).

Fix is fairly simple - just do this before we run ansible:

                        sh 'mkdir -p ~/.ssh/'
                        sh 'ssh-keyscan -H pan.lbl.gov >> ~/.ssh/known_hosts'

Thanks again @kltm for help in sorting this out
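
For anyone hitting the same error, the two `sh` steps above can also be sketched as a standalone shell function that fails loudly when the scan returns nothing, rather than leaving `known_hosts` empty. `add_known_host` is a hypothetical helper name and the optional second argument is an assumption added for illustration. Note that `ssh-keyscan` trusts whatever answers on the network at scan time, so comparing the result against a known fingerprint is safer where that is practical.

```shell
# Sketch: add a host's public keys to known_hosts before running ansible.
# add_known_host is a hypothetical helper; $2 optionally overrides ~/.ssh.
add_known_host() {
    host="$1"
    ssh_dir="${2:-$HOME/.ssh}"
    known_hosts="$ssh_dir/known_hosts"

    # A fresh container has no ~/.ssh; ssh expects it to be private to the user.
    mkdir -p "$ssh_dir" && chmod 700 "$ssh_dir" || return 1

    # Scan first, append second, so a failed scan is reported instead of
    # silently leaving known_hosts without an entry for the host.
    keys=$(ssh-keyscan -H "$host" 2>/dev/null)
    if [ -z "$keys" ]; then
        echo "no host keys returned for $host" >&2
        return 1
    fi
    printf '%s\n' "$keys" >> "$known_hosts"
    chmod 600 "$known_hosts"
}
```

Calling `add_known_host pan.lbl.gov` before the `ansible-playbook` step has the same effect as the two lines in the fix, plus a non-zero exit status if the host cannot be scanned.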