MI-DPLA / combine

Combine /kämˌbīn/ - Metadata Aggregator Platform

Getting 502 errors #332

Closed: jmignault closed this issue 5 years ago

jmignault commented 5 years ago

I've installed combine on an AWS instance via ansible. It's 2 cores, 8GB RAM, 20 GB disk, Ubuntu 18.04, Python 2.7. Whenever I try to navigate to /combine/system, I get a number of 502 Bad Gateway errors like:

2018/10/23 18:09:06 [error] 847#847: *39 upstream prematurely closed connection while reading response header from upstream, client: 172.254.73.146, server: 34.218.81.166, request: "GET /combine/system HTTP/1.1", upstream: "http://unix:/opt/combine/combine.sock:/combine/system", host: "34.218.81.166", referrer: "http://34.218.81.166/combine/"

Any ideas?

ghukill commented 5 years ago

Interestingly, @jmignault, we haven't tried a build on AWS yet, so it could be related to that?

Otherwise, more often than not, 502 errors at /combine mean gunicorn is not running and you're seeing the generic nginx response.

Look a bit like this?

[screenshot: generic nginx 502 Bad Gateway page]

If so, the first step would probably be determining if a) gunicorn is running, and b) why not, if not.

The logs for gunicorn can be found at /var/log/gunicorn/gunicorn.stdout and /var/log/gunicorn/gunicorn.stderr.

Can be restarted with: sudo supervisorctl restart gunicorn.
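
Putting that together, a quick check might look like the following; the socket path comes from the nginx error above, and curl's --unix-socket option (curl 7.40+) lets you hit gunicorn directly, bypassing nginx:

# is gunicorn up, and is its socket answering? (bypasses nginx entirely)
sudo supervisorctl status gunicorn
curl -s --unix-socket /opt/combine/combine.sock http://localhost/combine/ | head -n 5

# if not, check why it died, then restart it
tail -n 50 /var/log/gunicorn/gunicorn.stderr
sudo supervisorctl restart gunicorn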

Or, combine can be run without gunicorn for debugging via the following:

cd /opt/combine
source activate combine
./runserver.py

# then, can access at HOST:8000/combine

Curious, was this a tagged release, or just a default build from Combine-Playbook, which defaults to the master branch of Combine?

jmignault commented 5 years ago

It does look like that, but /combine loads fine; it's /combine/system that takes forever, resulting in the 502 error.

gunicorn was running when I checked; I restarted it anyway and got the same results.

I installed via the combine-playbook repo, yes.

ghukill commented 5 years ago

Apologies! I missed that it was /combine/system in your comments. Yeah, 502s, when gunicorn is running, are almost always timeouts.

That's a somewhat strange page for a timeout, as it's really only querying for Spark and Celery statuses, but if one of them were gummed up, it could cause that.
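
For what it's worth, gunicorn kills workers that exceed its timeout (30 seconds by default), which nginx then surfaces as a 502. A quick way to see what timeout is actually in play, assuming gunicorn's program block lives in the same supervisord.conf as the celery block quoted further down:

# show gunicorn's supervisor program block and whether a --timeout flag is set
# (gunicorn's default worker timeout is 30 seconds)
grep -A 10 "program:gunicorn" /etc/supervisor/supervisord.conf

# or inspect the running process directly
ps aux | grep [g]unicorn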

A couple of things I would try:

1) restarting Livy and Celery; both run under supervisor: sudo supervisorctl restart livy celery

2) if that doesn't help, try running Combine's Django app via runserver.py as outlined above and wait for the page to load (if it does), which might provide more information.

Or try a server reboot, as most things are meant to come back up via supervisor.

Did you encounter this 502 first thing after an install?

ghukill commented 5 years ago

If you paste the supervisor status here, that could be helpful as well: sudo supervisorctl status

e.g. (where my gunicorn is stopped):

celery                           RUNNING   pid 24491, uptime 1:22:53
gunicorn                         STOPPED   Oct 23 06:19 PM
hdfs:datanode                    RUNNING   pid 1487, uptime 69 days, 22:04:26
hdfs:namenode                    RUNNING   pid 1486, uptime 69 days, 22:04:26
livy                             RUNNING   pid 879, uptime 17 days, 23:23:54
pyjxslt                          RUNNING   pid 24664, uptime 55 days, 4:59:49
spark:spark_driver               RUNNING   pid 7394, uptime 25 days, 23:11:11
spark:spark_worker               RUNNING   pid 7393, uptime 25 days, 23:11:11

jmignault commented 5 years ago

supervisor status:

gunicorn                         RUNNING   pid 3278, uptime 0:16:36
hdfs:datanode                    RUNNING   pid 1048, uptime 2:23:58
hdfs:namenode                    RUNNING   pid 1047, uptime 2:23:58
livy                             RUNNING   pid 3432, uptime 0:02:03
pyjxslt                          RUNNING   pid 1062, uptime 2:23:58
spark:spark_driver               RUNNING   pid 1059, uptime 2:23:58
spark:spark_worker               RUNNING   pid 1058, uptime 2:23:58

Yet I still get the timeouts.

I had tried running the static file ingest from the Quickstart. The job failed immediately. I then tried setting up a record group and ingesting via an OAI endpoint, which also failed. There are 4 jobs in the system at the moment. Should I clear them out and start over? How would I go about that?

ghukill commented 5 years ago

It's strange that celery is not there, since that's suspiciously one of the very things the /system route checks the status of.

Formerly it used Django-Background-Tasks, but we switched to celery a while ago.

If you don't mind me asking, when did you clone and build?

It's possible the Jobs will look a little ugly in the system, but I wouldn't think they would prevent page loads. However, Organization, Record Group, and Job deletions from Combine are all handled as background tasks, so if those aren't functioning, that's a problem.

I have to run to a meeting, but I can follow up with some python shell commands via /runconsole.py from /opt/combine for manually clearing things out.
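
Very roughly, that manual cleanup would look something like this (the core.models import below is an assumption about Combine's internals, not a confirmed API):

cd /opt/combine
source activate combine
./runconsole.py

# then, in the resulting Python shell (model name assumed, not confirmed):
# >>> from core.models import Job
# >>> Job.objects.all()           # inspect the stuck Jobs first
# >>> Job.objects.all().delete()  # note: normal deletes are handled as background tasks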

But if you have the capacity, a fresh rebuild might be a good option. Without celery present, I'm wondering if it's an older version.

jmignault commented 5 years ago

I don't mind at all. I cloned and built about 4 hours ago.

celery is actually there; it was a mistake in pasting.

ghukill commented 5 years ago

Ah, good to know. Also curious: did you check out a release before provisioning, or begin the process from the default master?

I ask because I'm wondering why celery is not showing as a supervisor program; it's been there since v0.3.

Do you see this block at the end of the file /etc/supervisor/supervisord.conf?

[program:celery]
directory=/opt/combine
command=/usr/local/anaconda/envs/combine/bin/celery -A core worker -l info --concurrency 1
autostart=true
autorestart=true
stderr_logfile=/var/log/celery.stderr
stdout_logfile=/var/log/celery.stdout
user=combine

Or maybe it just didn't get copied when pasting here?

After installing Combine, Livy requires being started once (and should restart automatically after that). This is covered a bit in the quickstart, but it's also something we're working on redoing. That is likely why the Jobs failed, if Livy was never started. Ideally, in future versions, Livy will start automatically on the first server build as well.
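
One quick way to see whether Livy has ever been started is its REST API on port 8998 (the port Combine's settings point at by default):

# an empty "sessions" list means Livy has not been started yet;
# a healthy, started session reports "state": "idle"
curl -s http://localhost:8998/sessions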

Happy to keep troubleshooting, but that celery process being gone from supervisor is what I would focus on, as it's suspicious. Is it possible the Ansible build errored out?

jmignault commented 5 years ago

No, I'm sorry, celery was actually the first item in the output. It was a mistake in copying and pasting the output of sudo supervisorctl status.

Also, I did not check out a tagged release; I just worked from the default master. Reading the playbook README, it looks like the machine may also have too little RAM: I configured the instance with 8GB and it says it needs 12-16GB. Maybe I'll just kill this instance, make a new one, and try reinstalling. Is there a tag you suggest I use?

ghukill commented 5 years ago

Ah, thanks for the sanity check. At least celery is installed!

So, you didn't ask for it, but some backstory. We're in the process of figuring out a couple of deployment approaches, and it's currently kind of awkwardly in the middle. We realized that 12-16GB of RAM is a big ask for some folks, and have been trying configurations for smaller builds. The good news is we're finding that smaller builds will work, but they require some different parameters.

For example, a Vagrant build will complete successfully on a 4GB machine, and works just fine up to 100k, 200k, even 500k records (that's more about the size of a single Job than total records, but that's a different story). The Ansible build, which it sounds like you're doing, has default system configurations that work well on 12-16GB, but should likely still work at 8GB. I think.

If you do try re-installing, v0.3.1 is the most recent release, and built okay for me just last week. It's got a couple of little bugs that will be addressed in v0.3.2 or v0.4, but nothing major. I would have thought your build of master would have been fine, but I suppose it couldn't hurt to pin to v0.3.1.

But restarting livy and celery should stop anything that was gumming things up, and without digging around more, it's a bit of a mystery why it's hanging on /system.

Before a rebuild, perhaps a reboot might be worth a shot?

jmignault commented 5 years ago

I've actually rebooted the instance a few times; it doesn't seem to be helping. I'll try it again and see if it improves.

jmignault commented 5 years ago

ok, I've restarted. Opening /combine/system, having done nothing else, yields this:

[screenshot: /combine/system page showing the Livy/Spark session status as "gone"]

Why does that say "gone" there?

ubuntu@ip-172-31-17-104:~$ sudo supervisorctl status
celery                           RUNNING   pid 1080, uptime 0:03:40
gunicorn                         RUNNING   pid 1065, uptime 0:03:40
hdfs:datanode                    RUNNING   pid 1059, uptime 0:03:40
hdfs:namenode                    RUNNING   pid 1058, uptime 0:03:40
livy                             RUNNING   pid 1070, uptime 0:03:40
pyjxslt                          RUNNING   pid 1095, uptime 0:03:40
spark:spark_driver               RUNNING   pid 1091, uptime 0:03:40
spark:spark_worker               RUNNING   pid 1089, uptime 0:03:40

ghukill commented 5 years ago

That's okay! It means the Livy Session is gone and can't communicate with Spark. But I would try clicking "remove", and then starting a new session when the page reloads.

Ideally, even if gone, it will refresh when running a new job, but it can't hurt to do it manually. You're looking for it to be green and say "idle".
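
For the curious, the underlying Livy REST calls look roughly like this; Combine manages sessions through its own models and configuration, so this is only an illustration of what "remove" and "start" correspond to at the Livy level (session id 0 is a placeholder):

# list current Livy sessions, delete a dead one by id, then request a fresh pyspark session
curl -s http://localhost:8998/sessions
curl -s -X DELETE http://localhost:8998/sessions/0
curl -s -X POST -H "Content-Type: application/json" -d '{"kind": "pyspark"}' http://localhost:8998/sessions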

jmignault commented 5 years ago

Clicking the green "start a new session" button eventually results in the "Bad gateway" error.

ubuntu@ip-172-31-17-104:~$ sudo supervisorctl status
celery                           RUNNING   pid 1080, uptime 0:09:58
gunicorn                         RUNNING   pid 1065, uptime 0:09:58
hdfs:datanode                    RUNNING   pid 1059, uptime 0:09:58
hdfs:namenode                    RUNNING   pid 1058, uptime 0:09:58
livy                             RUNNING   pid 1070, uptime 0:09:58
pyjxslt                          RUNNING   pid 1095, uptime 0:09:58
spark:spark_driver               RUNNING   pid 1091, uptime 0:09:58
spark:spark_worker               RUNNING   pid 1089, uptime 0:09:58

ghukill commented 5 years ago

All good to know. Starting to feel like it's port/firewall related.

Clicking the green button should reload immediately, indicating that Livy is starting:

selection_206

But if it's hanging, it's as if a LivySession does in fact exist, but the request hangs while attempting to ping Livy for an update to its status. The Django app is pinging localhost:8998 if these settings are untouched.

I wonder whether the AWS server has firewall settings such that these internal requests are hanging? Might try to spin up an AWS instance tomorrow to test; it's high time we tried there too.
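
One way to test that from the instance itself is to compare a loopback request against one addressed to the server's own IP (the address from the nginx log above), since only the latter is subject to the security group rules:

# loopback request; should return quickly if Livy itself is healthy
curl -s --max-time 10 http://localhost:8998/sessions

# same request via the instance's own address; if this one hangs while the
# loopback call works, the firewall / security group explanation fits
curl -s --max-time 10 http://34.218.81.166:8998/sessions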

Sorry for the inconvenience!

ghukill commented 5 years ago

Hi @jmignault, a bit of good news, if you can call it that. I built an instance on AWS and was able to recreate the issues you are reporting (hanging on Livy session start, eventual 502 from Gunicorn).

I don't have an answer yet, but I'm glad to see the problem is reproducible; it would seem to be unique to AWS builds, which is a great thing to figure out. Thanks for raising this, and I will follow up here.

jmignault commented 5 years ago

Ah, good to know. I opened 8998 this morning and Livy runs fine now. I deleted the existing stuck jobs, and that worked fine. Running a new job, however, initially looked good (status was "waiting"), but it finished in seconds and resulted in no records. This was attempting to harvest via OAI from one of our partners, which I have been able to harvest in REPOX. It's about 6k records.

I was going to attempt the static XML harvest again, but since rebooting, the files in /tmp have disappeared. Are they available somewhere else?

ghukill commented 5 years ago

Being pretty new to AWS, this is interesting. I was just testing a static harvest and noticed it was hanging in the last stage: indexing to ElasticSearch, which runs on port 9200. While it was hung, I was able to open that port through the AWS console and watched the Job finish.

I could be wrong, but this suggests to me that internal ports might need to be manually opened (er, opened through the AWS console) for Combine to work. I had assumed these firewalls / AWS security groups applied largely from the outside in, but they seem to affect internal traffic as well. Which is possible: as I type this out, I believe some calls go to the application IP (we had envisioned the systems potentially living on different servers), which would trigger these rules. That's something for future Combine configuration: the ability to explicitly set these as localhost or 127.0.0.1 to avoid these issues.

But for now, it appears to work by opening up the following ports for the various systems in play:

That's a bit of a pain to set up, but hopefully once that's done, it should work more smoothly.
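
For anyone wary of opening those ports to the world, one option (sketched here with a placeholder security group ID, using the Livy port named in this thread) is a self-referencing security group rule, which keeps the traffic between instances in the group; note that calls addressed to an instance's public IP may still need an explicit rule:

# allow members of the security group to reach each other on Livy's port (8998);
# repeat for the other ports in play (e.g. 9200 for ElasticSearch)
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp \
  --port 8998 \
  --source-group sg-0123456789abcdef0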

As for why the OAI harvest didn't work, one place to look is clicking the "Monitor" button from the Jobs table. This will take you to the "Spark Details" tab in the Job details, which can have useful information.

A Job with no real errors looks something like this for the "Livy Statement Information":

{
  "code": "from jobs import HarvestStaticXMLSpark\nHarvestStaticXMLSpark(spark, job_id=\"1\").spark_function()",
  "id": 0,
  "output": {
    "data": {
      "text/plain": "[DEBUG] 2018-10-24 13:00:22,782 core.apps.ready (35): Core application ready method preperations firing\n/usr/local/anaconda/envs/combine/lib/python3.5/site-packages/django/db/models/fields/__init__.py:1451: RuntimeWarning: DateTimeField JobTrack.finish_timestamp received a naive datetime (2018-10-24 13:01:13.620120) while time zone support is active.\n  RuntimeWarning)"
    },
    "execution_count": 0,
    "status": "ok"
  },
  "progress": 1,
  "state": "available"
}
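
That statement JSON ultimately comes from Livy itself, so if the UI is unavailable, the same information can be pulled straight from Livy's REST API (session id 0 is a placeholder):

# list the statements submitted to a given Livy session
curl -s http://localhost:8998/sessions/0/statements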

But if there are problems, they will show up here. I'd say one of Combine's biggest strengths and weaknesses is running jobs in Spark. The pros: very performant and efficient. The cons: hard to bubble up errors, as it's sitting behind Livy as well.

One thing to look out for with OAI harvests is redirects. For example, our own http://digital.library.wayne.edu/api/oai will redirect to https, but Combine won't follow this. I could be mistaken, but I believe this is a limitation of the DPLA Ingestion 3 code that handles OAI harvesting.
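
A quick way to check for a redirect is to look at the response headers; a 301/302 status with a Location: https://... header would explain a harvest that finishes almost instantly with no records:

# fetch only the headers for an OAI Identify request (body discarded)
curl -s -D - -o /dev/null "http://digital.library.wayne.edu/api/oai?verb=Identify" | head -n 10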

ghukill commented 5 years ago

Meant to add: once I opened up all those ports, I'm not seeing it hang anywhere, even under Gunicorn (port 80).

jmignault commented 5 years ago

Very good news: I added port 9200 and just successfully harvested 128 records from Fordham University.

ghukill commented 5 years ago

Great! Thanks again for raising this, really good to know.

If you get curious about other settings, and possible workarounds for these ports, the primary Combine configuration file can be found at /opt/combine/combine/localsettings.py.

Undoubtedly there is room for improvement in internal network traffic.

jmignault commented 5 years ago

OK, I'm going to dig into it a bit further; I would prefer not to have to open all those ports to everywhere. Where should I continue communicating with you?

ghukill commented 5 years ago

We have a Slack workspace that has worked pretty well; I'll send along an invite to the email I have on hand. Email is also good.

Might close this issue here.