jmignault closed this issue 5 years ago
Interestingly, @jmignault, we haven't tried a build on AWS yet, so could be related there?
Otherwise, more often than not, 502 errors at `/combine` occur when gunicorn is not running and you're seeing the generic nginx response.
Does it look a bit like this?
If so, the first step would probably be determining if a) gunicorn is running, and b) why not, if not.
The logs for gunicorn can be found at `/var/log/gunicorn/gunicorn.stdout` and `/var/log/gunicorn/gunicorn.stderr`. It can be restarted with `sudo supervisorctl restart gunicorn`.
Or, Combine can be run without gunicorn for debugging via the following:

```shell
cd /opt/combine
source activate combine
./runserver.py
# then, can access at HOST:8000/combine
```
Curious, was this a tagged release, or just a default build from Combine-Playbook, which defaults to the `master` branch of Combine?
It does look like that, but `/combine` loads fine; it's `/combine/system` that takes forever, resulting in the 502 error.
gunicorn was running when I checked, I restarted it anyway and got the same results.
I installed via the combine-playbook repo, yes.
Apologies! I missed that it was `/combine/system` in your comments. Yeah, 502s, when gunicorn is running, are almost always timeouts.
That's a somewhat strange page for a timeout, as it's really only querying for Spark and Celery statuses, but if one of them were gummed up, could cause that.
A couple of things I would try:
1) Restart Livy and Celery; both run under supervisor: `sudo supervisorctl restart livy celery`
2) If that doesn't help, try running Combine's Django app via `runserver.py` as outlined above and wait for the page to load (if it does), which might provide more information.
Or, a server reboot, as most things are meant to come back up via supervisor.
Did you encounter this 502 first thing after an install?
If you paste the supervisor status here, that could be helpful as well: `sudo supervisorctl status`, e.g. (where my gunicorn is stopped):
```
celery                 RUNNING   pid 24491, uptime 1:22:53
gunicorn               STOPPED   Oct 23 06:19 PM
hdfs:datanode          RUNNING   pid 1487, uptime 69 days, 22:04:26
hdfs:namenode          RUNNING   pid 1486, uptime 69 days, 22:04:26
livy                   RUNNING   pid 879, uptime 17 days, 23:23:54
pyjxslt                RUNNING   pid 24664, uptime 55 days, 4:59:49
spark:spark_driver     RUNNING   pid 7394, uptime 25 days, 23:11:11
spark:spark_worker     RUNNING   pid 7393, uptime 25 days, 23:11:11
```
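As an aside, when a status paste gets long, a tiny script can flag anything that isn't RUNNING. This is only a sketch (the function name is made up); it just parses the two-column `supervisorctl status` format shown above:

```python
def not_running(status_text):
    """Return names of supervisor programs whose state isn't RUNNING.

    Expects plain `supervisorctl status` output lines of the form:
    <name> <STATE> <details...>
    """
    flagged = []
    for line in status_text.strip().splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[1] != "RUNNING":
            flagged.append(parts[0])
    return flagged
```

Fed the paste above, it would return `["gunicorn"]`.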
supervisor status:

```
gunicorn               RUNNING   pid 3278, uptime 0:16:36
hdfs:datanode          RUNNING   pid 1048, uptime 2:23:58
hdfs:namenode          RUNNING   pid 1047, uptime 2:23:58
livy                   RUNNING   pid 3432, uptime 0:02:03
pyjxslt                RUNNING   pid 1062, uptime 2:23:58
spark:spark_driver     RUNNING   pid 1059, uptime 2:23:58
spark:spark_worker     RUNNING   pid 1058, uptime 2:23:58
```
Yet I still get the timeouts.
I had tried running the static file ingest from the Quickstart. The job failed immediately. I then tried setting up a record group and ingesting via an OAI endpoint, which also failed. There are 4 jobs in the system at the moment. Should I clear them out and start over? How would I go about that?
It's strange that `celery` is not there, as that's suspiciously one of the things that the `/system` route checks the status of. Formerly it used Django-Background-Tasks, but switched to Celery a while ago.
If you don't mind me asking, when did you clone and build?
It's possible the Jobs will look a little ugly in the system, but I wouldn't think they would prevent page loads. But Organization, Record Group, and Job deletions from Combine are all handled as background tasks, so if those aren't functioning, that's a problem.
I have to run to a meeting, but can follow up with some Python shell commands via `./runconsole.py` from `/opt/combine` for manually clearing things out.
But, if you have the facility, a fresh rebuild might be a good option. Without `celery` present, I'm wondering if it's an older version.
I don't mind at all. I cloned and built about 4 hours ago.
celery is actually there; it was a mistake in pasting.
Ah, good to know. Also curious: did you check out a release before provisioning, or begin the process from the default `master`? Asking because I'm wondering why `celery` is not a supervisor program; it's been there since v0.3.
Do you see this block at the end of the file `/etc/supervisor/supervisord.conf`?
```ini
[program:celery]
directory=/opt/combine
command=/usr/local/anaconda/envs/combine/bin/celery -A core worker -l info --concurrency 1
autostart=true
autorestart=true
stderr_logfile=/var/log/celery.stdout
stdout_logfile=/var/log/celery.stderr
user=combine
```
Or maybe it just didn't get copied when pasting here?
After installing Combine, Livy requires being started once (and should restart automatically after that). This is covered in the quickstart a bit, but that's also something we're working on redoing. That is likely why the Jobs failed, if Livy was never started. Ideally, in future versions, Livy will start even after the first server build as well.
Happy to keep troubleshooting, but that `celery` process being gone from supervisor is what I would focus on, as it's suspicious. Is it possible the Ansible build errored out?
No, I'm sorry, celery was the first item in the output. It was a mistake in copying and pasting the output of `sudo supervisorctl status`.
Also, I did not check out a tagged release, but just worked from the default master. Reading the playbook README, it looks like the machine may also have too little RAM: I configured the instance with 8GB and it says it needs 12-16. Maybe I'll just kill this instance, make a new one, and try reinstalling. Is there a tag you suggest I use?
Ah, thanks for the sanity check. At least celery is installed!
So, you didn't ask for it, but some backstory. We're in the process of figuring out a couple of deployment approaches, and it's currently kind of awkwardly in the middle. We realized that 12-16GB of RAM is a big ask for some folks, and have been trying configurations for smaller builds. The good news: we're finding that smaller builds will work, but they require some different parameters.
For example, building with Vagrant will succeed on a 4GB machine, and works just fine up to 100k, 200k, 500k records (more the size of a single Job than total records, but that's a different story). The Ansible build -- which it sounds like you're doing -- has default system configurations that work well at 12-16GB, but should likely still work at 8GB. I think.
If you do try reinstalling, v0.3.1 is the most recent release, and it built okay for me just last week. It's got a couple of little bugs that will be addressed in v0.3.2 or v0.4, but nothing major. I would have thought your build of `master` would have been fine, but I suppose it couldn't hurt to pin to v0.3.1.
But restarting `livy` and `celery` would stop anything that was gumming things up, and it's a bit of a mystery why it's hanging on `/system` without digging around more.
Before a rebuild, perhaps a reboot might be worth a shot?
I've actually rebooted the instance a few times, it doesn't seem to be helping. I'll try it again and see if it improves.
ok, I've restarted. Opening `/combine/system`, having done nothing else, yields this:
Why does it say "gone" there?
```
ubuntu@ip-172-31-17-104:~$ sudo supervisorctl status
celery                 RUNNING   pid 1080, uptime 0:03:40
gunicorn               RUNNING   pid 1065, uptime 0:03:40
hdfs:datanode          RUNNING   pid 1059, uptime 0:03:40
hdfs:namenode          RUNNING   pid 1058, uptime 0:03:40
livy                   RUNNING   pid 1070, uptime 0:03:40
pyjxslt                RUNNING   pid 1095, uptime 0:03:40
spark:spark_driver     RUNNING   pid 1091, uptime 0:03:40
spark:spark_worker     RUNNING   pid 1089, uptime 0:03:40
```
That's okay! It means the Livy session is gone and can't communicate with Spark. I would try clicking "remove", and then starting a new session when the page reloads.
Ideally, even if gone, it will refresh when running a new job, but it can't hurt to do it manually. You're looking for it to be green and say "idle".
Clicking the green "start a new session" button eventually results in the "Bad gateway" error.
```
ubuntu@ip-172-31-17-104:~$ sudo supervisorctl status
celery                 RUNNING   pid 1080, uptime 0:09:58
gunicorn               RUNNING   pid 1065, uptime 0:09:58
hdfs:datanode          RUNNING   pid 1059, uptime 0:09:58
hdfs:namenode          RUNNING   pid 1058, uptime 0:09:58
livy                   RUNNING   pid 1070, uptime 0:09:58
pyjxslt                RUNNING   pid 1095, uptime 0:09:58
spark:spark_driver     RUNNING   pid 1091, uptime 0:09:58
spark:spark_worker     RUNNING   pid 1089, uptime 0:09:58
```
All good to know. Starting to feel like it's port/firewall related.
Clicking the green button should reload immediately, indicating that Livy is starting.
But if it's hanging, it's as if a `LivySession` does in fact exist, but it hangs while attempting to ping Livy for an update to its status. The Django app attempts to ping `localhost:8998` if those settings are untouched.
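If you want to see what Livy itself reports, its REST API can be queried directly. A minimal sketch, assuming the default `localhost:8998` endpoint (the helper names here are made up, but the `/sessions` response shape follows Livy's REST API):

```python
import json
from urllib.request import urlopen

LIVY_BASE = "http://localhost:8998"  # Livy's default port

def session_states(sessions_json):
    """Map Livy session id -> state (e.g. 'idle', 'starting', 'dead')."""
    return {s["id"]: s["state"] for s in sessions_json.get("sessions", [])}

def fetch_session_states(base_url=LIVY_BASE):
    """Query Livy's /sessions endpoint and summarize the session states."""
    with urlopen(base_url + "/sessions", timeout=5) as resp:
        return session_states(json.load(resp))
```

A healthy, usable session should report `idle`; if this request itself hangs, that points at the same network issue the Django app is hitting.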
I wonder if the AWS server has firewall settings such that these internal requests hang? Might try to spin up an AWS instance tomorrow to test; it's high time we tried there too.
Sorry for the inconvenience!
Hi @jmignault, a bit of good news, if you can call it that. I built an instance on AWS and was able to recreate the issues you are reporting (hanging on Livy session start, eventual 502 from Gunicorn).
Don't have an answer yet, but glad to see the problem is reproducible, and it would seem to be unique to AWS builds, which is a great thing to figure out. Thanks for raising this; will follow up here.
Ah, good to know. I opened 8998 this morning and Livy runs fine now. I deleted the existing stuck jobs, and that worked fine. Running a new job, however, initially looked good - status was "waiting" - but it finished in seconds and resulted in no records. This is attempting to harvest via OAI from one of our partners, which I have been able to harvest in REPOX. It's about 6k records.
I was going to attempt the static XML harvest again, but since rebooting, the files in /tmp have disappeared. Are they available somewhere else?
Pretty new to AWS; this is interesting. I was just testing a static harvest, and noticed it was hanging in the last stage, indexing to ElasticSearch, which runs on port `9200`. While it was hung, I opened that port through the AWS console and watched the Job finish.
I could be wrong, but this suggests to me that internal ports might need to be manually opened -- er, opened through the AWS console -- for Combine to work. I had assumed these firewalls / AWS security groups applied largely from the outside in, but they seem to affect internal traffic as well. Which is possible: as I type this out, I believe some calls go to the application IP (we had envisioned the systems potentially living on different servers), which would trigger these rules. That's something for future Combine configuration: the ability to explicitly set these to `localhost` or `127.0.0.1` to avoid these issues.
But, for now, it appears to work by opening up the following ports for the various systems in play:

- `8998`: Livy, communicates with Spark
- `9200`: ElasticSearch, used for indexing records
- `4040-4050`: port range the Spark application uses, beginning at `4040` but climbing if unavailable
- `8080`: local Spark cluster
- `80`: nginx serving the gunicorn-run Django app
- `8000`: testing port for running the Django app via `runserver.py`
That's a bit of a pain to set up, but hopefully once that's done, things should work more smoothly.
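To sanity-check those ports from the Combine host itself (or from wherever the calls originate), a quick TCP probe can tell you which ones are reachable. A sketch, assuming the default ports listed above:

```python
import socket

def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example check against the default Combine service ports:
# for name, port in [("livy", 8998), ("elasticsearch", 9200),
#                    ("spark", 8080), ("nginx", 80)]:
#     print(name, port, port_open("127.0.0.1", port))
```

If a port probes open locally but the app still hangs, the calls are likely going to the instance's public IP rather than `127.0.0.1`, which is where the security group rules bite.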
As for why the OAI harvest didn't work, one place to look is clicking the "Monitor" button from the Jobs table. This will take you to the "Spark Details" tab in the Job details, which can have useful information.
A Job with no real errors looks something like this for the "Livy Statement Information":
```json
{
  "code": "from jobs import HarvestStaticXMLSpark\nHarvestStaticXMLSpark(spark, job_id=\"1\").spark_function()",
  "id": 0,
  "output": {
    "data": {
      "text/plain": "[DEBUG] 2018-10-24 13:00:22,782 core.apps.ready (35): Core application ready method preperations firing\n/usr/local/anaconda/envs/combine/lib/python3.5/site-packages/django/db/models/fields/__init__.py:1451: RuntimeWarning: DateTimeField JobTrack.finish_timestamp received a naive datetime (2018-10-24 13:01:13.620120) while time zone support is active.\n RuntimeWarning)"
    },
    "execution_count": 0,
    "status": "ok"
  },
  "progress": 1,
  "state": "available"
}
```
But if there are problems, they will show up here. I'd say one of Combine's biggest strengths and weaknesses is running jobs in Spark. The pros: very performant and efficient. The cons: hard to bubble up errors, as it's sitting behind Livy as well.
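For pulling the useful bits out of that statement JSON without reading it by eye, something like this works (a sketch; `summarize_statement` is a made-up helper, but the `state` / `output.status` / `evalue` / `traceback` fields follow Livy's Jupyter-style statement output shown above):

```python
def summarize_statement(stmt):
    """Extract state and any error details from a Livy statement dict."""
    output = stmt.get("output") or {}
    summary = {
        "id": stmt.get("id"),
        "state": stmt.get("state"),
        "status": output.get("status"),
    }
    # On failure, Livy reports Jupyter-style error fields
    if output.get("status") == "error":
        summary["error"] = output.get("evalue")
        summary["traceback"] = output.get("traceback", [])
    return summary
```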
One thing to look out for with OAI harvests is redirects. For example, our own http://digital.library.wayne.edu/api/oai redirects to `https`, but Combine won't follow it. I could be mistaken, but I believe this is a limitation of the DPLA Ingestion 3 code that handles OAI harvesting.
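If you want to confirm whether an OAI endpoint redirects before pointing Combine at it, you can make a request without following redirects and inspect the `Location` header. A sketch using only the standard library (the URL to test is whatever your provider gives you):

```python
import urllib.error
from urllib.request import HTTPRedirectHandler, Request, build_opener

class _NoRedirect(HTTPRedirectHandler):
    """Refuse to follow redirects so we can observe them."""
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # makes urllib raise HTTPError on 3xx responses

def check_redirect(url, timeout=10.0):
    """Return (status_code, redirect_location_or_None) for url."""
    opener = build_opener(_NoRedirect)
    try:
        with opener.open(Request(url), timeout=timeout) as resp:
            return resp.getcode(), None
    except urllib.error.HTTPError as e:
        return e.code, e.headers.get("Location")
```

A result like `(301, "https://...")` means the harvester would be handed a redirect rather than records; pointing Combine at the final `https` address sidesteps the issue.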
Meant to add: once I opened up all those ports, I'm not seeing it hang anywhere, even under Gunicorn (port `80`).
Very good news: I added port 9200 and just successfully harvested 128 records from Fordham University.
Great! Thanks again for raising this, really good to know.
If you get curious about other settings, and possible workarounds for these ports, the primary Combine configuration file can be found at `/opt/combine/combine/localsettings.py`. Undoubtedly there is room for improvement in internal network traffic.
ok, I'm going to dig into it a bit further. I would prefer not to have to open all those ports to the world. Where should I continue communicating with you?
We have a Slack workspace that has worked pretty well; I'll send along an invite to the email I have on hand. Email is also good.
Might close this issue here.
I've installed Combine on an AWS instance via Ansible. It's 2 cores, 8GB RAM, 20GB disk, Ubuntu 18.04, Python 2.7. Whenever I try to navigate to `/combine/system`, I get a number of 502 Bad Gateway errors like:

```
2018/10/23 18:09:06 [error] 847#847: *39 upstream prematurely closed connection while reading response header from upstream, client: 172.254.73.146, server: 34.218.81.166, request: "GET /combine/system HTTP/1.1", upstream: "http://unix:/opt/combine/combine.sock:/combine/system", host: "34.218.81.166", referrer: "http://34.218.81.166/combine/"
```

Any ideas?