Open gfazioli opened 9 years ago
The place to start is probably the fleet engine's logs: look for errors there, in particular at the times you receive 500s on the client.
On Mon, Sep 7, 2015 at 8:48 AM, Giovambattista Fazioli < notifications@github.com> wrote:
Hi there, I have a PHP loop that makes 10, 20 or 30 cURL requests to the Fleet API. I use it as a stress test.
CoreOS stable 723.3.0, fleetctl 0.10.2
This is my class method:
public function putUnit( $unitId, $options, $desiredState = 'launched' )
{
    // Path
    $path = '/fleet/v1/units/' . $unitId;

    // Build the post fields - default only the desired state
    $postFields = [
        'desiredState' => $desiredState,
        'options'      => $options
    ];
    $postFields = json_encode( $postFields );

    $ch = curl_init( $this->endpoint . $path );
    curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
    curl_setopt( $ch, CURLOPT_FOLLOWLOCATION, true );
    curl_setopt( $ch, CURLOPT_CUSTOMREQUEST, 'PUT' );
    curl_setopt( $ch, CURLOPT_POSTFIELDS, $postFields );
    curl_setopt( $ch, CURLOPT_HTTPHEADER, array(
        'Content-Type: application/json',
        'Content-Length: ' . strlen( $postFields )
    ) );

    $res        = curl_exec( $ch );
    $httpStatus = curl_getinfo( $ch, CURLINFO_HTTP_CODE );
    curl_close( $ch );

    $result = [
        'result'       => $res,
        'status'       => $httpStatus,
        'unit'         => $unitId,
        'options'      => $options,
        'desiredState' => $desiredState
    ];

    return $result;
}
But sometimes (randomly), the call $res = curl_exec( $ch ); returns a 500 HTTP status with a blank human-readable message. Note that $unitId and $options are very simple and I keep them the same for each test.
Any suggestions?
Thanks in advance
@gfazioli @jonboulle, yes, HTTP 500 is "Internal Server Error", and it does not always give the exact internal reason to the client. You need to check fleetd's logs.
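To match a client-side 500 against fleetd's log entries, it can help to record a timestamp for every request. Below is a minimal Go sketch of the same PUT that the PHP method sends, printing a UTC timestamp next to each status code. The endpoint address, the unit name, and the helper names `buildUnitRequest` and `putUnit` are illustrative assumptions, not part of fleet.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// UnitOption is one entry of the "options" array in the request body.
type UnitOption struct {
	Section string `json:"section"`
	Name    string `json:"name"`
	Value   string `json:"value"`
}

// unitBody mirrors the JSON built by the PHP putUnit method.
type unitBody struct {
	DesiredState string       `json:"desiredState"`
	Options      []UnitOption `json:"options"`
}

// buildUnitRequest (hypothetical helper) prepares PUT /fleet/v1/units/<unitID>.
func buildUnitRequest(endpoint, unitID, desiredState string, options []UnitOption) (*http.Request, error) {
	body, err := json.Marshal(unitBody{DesiredState: desiredState, Options: options})
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPut, endpoint+"/fleet/v1/units/"+unitID, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

// putUnit (hypothetical helper) sends the request and prints when it was
// sent, so a 500 can be looked up in the fleetd log at the same time.
func putUnit(client *http.Client, req *http.Request) (int, error) {
	sent := time.Now().UTC()
	resp, err := client.Do(req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	fmt.Printf("%s PUT %s -> %d\n", sent.Format(time.RFC3339), req.URL.Path, resp.StatusCode)
	return resp.StatusCode, nil
}

func main() {
	// The endpoint and unit below are placeholders; substitute your own.
	opts := []UnitOption{{Section: "Service", Name: "ExecStart", Value: "/usr/bin/sleep 3600"}}
	req, err := buildUnitRequest("http://127.0.0.1:8080", "example.service", "launched", opts)
	if err != nil {
		fmt.Println("build failed:", err)
		return
	}
	// Only the prepared request is printed here; call putUnit(http.DefaultClient, req)
	// against a real fleet API endpoint to actually send it.
	fmt.Println(req.Method, req.URL.String())
}
```

Calling `putUnit` in a loop gives one timestamped line per request, which can then be compared with the ERROR lines fleetd writes at the same moments.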
Ok, the error is
ERROR units.go:223: Failed creating Unit(web_test-app-40.johnf.service) in Registry: timeout reached
Any ideas?
That's indicative of etcd taking too long to respond, and fleet giving up on the request - so now it seems most likely that performance of etcd is your culprit. You could start exploring some of the metrics it exposes to analyse its performance - https://github.com/coreos/etcd/blob/master/Documentation/metrics.md
@jonboulle Ok, I will try, thx
Yes, it seems like etcd is the culprit. It may not be a performance issue, though; etcd may be stopping or hitting some error. This is the fleet code that produces the 500:
func (ur *unitsResource) create(rw http.ResponseWriter, name string, u *schema.Unit) {
	if err := ur.cAPI.CreateUnit(u); err != nil {
		log.Errorf("Failed creating Unit(%s) in Registry: %v", u.Name, err)
		sendError(rw, http.StatusInternalServerError, nil)
		return
	}
	rw.WriteHeader(http.StatusCreated)
}
Hey guys, I'm joining the conversation because @gfazioli is my colleague (he deals with the code and I deal with the CoreOS cluster). I suspected etcd2 was the culprit. This is our testing infra:
Over the last few days we have been trying to launch 100 units (each with its discovery) for testing, across all 5 machines (not only the two workers). For this reason I suspected that the etcd2 cluster was suffering.
Today we started seeing machines disappear and reappear (something we had never seen before). I restarted the etcd2 service on all machines and this stopped. The only error I've seen in the etcd2 logs is:
015/09/09 09:09:48 sender: error posting to 172991be8af6b7f8: unexpected http status Internal Server Error while posting to "http://192.168.101.11:2380/raft"
The next test we're going to do is to launch our units only on worker machines.
Btw, any idea how to reset the etcd2 metrics? Atm I see a lot of getsFail, deleteFail and createFail, and I can't tell whether they refer to the past or are new.
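One way to tell old failures from new ones without resetting anything is to sample the counters twice and compare. The Go sketch below assumes the counters are cumulative and arrive as a flat JSON object of integers (as etcd's `/v2/stats/store` endpoint returns); `storeDelta` is a hypothetical helper and the sample values in `main` are invented for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// storeDelta (hypothetical helper) takes two JSON samples of cumulative
// counters and returns how much each counter grew between them.
// Counters that did not change are left out of the result.
func storeDelta(before, after []byte) (map[string]int64, error) {
	var first, second map[string]int64
	if err := json.Unmarshal(before, &first); err != nil {
		return nil, err
	}
	if err := json.Unmarshal(after, &second); err != nil {
		return nil, err
	}
	delta := make(map[string]int64)
	for name, value := range second {
		if d := value - first[name]; d != 0 {
			delta[name] = d
		}
	}
	return delta, nil
}

func main() {
	// Invented sample values; in practice these would be two responses
	// fetched some time apart while the stress test is running.
	before := []byte(`{"getsFail": 120, "deleteFail": 4, "createFail": 31}`)
	after := []byte(`{"getsFail": 120, "deleteFail": 4, "createFail": 35}`)

	delta, err := storeDelta(before, after)
	if err != nil {
		fmt.Println("parse failed:", err)
		return
	}
	for name, d := range delta {
		fmt.Printf("%s grew by %d\n", name, d)
	}
}
```

A counter that appears in the output changed during the sampling window; one that does not appear reflects only past activity.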
FYI these are etcd2 cluster stats
@gfazioli You have to update etcd to 2.1 to get all the metrics. See https://github.com/coreos/etcd/blob/release-2.1/Documentation/metrics.md
@gfazioli The easiest way to identify whether it is an etcd issue is to monitor the etcd log. You would see leader elections if etcd is unstable or under very heavy load.
I'm pretty sure this is the same as #1650 but there is not enough information here to tell how fleet and/or fleetctl was being used.