coreos / fleet

fleet ties together systemd and etcd into a distributed init system
Apache License 2.0

PUT HTTP API return status 500 without human-readable message #1345

Open gfazioli opened 9 years ago

gfazioli commented 9 years ago

Hi there, I have a PHP loop that makes 10, 20 or 30 cURL requests to the fleet API. I use it as a stress test.

CoreOS stable 723.3.0 
fleetctl 0.10.2

This is my class method:

  public function putUnit( $unitId, $options, $desiredState = 'launched' )
  {
    // Path
    $path = '/fleet/v1/units/' . $unitId;

    // Build the post fields - default only the desired state
    $postFields = [
      'desiredState' => $desiredState,
      'options'      => $options
    ];

    $postFields = json_encode( $postFields );

    $ch = curl_init( $this->endpoint . $path );
    curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
    curl_setopt( $ch, CURLOPT_FOLLOWLOCATION, true );
    curl_setopt( $ch, CURLOPT_CUSTOMREQUEST, 'PUT' );
    curl_setopt( $ch, CURLOPT_POSTFIELDS, $postFields );
    curl_setopt( $ch, CURLOPT_HTTPHEADER, array(
                      'Content-Type: application/json',
                      'Content-Length: ' . strlen( $postFields )
                    )
    );
    $res        = curl_exec( $ch );
    $httpStatus = curl_getinfo( $ch, CURLINFO_HTTP_CODE );
    curl_close( $ch );

    $result = [
      'result'       => $res,
      'status'       => $httpStatus,
      'unit'         => $unitId,
      'options'      => $options,
      'desiredState' => $desiredState
    ];

    return $result;

  }

But sometimes (randomly), the $res = curl_exec( $ch ); call returns an HTTP 500 status with a blank human-readable message. Note that $unitId and $options are very simple, and I keep them the same for each test.

Any suggestions?

Thanks in advance

jonboulle commented 9 years ago

The place to start is probably looking at the fleet engine's logs and looking for errors there (in particular at the time you are receiving 500s on the client)

(Replying by email to Giovambattista Fazioli's message of Mon, Sep 7, 2015 at 8:48 AM.)

wuqixuan commented 9 years ago

@gfazioli @jonboulle Yes, HTTP 500 is "Internal Server Error"; it usually does not tell the client the exact internal reason. You need to check fleetd's logs.

gfazioli commented 9 years ago

Ok, the error is

ERROR units.go:223: Failed creating Unit(web_test-app-40.johnf.service) in Registry: timeout reached

Any ideas?

jonboulle commented 9 years ago

That's indicative of etcd taking too long to respond, and fleet giving up on the request - so now it seems most likely that performance of etcd is your culprit. You could start exploring some of the metrics it exposes to analyse its performance - https://github.com/coreos/etcd/blob/master/Documentation/metrics.md

(Replying by email to Giovambattista Fazioli's message of Tue, Sep 8, 2015 at 10:23 AM.)

gfazioli commented 9 years ago

@jonboulle Ok, I will try, thx

wuqixuan commented 9 years ago

Yes, it seems like etcd is the culprit. It may not be a performance issue, though; etcd may be stopping or hitting some error. This is the handler that logs the error and returns the 500:

func (ur *unitsResource) create(rw http.ResponseWriter, name string, u *schema.Unit) {
    if err := ur.cAPI.CreateUnit(u); err != nil {
        log.Errorf("Failed creating Unit(%s) in Registry: %v", u.Name, err)
        sendError(rw, http.StatusInternalServerError, nil)
        return
    }

    rw.WriteHeader(http.StatusCreated)
}

nicolaballotta commented 9 years ago

Hey guys, I'm joining the conversation because @gfazioli is my colleague (he deals with the code and I deal with the CoreOS cluster). I suspected etcd2 was the culprit. This is our testing infra:

Over the last few days we have been trying to launch 100 units (each with its discovery) for testing, across all 5 machines (not only the two workers). For this reason I suspected that the etcd2 cluster was suffering.

Today we started seeing machines disappearing and reappearing (something we had never seen before). I restarted the etcd2 service on all machines and this stopped. The only error I've seen in the etcd2 logs is:

015/09/09 09:09:48 sender: error posting to 172991be8af6b7f8: unexpected http status Internal Server Error while posting to "http://192.168.101.11:2380/raft"

The next test we're going to do is to launch our units only on worker machines.

By the way, any idea how to reset the etcd2 metrics? At the moment I see a lot of getsFail, deleteFail and createFail, and I can't tell whether they refer to the past or are new.
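One way to tell old failures from new ones without resetting anything is to sample the counters twice and compare, on the assumption that they only increase while etcd keeps running. A rough Go sketch follows. The field names are the ones mentioned above, the sample values are made up for illustration, and how each snapshot is fetched is left out. parseCounters and newSince are hypothetical helper names.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// parseCounters decodes a flat JSON object whose values are all non-negative integers.
func parseCounters(raw string) (map[string]uint64, error) {
	counters := map[string]uint64{}
	if err := json.Unmarshal([]byte(raw), &counters); err != nil {
		return nil, err
	}
	return counters, nil
}

// newSince returns, for every counter that grew between the two samples,
// how much it increased. Counters that stayed flat or went down are omitted.
func newSince(before, after map[string]uint64) map[string]uint64 {
	delta := map[string]uint64{}
	for name, now := range after {
		if prev := before[name]; now > prev {
			delta[name] = now - prev
		}
	}
	return delta
}

func main() {
	// Made-up sample values, not real cluster data.
	first, err := parseCounters(`{"getsFail": 120, "deleteFail": 4, "createFail": 31}`)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	second, err := parseCounters(`{"getsFail": 120, "deleteFail": 4, "createFail": 38}`)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}

	delta := newSince(first, second)
	names := make([]string, 0, len(delta))
	for name := range delta {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		fmt.Printf("%s: +%d since the first sample\n", name, delta[name])
	}
}
```

With these sample values only createFail shows up, with an increase of 7. Anything that does not appear in the output happened before the first sample.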

nicolaballotta commented 9 years ago

FYI these are etcd2 cluster stats

[Screenshot: etcd2 cluster stats, 2015-09-09 12:21:05]

xiang90 commented 9 years ago

@gfazioli You have to update etcd to 2.1 to get all the metrics. See https://github.com/coreos/etcd/blob/release-2.1/Documentation/metrics.md

xiang90 commented 9 years ago

@gfazioli The easiest way to identify whether it is an etcd issue is to monitor the etcd log. You would see leader elections if etcd is not stable or is under very heavy load.

waddles commented 8 years ago

I'm pretty sure this is the same as #1650 but there is not enough information here to tell how fleet and/or fleetctl was being used.