alteryx / locust-grasshopper

a load testing tool extended from locust
Apache License 2.0

Tags on Trends are incorrect #40

Closed: sistratton closed this issue 1 week ago

sistratton commented 1 week ago

We're using Trends to capture workflow runtimes, which involves polling our API.

I have a very simple test where I want to run a workflow with 10 users for 10 minutes. The scale-up rate of users is set to 1 (i.e. 1 user every 1 second).
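
(For reference, that shape of run, 10 users at a spawn rate of 1 for 10 minutes, looks roughly like the sketch below using locust's library API; the user class and host are placeholders, not the real grasshopper launch path.)

import gevent
from locust import HttpUser, task, between
from locust.env import Environment

class ExampleUser(HttpUser):          # stand-in for the journey class shown later in this thread
    host = "https://example.invalid"  # hypothetical host, never actually called here
    wait_time = between(20, 30)

    @task
    def noop(self):
        pass

env = Environment(user_classes=[ExampleUser])
runner = env.create_local_runner()
runner.start(10, spawn_rate=1)                   # 10 users, ramping at 1 new user per second
gevent.spawn_later(600, lambda: runner.quit())   # stop the run after 10 minutes
runner.greenlet.join()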

The test currently tags the trends using the following code:

        self.update_tags(
            new_tags={"jobgroup_id": self.jobgroup_id, "jobgroup_status": "IN_PROGRESS", "user": self.vu_number}
        )

However when we run the code we get the following results:

[image]

This shows workflows with the same user and the same jobGroup ID, yet with different times. They also complete at the same time, so they can't be the same user! We get this problem even with a small number of users.

[image]

There are also workflows that have no jobGroup ID (see the above table), and workflows that have 'success' set to 'True' when they actually failed (if you look in the management console).

We're also concerned that the times are wrong, as they don't seem to match what is reported in the management console for the jobgroup, but that could be a delay somewhere else.

jmfiola commented 1 week ago

@sistratton I think this has to do with when we send certain rows to the locust_requests table. Could you please show us the raw rows from InfluxDB that are producing the weird duplicate rows you are seeing in Grafana?

It's weird because I thought I had fixed this by making self.tags an instance-level attribute, so it is not shared between users. But maybe if it's the same user, there's still some weirdness going on...
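
(For anyone following along, here is a minimal sketch, not grasshopper's actual implementation, of why a class-level tags dict is shared across users while an instance-level one is not:)

class JourneyWithSharedTags:
    tags = {}  # class-level attribute: one dict shared by every user instance

    def update_tags(self, new_tags):
        self.tags.update(new_tags)  # mutates the shared dict


class JourneyWithOwnTags:
    def __init__(self):
        self.tags = {}  # instance-level attribute: each user gets its own dict

    def update_tags(self, new_tags):
        self.tags.update(new_tags)


a, b = JourneyWithSharedTags(), JourneyWithSharedTags()
a.update_tags({"user": 1})
print(b.tags)  # {'user': 1} -- user b sees user a's tags

c, d = JourneyWithOwnTags(), JourneyWithOwnTags()
c.update_tags({"user": 1})
print(d.tags)  # {} -- tags stay isolated per user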

jmfiola commented 1 week ago

@sistratton And for the "success" column: that column is only for HTTP requests. It comes from locust-influxdb-listener (see the source code here). You will want to change your Grafana query to use some other tag.
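
(As an illustration, a hedged sketch of querying on the jobgroup_status tag instead; the host, database, and tag names here are assumptions based on this thread, not confirmed against what locust-influxdb-listener actually writes:)

from influxdb import InfluxDBClient  # python 'influxdb' client (InfluxDB 1.x)

# Connection details are placeholders; point this at the instance the listener writes to.
client = InfluxDBClient(host="localhost", port=8086, database="locust")
query = (
    "SELECT * FROM locust_requests "
    "WHERE \"jobgroup_status\" = 'FAILED' AND time > now() - 1h"
)
for point in client.query(query).get_points():
    print(point["time"], point.get("jobgroup_id"), point.get("jobgroup_status"))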

sistratton commented 1 week ago

@jmfiola thanks, how do I see whether the workflow has failed? It is also showing as jobgroup_status 'COMPLETE'. Should the status be 'FAILED'? I guess it must be the tagging issue in that case too :( Attached is a dump of the whole table from this 10 min run; I can also attach other examples if that's useful. [Workflow (Firefly) runtimes https_int.bender.rocks-data-2024-06-13 20_1918.csv](https://github.com/user-attachments/files/15827050/Workflow.Firefly.runtimes..https_int.bender.rocks-data-2024-06-13.20_19_18.csv)

jmfiola commented 1 week ago

@sistratton yeah, it should be jobgroup_status. I THINK what may be happening is that the check in your journey is kicking the user out of that task once it fails, and then any subsequent lines in that journey task are ignored. Does that sound like it could be the case? Could you perhaps link the relevant code in the journey file?
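
(A minimal sketch of that hypothesis: if a failed check aborts the task, any tag update that comes after it never runs, so the stale status is what gets reported. The check() below is a stand-in, not grasshopper's real implementation:)

def check(name, condition, environment=None):
    # Stand-in check that aborts the task on failure (the hypothesis above).
    if not condition:
        raise AssertionError(f"check failed: {name}")

def journey_task(tags):
    tags.update({"jobgroup_status": "IN_PROGRESS"})
    check("workflow successfully completed", False)  # fails and raises here
    tags.update({"jobgroup_status": "FAILED"})       # never reached

tags = {}
try:
    journey_task(tags)
except AssertionError:
    pass
print(tags)  # {'jobgroup_status': 'IN_PROGRESS'} -- the stale status is what gets tagged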

sistratton commented 1 week ago

@jmfiola not all of the workflows are failing; only occasionally does one fail. The tagging issue happens with all the workflows, even the ones that pass. Here's the journey; there aren't any other lines in the journey task after the check, unless it's the wrapper code or the teardown:

class WorkflowRunJourney(TrifactaBaseJourney):
    wait_time = between(min_wait=20, max_wait=30)
    defaults = {
        "workflow_name": "snowflake_simple",
        "execution_engine": "snowflake",
        "test_type": "exploratory",
    }
    workflow_id = None

    def workflow_run_on_start(self):
        workflows_model = WorkflowsModel(
            self.scenario_args, self.scenario_args["global_token"]
        )
        if self.workflow_id is None:
            self.workflow_id = workflows_model.get_workflow_id_by_name(
                workflow_name=self.scenario_args["workflow_name"]
            )

    def on_start(self):
        try:
            super().on_start()
            self.workflow_run_on_start()
        except Exception as e:
            # Locust is swallowing errors in on_start, so log it and then re-raise
            logger.error(f"Error in on_start: {type(e).__name__}: {e}")
            raise e

    @task
    @k8s_job_report
    @custom_trend("PX_TREND_workflow_run")
    def workflow_run_journey(self):
        scenario_arg_tags = [
            "workflow_name",
            "execution_engine",
            "dataset_size",
            "input_type",
            "snowflake_size",
            "workflow_complexity",
            "version",
            "test_type",
        ]
        tags_to_update = {key: self.scenario_args.get(key) for key in scenario_arg_tags}
        self.update_tags(tags_to_update)
        self.update_tags({"jobgroup_id": None, "jobgroup_status": None})
        logger.info(
            f"VU {self.vu_number}: Starting workflow run journey for workflow "
            f"'{self.scenario_args['workflow_name']}' ..."
        )
        workflows_model = WorkflowsModel(
            self.scenario_args, self.scenario_args["global_token"]
        )
        workloads_model = WorkloadsModel(
            self.scenario_args, self.scenario_args["global_token"]
        )
        # self.jobgroup_id is used in k8s_job_report decorator
        self.jobgroup_id, job_id = workflows_model.trigger_workflow_run(
            workflow_id=self.workflow_id,
            execution_engine=self.scenario_args["execution_engine"],
        )
        self.update_tags(
            new_tags={"jobgroup_id": self.jobgroup_id, "jobgroup_status": "IN_PROGRESS", "user": self.vu_number}
        )
        logger.info(
            f"VU {self.vu_number}: workflow triggered. "
            f"(jobgroup id: {self.jobgroup_id})"
        )
        check(
            f"{self.__class__.__name__} | jobgroup id returned",
            self.jobgroup_id and str(self.jobgroup_id).isnumeric(),
            self.environment,
        )
        workload_response = workloads_model.monitor_workload_execution_get_response(
            workload_execution_id=job_id,
            seconds_between_polls=2,
            job_timeout=self.scenario_args["job_timeout"],  # client side timeout
        )
        self.update_tags(
            new_tags={"jobgroup_status": workload_response.workload_execution_status}
        )
        check(
            f"{self.__class__.__name__} | workflow successfully completed",
            workload_response.workload_execution_status == "COMPLETE",
            self.environment,
        )

jmfiola commented 1 week ago

I think this may be related to the k8s job report decorator. Can you try just commenting out that @k8s_job_report line and see if the issue persists? Suzanne also mentioned hopping on a call on Monday, which I am super down for as well.

sistratton commented 1 week ago

Yeah, that was the first thing I thought too, but no, removing it makes no difference. Do all the users share the same locust environment? If so, could the tags dict and/or some values in the influxdb listener be getting overwritten by the tag data from other users?

sistratton commented 1 week ago

(Assuming the base journey object isn't being shared between users.) One quick fix would be just to keep all of the tags as local variables and then only update the tags dict and influxdb listener just before the metrics are sent off. I guess there's still a chance of them getting overwritten when there are loads of users, though (e.g. 200).
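
(A minimal sketch of that quick fix, with stand-in values where the real journey would trigger and poll the workflow; the class below is illustrative, not the actual grasshopper base journey:)

class JourneySketch:
    def __init__(self, vu_number):
        self.vu_number = vu_number
        self.tags = {}  # what the influxdb listener would read

    def update_tags(self, new_tags):
        self.tags.update(new_tags)

    def workflow_run_journey(self):
        # Tags stay local to the task until the very end.
        local_tags = {"user": self.vu_number, "jobgroup_id": None, "jobgroup_status": None}
        local_tags["jobgroup_id"] = 12345           # stand-in for trigger_workflow_run()
        local_tags["jobgroup_status"] = "COMPLETE"  # stand-in for the polling result
        self.update_tags(new_tags=local_tags)       # single write, right before reporting

journey = JourneySketch(vu_number=1)
journey.workflow_run_journey()
print(journey.tags)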

jmfiola commented 1 week ago

> Yeah, that was the first thing I thought too, but no, removing it makes no difference. Do all the users share the same locust environment? If so, could the tags dict and/or some values in the influxdb listener be getting overwritten by the tag data from other users?

This should not be the case, as each user has its own instance-level tags property, so they are pretty much just local variables per user. Which is why I am so perplexed by this... tomorrow I will try to play around with it some more.

jmfiola commented 1 week ago

Furthermore, it looks like there are no users that share the same jobgroup, so I don't think it's a shared tags issue. I think it's just that something is firing twice when it should only be firing once.

sistratton commented 1 week ago

Ah, ok, so it's not that the 9 users are picking up the tags of the 10th user; it's actually that only the 10th user is running and it's sending out 10 results, all with different response times?

[image]

jmfiola commented 1 week ago

Yes, I think that must be it. Which suggests that other rows, not from the custom trend, are being picked up by that query. I suspect that if you make the query more restrictive, so that the trend name must be the workflow trend name, you will not see those duplicates.
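
(For example, a hedged sketch of restricting the query to rows from the custom trend only; the "name" tag and connection details are assumptions about what the listener writes, so adjust to the real schema:)

from influxdb import InfluxDBClient

client = InfluxDBClient(host="localhost", port=8086, database="locust")
# Keep only rows belonging to the workflow trend named in the journey above.
query = (
    "SELECT * FROM locust_requests "
    "WHERE \"name\" = 'PX_TREND_workflow_run' AND time > now() - 10m"
)
for point in client.query(query).get_points():
    print(point["time"], point.get("jobgroup_id"), point.get("user"))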

jmfiola commented 1 week ago

Created a PR for this: https://github.com/alteryx/locust-grasshopper/pull/42

jmfiola commented 1 week ago

Fixed in 1.3.23!