xxkennyxu opened 9 months ago
Hi @xxkennyxu, it looks like the request is timing out. If you're only using `len(running_jobs)`, you could switch to the method `instance.get_runs_count`, which should be faster. If you need the run objects, there are `limit` and `cursor` arguments on `get_run_records` that you could experiment with.

Closing, but let me know if this doesn't work and I'll reopen.
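For reference, a minimal sketch of those two alternatives, assuming the public `DagsterInstance` API available on `context.instance` in a sensor; the job name and the filter values are illustrative placeholders, not code from this thread:

```python
from dagster import DagsterRunStatus, RunsFilter

# Hypothetical filter; "my_job" stands in for the actual job name.
in_progress = RunsFilter(
    statuses=[
        DagsterRunStatus.STARTING,
        DagsterRunStatus.STARTED,
        DagsterRunStatus.QUEUED,
    ],
    job_name="my_job",
)

# Option 1: only a count is needed, so skip fetching run objects entirely.
count = context.instance.get_runs_count(filters=in_progress)

# Option 2: run objects are needed, so cap the result set with `limit`
# (and page through larger result sets with `cursor`).
records = context.instance.get_run_records(filters=in_progress, limit=50)
```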
Thanks! Let me try that :)
Getting the same error unfortunately :(
```
dagster_cloud_cli.core.errors.GraphQLStorageError: Error in GraphQL response: [{'message': 'Internal Server Error (Trace ID: 5143878769993498507)', 'locations': [{'line': 4, 'column': 13}], 'path': ['runs', 'getRunsCount']}]
```
Hey @xxkennyxu - can you try this workaround? Counterintuitively, I think the additional job name filter is making the query less performant and causing the sporadic timeouts, so doing that filter in memory might work better:
```python
from dagster import DagsterRunStatus, RunsFilter

# Fetch all in-progress runs without the job_name filter...
run_records = context.instance.get_run_records(
    filters=RunsFilter(
        statuses=[
            DagsterRunStatus.STARTING,
            DagsterRunStatus.STARTED,
            DagsterRunStatus.QUEUED,
        ],
    )
)

# ...then narrow down to the target job in memory.
run_records = [
    run_record
    for run_record in run_records
    if run_record.dagster_run.job_name == "my_job"
]
```
We'll look into improvements on our side, but that's a change you can make right away in the meantime.

(That workaround won't be good for all cases: if the job name filter is significantly reducing the number of runs returned, fetching everything first is wasteful. But the number of in-progress runs across all jobs is generally capped at a reasonable value, so it should be an option here.)
Yup, this works for now since we don't have too many runs in these states (usually < 50).
Thanks for helping troubleshoot here - I haven't seen any other errors related to this since changing it! 🎉
Can you keep me posted when we're good to use the `job_name` filter again 🙏
Dagster version
dagster, version 1.2.2
What's the issue?
We have a sensor that checks in-progress runs; if there are more than X running, it should not kick off a new materialization.
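A minimal sketch of that kind of throttling sensor, assuming Dagster's standard sensor API; the job name `my_job` and the cap are illustrative placeholders, not the actual code:

```python
from dagster import (
    DagsterRunStatus,
    RunRequest,
    RunsFilter,
    SensorEvaluationContext,
    SkipReason,
    sensor,
)

MAX_IN_FLIGHT = 5  # the "X" above; placeholder value


@sensor(job_name="my_job")  # "my_job" is a placeholder
def throttled_sensor(context: SensorEvaluationContext):
    # Collect runs that are starting, started, or queued for this job.
    running_jobs = context.instance.get_run_records(
        filters=RunsFilter(
            statuses=[
                DagsterRunStatus.STARTING,
                DagsterRunStatus.STARTED,
                DagsterRunStatus.QUEUED,
            ],
            job_name="my_job",
        )
    )
    # Skip this tick if too many runs are already in flight.
    if len(running_jobs) >= MAX_IN_FLIGHT:
        return SkipReason(f"{len(running_jobs)} runs already in progress")
    return RunRequest(run_key=None)
```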
Occasionally we run into this exception:

Any clue why we might be running into this exception? It triggers a Slack alert for us each time.
What did you expect to happen?
No response
How to reproduce?
No response
Deployment type
None
Deployment details
No response
Additional information
No response
Message from the maintainers
Impacted by this issue? Give it a 👍! We factor engagement into prioritization.