Write standard job launcher function

pbilling commented 1 year ago

Current method for adding new bioinformatics (or other) tasks to Trellis is to create a new Cloud Function specifically tailored to launch jobs of that type of task (e.g. "samtools flagstat"). Limitations of generating separate functions for each task include:

Copying of a lot of boilerplate code across functions
Potential to differences in boilerplate code across functions
Changing mechanisms for launching jobs requires changing every launcher function
Creating a new launcher function is kind of an obtuse process, requiring knowledge of Python and the 'trellisdata' package. If you didn't have an example to look at it, it would be a huge pain.

A better approach could be to write a single job launcher function and use a YAML configuration file to define the parameters of all the supported tasks. Benefits:

Eliminate the need for copy boilerplate code
Does not require Python/Trellis knowledge to implement new tasks
YAML configuration definition is consistent with method for defining Trellis database queries and triggers
Easier to update and maintain code for launching jobs

pbilling commented 1 year ago

When every job had its own launcher function, the job was determined by the pubsub topic that the database query result was published to. The topic was defined as part of the database query. How will I choose the job if all query results are routed through the same function?

I could update the QueryResponse classes to also include a field with the task to be launched.

pbilling commented 1 year ago

Another challenge: How do I specify output URIs? These paths involve multiple variables including ones from the Trellis config, JobLauncher config, and variables defined at runtime (task-id).

pbilling commented 1 year ago

The solution I'm settling on only includes input-specific variables in the task template. Other values (from Trellis, defined at runtime) will be applied uniformly to all job outputs, regardless of task. This changes the ordering of the output elements, but not the content.

An example where bold values have been rearranged and italic values will be defined in the job_launcher function while non-italicized values come from a task template:

Old: gs://{OUT_BUCKET}/{plate}/{sample}/{task.name}/{jobid}/output/{sample}{read_group}.ubam

New: _gs://{OUT_BUCKET}/{task.name}/{jobid}/output/{plate}/{sample}/{sample}_{read_group}.ubam

Here, the plate and sample values have been moved to the end because they (and read_group) are all gotten from the properties of one of the input objects using the string ".format()" method. The structure and types of these input-derived properties may also change from task to task. For instance, if a task requires combining inputs from multiple samples then it doesn't make sense to put them in a path with a singe set of plate/sample values.

Conversely, the {OUT_BUCKET}, {task.name}, and {job_id} values can be applied uniformly to all jobs and their structure will be defined in the job_launcher function. So, the new structure of the outputs reflects a functional organization of values based on how Trellis organizes tasks

pbilling commented 1 year ago

Implement job launcher function and write tests

[x] Write create_job_dict() method
[x] Write _get_job_values() method for translating template inputs/envs to values
[x] Write _get_output_values() method for translating template outputs to values
[x] Write _get_label_values() method for translating template labels to values
[x] Add tests for _get_job_values()
[x] Add tests for _get_output_values()
[x] Add tests for _get_label_values()
[x] Add tests for create_job_dict()
[x] Run unit tests

pbilling commented 1 year ago

Methods for populating dsub values from template implemented in 2b2754fc.

pbilling commented 1 year ago

Task: Create a Cloud Build trigger for Job Launcher function

[x] Create google_cloudbuild_trigger resource in terraform (trellis-mvp-terraform/3-trellis/build-triggers-qc-functions.tf)
[x] Create google_pubsub_topic resource in terraform (trellis-mvp-terraform/3-trellis/pubsub-topics.tf)
[x] Create Cloud Build deployment script in function source directory
[x] Upload job launcher YAML config file to GCS
[x] Add job launcher config path to Trellis config document
[x] Deploy function

pbilling commented 1 year ago

Task: Run integration test

[x] Delete existing launch-fastq-to-ubam function
[x] Update launch-fastq-to-ubam database query "publish_to" topic
[ ] Add a matched pair of Fastqs to the database
[ ] Check cloud function logs

Cloud Functions logs query:

resource.type="cloud_function"
severity=(DEFAULT OR INFO OR NOTICE OR WARNING OR ERROR OR CRITICAL OR ALERT OR EMERGENCY)

pbilling commented 1 year ago

Forgot to implement parse_node_inputs() and parse_relationship_inputs() functions. The parse_inputs() function was originally designed to perform job-specific QA on the inputs, which is an idea I like, but how to do it in a generic manner?

I can pretty easily check that the node labels and relationship type are correct, but more than that would require a pretty significant update to the job config template. I think I'll just keep it simple for now.

pbilling commented 1 year ago

AttributeError: 'QueryResponseReader' object has no attribute 'job_request'

Probably need to upload new version of trellisdata package.

pbilling commented 1 year ago

Task: Update trellisdata package

[x] Add job_request attribute to DatabaseQuery class
[x] Add job_request attribute check to DatabaseQuery equality method
[x] Add JobLauncher class to trellisdata
[x] In trellisdata init.py file, import JobLauncher
[x] Add job_request attribute to QueryResponseWriter & Reader classes
[x] Update db-query to get job_request field from DatabaseQuery instances & pass to QueryResponseWriter object
[x] Update job-launcher function to get job_request field from QueryResponseReader object

pbilling commented 1 year ago

Task: Update trellisdata tests to check new job request features

[x] Test that job_request attribute has been added to DatabaseQuery instances
[x] Create test file for JobLauncher class
[x] Test for job_request handling in QueryResponseWriter & Reader classes

pbilling commented 1 year ago

For testing, I'm using the sample name "test-sample-v1-3" in GCP Secret Manager in the test project

pbilling commented 1 year ago

Status update: successfully launched first fastq-to-ubam job

StanfordBioinformatics / trellis-mvp-functions

Write standard job launcher function #36