AlexsLemonade / refinebio

Refine.bio harmonizes petabytes of publicly available biological data into ready-to-use datasets for cancer researchers and AI/ML scientists.
https://www.refine.bio/

Refinebio Processing Cost Estimation Tool #3198

Open davidsmejia opened 1 year ago


Context

We want to be able to predict with some accuracy how much it costs to process any given type of experiment / sample on the refinebio pipeline.

Since we can't know exactly what it would cost to run something we haven't run before, we should be able to use similar historical data to estimate what a potential new experiment / set of experiments would cost to survey and process.

I see this tool being usable in two major ways. First, we could predict how much we would spend to process a requested experiment or collection of experiments. Second, we could explain the processing cost (ignoring human costs) that went into a user's requested dataset.
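At its core, the estimate for a single job reduces to wall-clock duration times the hourly rate of the instance it ran on. A minimal sketch of that idea (the function name and the rate here are illustrative, not refinebio code; real rates would come from the AWS price list):

```python
from datetime import datetime, timedelta

# Illustrative on-demand hourly rate (USD); a real estimate would look this up
# per instance type from the downloaded AWS price list.
HOURLY_RATE_USD = 2.304  # e.g. roughly an m5.12xlarge in us-east-1

def estimate_job_cost(start_time: datetime, end_time: datetime,
                      hourly_rate: float = HOURLY_RATE_USD) -> float:
    """Cost of one job: wall-clock duration in hours times the instance's hourly rate."""
    hours = (end_time - start_time).total_seconds() / 3600
    return hours * hourly_rate

# A 30-minute job at $2.304/hr costs about $1.15.
start = datetime(2023, 1, 1, 12, 0)
cost = estimate_job_cost(start, start + timedelta(minutes=30))
```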

Problem or idea

Functionality:

Future functionality:

These are my pseudocode notes from when I was mentally modeling the requirements:


def get_jobs_from_query(experiments, samples, dataset, platform, sample_size):
    """Get all associated jobs for the requested estimation. You can pass in either a specific dataset_id or any combination of the other parameters."""
    # Generate querysets of all the associated jobs that can be fetched through the passed-in data.

    survey_jobs = SurveyJob.objects.filter().distinct()
    # ...

    return survey_jobs, downloader_jobs, processor_jobs

def describe_job(job):
    """Jobs are not polymorphic, but they share the attribute names we need here."""

    return {
        "instance_type": instance_type_from_ram(job.ram_amount),
        "start_time": job.start_time,
        "end_time": job.end_time,
        "duration": (job.end_time - job.start_time).total_seconds(),
        "success": job.success,
        # other attributes that would be helpful for cost breakdown
        "organism": ????,
    }

def map_jobs_to_description(survey_jobs, downloader_jobs, processor_jobs):
    """Map each collection of jobs to its cost-relevant descriptions."""
    survey_descriptions = [describe_job(job) for job in survey_jobs]
    # ...

    return survey_descriptions, downloader_descriptions, processor_descriptions

def instance_type_from_ram(ram_amount):
    # workers ["m5.12xlarge", "m5.16xlarge", "r5.12xlarge", "r5.8xlarge"]
    # return the instance type that corresponds with the ram_amount on the job

def calculate_costs_from_descriptions(job_descriptions):
    # map/reduce and compare against the price list that was downloaded

def fetch_current_aws_price_list():
    # download the current price list from the AWS Price List API
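One way instance_type_from_ram could work, sketched with the worker types from the Terraform list below. The selection rule (smallest instance whose memory fits the job's RAM requirement) is an assumption, not refinebio's actual logic; the memory figures are the published EC2 sizes for these types:

```python
# Memory per instance type (GiB), from the published EC2 specs for the
# worker types in the Terraform config.
INSTANCE_MEMORY_GIB = {
    "m5.12xlarge": 192,
    "m5.16xlarge": 256,
    "r5.8xlarge": 256,
    "r5.12xlarge": 384,
}

def instance_type_from_ram(ram_amount_gib):
    """Return the smallest-fit worker type: the least memory that still holds the job.

    Assumed selection rule; ties (m5.16xlarge vs r5.8xlarge at 256 GiB) break
    arbitrarily by dict order here.
    """
    for instance_type, memory in sorted(INSTANCE_MEMORY_GIB.items(), key=lambda kv: kv[1]):
        if ram_amount_gib <= memory:
            return instance_type
    raise ValueError(f"No worker instance type has {ram_amount_gib} GiB of RAM")
```

In practice the job's ram_amount may be stored in MB rather than GiB, so a unit conversion would be needed before the lookup.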

Instance types based on the Terraform config:


# workers ["m5.12xlarge", "m5.16xlarge", "r5.12xlarge", "r5.8xlarge"]
# smasher ["m5.2xlarge", "r5.xlarge"]
# compendia ["x1.16xlarge"]
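Given job descriptions and a price table, the cost roll-up is a duration-weighted sum. A sketch of calculate_costs_from_descriptions under that assumption — the hourly rates here are illustrative stand-ins for what fetch_current_aws_price_list would return, and the description keys ("instance_type", "duration" in seconds) are assumed:

```python
# Illustrative on-demand hourly rates (USD); real numbers should come from
# the downloaded AWS price list, not be hardcoded.
ILLUSTRATIVE_RATES = {
    "m5.12xlarge": 2.304,
    "m5.16xlarge": 3.072,
    "r5.8xlarge": 2.016,
    "r5.12xlarge": 3.024,
}

def calculate_costs_from_descriptions(job_descriptions, rates=None):
    """Sum duration-weighted hourly rates across all job descriptions."""
    rates = rates or ILLUSTRATIVE_RATES
    total = 0.0
    for description in job_descriptions:
        hourly_rate = rates[description["instance_type"]]
        total += (description["duration"] / 3600) * hourly_rate
    return total
```

This ignores failed-then-retried jobs at higher RAM tiers, which is exactly the kind of history the describe_job output (via the success field) would let us account for.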

Solution or next step