AlexsLemonade / refinebio

Refine.bio harmonizes petabytes of publicly available biological data into ready-to-use datasets for cancer researchers and AI/ML scientists.
https://www.refine.bio/

Refinebio Processing Cost Estimation Tool #3198

Open davidsmejia opened 1 year ago


Context

We want to be able to predict with some accuracy how much it costs to process any given type of experiment / sample on the refinebio pipeline.

Since we can't know exactly what it would cost to run something we haven't run before, we should be able to use similar historical data to estimate what a potential new experiment / set of experiments would cost to survey and process.

I see this tool being usable in two major ways. First, we could predict how much we would spend to process a requested experiment or collection of experiments. Second, we could explain the processing cost (ignoring human costs) that went into a user's requested dataset.
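At its core, the estimate for a single job reduces to wall-clock duration times the hourly rate of the instance it ran on. A minimal sketch of that idea (the function name and the rate here are illustrative, not refinebio code; real rates would come from the AWS price list):

```python
from datetime import datetime, timedelta

# Illustrative on-demand hourly rate (USD); a real estimate would look this up
# per instance type from the downloaded AWS price list.
HOURLY_RATE_USD = 2.304  # e.g. roughly an m5.12xlarge in us-east-1

def estimate_job_cost(start_time: datetime, end_time: datetime,
                      hourly_rate: float = HOURLY_RATE_USD) -> float:
    """Cost of one job: wall-clock duration in hours times the instance's hourly rate."""
    hours = (end_time - start_time).total_seconds() / 3600
    return hours * hourly_rate

# A 30-minute job at $2.304/hr costs about $1.15.
start = datetime(2023, 1, 1, 12, 0)
cost = estimate_job_cost(start, start + timedelta(minutes=30))
```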

Problem or idea

Functionality:

Future functionality:

These are my pseudocode notes from when I was mentally modeling the requirements:


def get_jobs_from_query(experiments, samples, dataset, platform, sample_size):
    """Get all associated jobs for the requested estimation. You can pass in either a specific dataset_id or any combination of the other parameters."""
    # Generate querysets of all the associated jobs that can be fetched through the passed-in data.

    survey_jobs = SurveyJob.objects.filter().distinct()
    # ...

    return survey_jobs, downloader_jobs, processor_jobs

def describe_job(job):
    """Jobs are not polymorphic, but they share the attribute names we need here."""

    return {
        "instance_type": instance_type_from_ram(job.ram_amount),
        "start_time": job.start_time,
        "end_time": job.end_time,
        "duration": (job.end_time - job.start_time).total_seconds(),
        "success": job.success,
        # other attributes that would be helpful for cost breakdown
        "organism": ????,
    }

def map_jobs_to_description(survey_jobs, downloader_jobs, processor_jobs):
    """Map each collection of jobs to its cost-relevant descriptions."""
    survey_descriptions = [describe_job(job) for job in survey_jobs]
    # ...

    return survey_descriptions, downloader_descriptions, processor_descriptions

def instance_type_from_ram(ram_amount):
    # workers ["m5.12xlarge", "m5.16xlarge", "r5.12xlarge", "r5.8xlarge"]
    # return the instance type that corresponds with the ram_amount on the job

def calculate_costs_from_descriptions(job_descriptions):
    # map/reduce and compare against the price list that was downloaded

def fetch_current_aws_price_list():
    # download the current price list from the AWS Price List API
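One way instance_type_from_ram could work, sketched with the worker types from the Terraform list below. The selection rule (smallest instance whose memory fits the job's RAM requirement) is an assumption, not refinebio's actual logic; the memory figures are the published EC2 sizes for these types:

```python
# Memory per instance type (GiB), from the published EC2 specs for the
# worker types in the Terraform config.
INSTANCE_MEMORY_GIB = {
    "m5.12xlarge": 192,
    "m5.16xlarge": 256,
    "r5.8xlarge": 256,
    "r5.12xlarge": 384,
}

def instance_type_from_ram(ram_amount_gib):
    """Return the smallest-fit worker type: the least memory that still holds the job.

    Assumed selection rule; ties (m5.16xlarge vs r5.8xlarge at 256 GiB) break
    arbitrarily by dict order here.
    """
    for instance_type, memory in sorted(INSTANCE_MEMORY_GIB.items(), key=lambda kv: kv[1]):
        if ram_amount_gib <= memory:
            return instance_type
    raise ValueError(f"No worker instance type has {ram_amount_gib} GiB of RAM")
```

In practice the job's ram_amount may be stored in MB rather than GiB, so a unit conversion would be needed before the lookup.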

Instance types based on the Terraform config:


# workers ["m5.12xlarge", "m5.16xlarge", "r5.12xlarge", "r5.8xlarge"]
# smasher ["m5.2xlarge", "r5.xlarge"]
# compendia ["x1.16xlarge"]
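Given job descriptions and a price table, the cost roll-up is a duration-weighted sum. A sketch of calculate_costs_from_descriptions under that assumption — the hourly rates here are illustrative stand-ins for what fetch_current_aws_price_list would return, and the description keys ("instance_type", "duration" in seconds) are assumed:

```python
# Illustrative on-demand hourly rates (USD); real numbers should come from
# the downloaded AWS price list, not be hardcoded.
ILLUSTRATIVE_RATES = {
    "m5.12xlarge": 2.304,
    "m5.16xlarge": 3.072,
    "r5.8xlarge": 2.016,
    "r5.12xlarge": 3.024,
}

def calculate_costs_from_descriptions(job_descriptions, rates=None):
    """Sum duration-weighted hourly rates across all job descriptions."""
    rates = rates or ILLUSTRATIVE_RATES
    total = 0.0
    for description in job_descriptions:
        hourly_rate = rates[description["instance_type"]]
        total += (description["duration"] / 3600) * hourly_rate
    return total
```

This ignores failed-then-retried jobs at higher RAM tiers, which is exactly the kind of history the describe_job output (via the success field) would let us account for.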

Solution or next step