dms-vep / dms-vep-pipeline-3

Pipeline for analyzing deep mutational scanning (DMS) of viral entry proteins (VEPs)

Issue setting `barcode_runs` to `null` #64

Closed WillHannon-MCB closed 1 year ago

WillHannon-MCB commented 1 year ago

I don't know how common this would be, but it doesn't seem like you can actually set `barcode_runs` to `null` in the config if you only want to build variants. There are two (really one) reasons for this:

  1. You can't use the `.dt` accessor on the `date` column of an empty `pd.DataFrame`, since that column isn't datetime-like:
AttributeError in file /fh/fast/bloom_j/computational_notebooks/whannon/2023/dms-vep-pipeline-3/Snakefile, line 50:
Can only use .dt accessor with datetimelike values

You could fix this by wrapping the following code in some condition that checks if the barcode_runs are provided:

if len(barcode_runs) > 0: # <--- 
  # make sure barcode run samples start with <library>-<YYMMDD>-
  sample_prefix = barcode_runs.assign(
      prefix=lambda x: (
          x["library"].astype(str) + "-" + x["date"].dt.strftime("%y%m%d") + "-"
      ),
      has_prefix=lambda x: x.apply(
          lambda r: r["sample"].startswith(r["prefix"]),
          axis=1,
      ),
  ).query("not has_prefix")
  if len(sample_prefix):
      raise ValueError(f"Some barcode run samples lack correct prefix:\n{sample_prefix}")

  # dicts mapping sample to library or date as string
  sample_to_library = barcode_runs.set_index("sample")["library"].to_dict()
  sample_to_date = (
      barcode_runs.assign(date_str=lambda x: x["date"].dt.strftime("%Y-%m-%d"))
      .set_index("sample")["date_str"]
      .to_dict()
  )
  2. This is probably less relevant, but you'll run into an error if you forget to exclude any extra analyses that require the barcode runs downstream:
func_effects_config: data/func_effects_config.yml  # Functional effects of mutations
antibody_escape_config: data/antibody_escape_config.yml  # escape assays (eg, antibodies)
summaries_config: data/summaries_config.yml  # Summaries across assays

Maybe it's worth also wrapping these in some kind of conditional based on the existence of barcode runs?

if len(barcode_runs) > 0: # <--- 
  # include additional rule sets if they have configs defined
  for rule_set in ["func_effects", "antibody_escape", "summaries"]:
      rule_set_config = f"{rule_set}_config"
      if (rule_set_config in config) and (config[rule_set_config] is not None):

          include: f"{rule_set}.smk"

Or, maybe it's better to leave this up to the user just in case you'd have analyses defined here that don't need the barcode runs?
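
One thing to watch with the first guard: if the whole block is skipped, `sample_to_library` and `sample_to_date` are never defined, so anything downstream that reads them would presumably hit a `NameError`. Here is a minimal sketch of the same check as a standalone function that returns empty dicts when there are no barcode runs. The helper name `sample_maps` is hypothetical (it is not in the pipeline), and I'm assuming the empty case is an empty `pd.DataFrame` with `library`, `date`, and `sample` columns.

```python
import pandas as pd


def sample_maps(barcode_runs: pd.DataFrame):
    """Return (sample_to_library, sample_to_date) for the barcode runs.

    Hypothetical helper, not part of the pipeline. With no barcode runs it
    returns two empty dicts, so both names stay defined downstream.
    """
    if len(barcode_runs) == 0:
        # An empty frame built without dtypes has no datetime column,
        # so calling `.dt` on "date" would raise AttributeError here.
        return {}, {}

    # Assumption: "date" is either already datetime or parseable as one.
    dates = pd.to_datetime(barcode_runs["date"])

    # Each sample name must start with <library>-<YYMMDD>-
    prefix = (
        barcode_runs["library"].astype(str) + "-" + dates.dt.strftime("%y%m%d") + "-"
    )
    has_prefix = [
        str(sample).startswith(p) for sample, p in zip(barcode_runs["sample"], prefix)
    ]
    bad = barcode_runs[[not ok for ok in has_prefix]]
    if len(bad):
        raise ValueError(f"Some barcode run samples lack correct prefix:\n{bad}")

    # dicts mapping sample to library or to date as a string
    sample_to_library = barcode_runs.set_index("sample")["library"].to_dict()
    sample_to_date = dict(zip(barcode_runs["sample"], dates.dt.strftime("%Y-%m-%d")))
    return sample_to_library, sample_to_date
```

In the Snakefile this could be called as `sample_to_library, sample_to_date = sample_maps(barcode_runs)`, which keeps the prefix check for real runs and skips it when `barcode_runs` is empty.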

jbloom commented 1 year ago

Sounds good, do you want to submit a pull request on this?