diskin-lab-chop / AutoGVP


Feature request: optimize modifying of ClinVar vcf file #132

Closed: rjcorb closed this issue 1 year ago

rjcorb commented 1 year ago

Purpose/implementation Section

Briefly describe the feature and provide meaningful references

The following lines of code can be sped up by replacing grepl() with stringr::str_detect():

clinvar_anno_vcf_df <- vcf_input %>%
  dplyr::mutate(
    vcf_id = str_remove_all(paste(CHROM, "-", START, "-", REF, "-", ALT), " "),
    vcf_id = str_replace_all(vcf_id, "chr", ""),
    # add star annotations to clinVar results table based on filters // ## default version
    Stars = ifelse(grepl("CLNREVSTAT\\=criteria_provided,_single_submitter", INFO), "1",
                   ifelse(grepl("CLNREVSTAT\\=criteria_provided,_multiple_submitters", INFO), "2",
                          ifelse(grepl("CLNREVSTAT\\=reviewed_by_expert_panel", INFO), "3",
                                 ifelse(grepl("CLNREVSTAT\\=practice_guideline", INFO), "4",
                                        ifelse(grepl("CLNREVSTAT\\=criteria_provided,_conflicting_interpretations", INFO), "1NR", "0")
                                 )
                          )
                   )
    ),
    ## extract the calls and put in own column
    final_call = str_match(INFO, "CLNSIG\\=(\\w+)([\\|\\/]\\w+)*\\;")[, 2]
  )

When tested on a PBTA data set, this replacement reduced the run time of this code chunk by 72%.
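For illustration, the replacement could look roughly like the sketch below. It covers only the Stars column. The toy `vcf_input` rows are made up for the example, and swapping the nested ifelse() for dplyr::case_when() is a readability choice made here, not something this issue proposes; only the grepl() to str_detect() change comes from the request above. The patterns and their order are kept the same as in the original chunk so that the first matching review status wins.

```r
library(dplyr)
library(stringr)

# Hypothetical toy input; in the real script vcf_input is read from the ClinVar VCF.
vcf_input <- tibble::tibble(
  INFO = c(
    "ALLELEID=1;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Pathogenic;",
    "ALLELEID=2;CLNREVSTAT=reviewed_by_expert_panel;CLNSIG=Benign;",
    "ALLELEID=3;CLNREVSTAT=criteria_provided,_conflicting_interpretations;CLNSIG=Conflicting;",
    "ALLELEID=4;CLNREVSTAT=no_assertion_criteria_provided;CLNSIG=Uncertain_significance;"
  )
)

clinvar_anno_vcf_df <- vcf_input %>%
  dplyr::mutate(
    # case_when() evaluates conditions top to bottom, like the nested ifelse() chain
    Stars = dplyr::case_when(
      str_detect(INFO, "CLNREVSTAT=criteria_provided,_single_submitter") ~ "1",
      str_detect(INFO, "CLNREVSTAT=criteria_provided,_multiple_submitters") ~ "2",
      str_detect(INFO, "CLNREVSTAT=reviewed_by_expert_panel") ~ "3",
      str_detect(INFO, "CLNREVSTAT=practice_guideline") ~ "4",
      str_detect(INFO, "CLNREVSTAT=criteria_provided,_conflicting_interpretations") ~ "1NR",
      TRUE ~ "0" # no recognised review status
    )
  )
```

Comparing the resulting Stars column against the output of the original grepl() version on the same input is a simple way to confirm the change does not alter the annotations.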

What input data are required for this feature or analysis?

The modified script will be tested on PBTA and custom test files.

How do you plan to organise the feature or analysis - will it be a multi-function call or add to existing functions?

This only modifies the code chunk above within the existing script.

Who will complete the feature request (please add a GitHub handle here if relevant)?

@rjcorb