PejLab / aFCn

Apache License 2.0

Modified scripts to be agnostic to input variantID format #3

Open dtaylo95 opened 11 months ago

dtaylo95 commented 11 months ago

As is, the tool requires that variant IDs (in both the input --eqtl file and the --vcf file) be formatted as <chrom>_<pos>.... As far as I can tell, there are two reasons for this:

  1. It allows the program to parse the variant's position from its ID.
  2. It meets the formatting requirements used by lmfit in the fitting step.

I am proposing changes that make the tool agnostic to the format of the variant IDs (I can imagine some users have VCFs that use dbSNP rsIDs, for example). Briefly, the changes are as follows:

  1. The --eqtl file must now include two additional columns, variant_chr and variant_pos, that give the chromosome and (1-based) position of each variant. This information is then used to fetch the genotypes from the tabix-indexed VCF.
  2. Variants are assigned unique temporary IDs (a new variant_id_clean column) that meet the formatting requirements of lmfit and are used when fitting the model.
  3. I've also updated the gene_id_clean functionality to match that of the new variant_id_clean column. This assumes no specific formatting of the input gene IDs.
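
To illustrate changes 1 and 2, here is a minimal sketch and not the code in this PR. The helper names `assign_clean_ids` and `tabix_region` and the `var_` prefix are hypothetical, chosen only for illustration. It assumes that lmfit's requirement amounts to parameter names being valid Python identifiers, and that the VCF is queried with a standard 1-based, inclusive `chrom:start-end` region string.

```python
def assign_clean_ids(raw_ids, prefix="var"):
    """Map each distinct raw ID to a temporary identifier (var_0, var_1, ...).

    The raw ID format is never inspected, so rsIDs, chrom:pos:ref:alt
    strings, or anything else are handled the same way.
    """
    mapping = {}
    for raw in raw_ids:
        if raw not in mapping:
            mapping[raw] = f"{prefix}_{len(mapping)}"
    return mapping


def tabix_region(chrom, pos):
    """Build a 1-based, inclusive region string covering a single position."""
    return f"{chrom}:{pos}-{pos}"


# Hypothetical rows from an --eqtl file with the two new columns.
eqtls = [
    {"variant_id": "rs12345", "variant_chr": "chr1", "variant_pos": 10177},
    {"variant_id": "1:20000:A:G", "variant_chr": "chr1", "variant_pos": 20000},
    {"variant_id": "rs12345", "variant_chr": "chr1", "variant_pos": 10177},
]

# Assign the temporary IDs and attach them as a variant_id_clean column.
clean = assign_clean_ids(row["variant_id"] for row in eqtls)
for row in eqtls:
    row["variant_id_clean"] = clean[row["variant_id"]]

# The position columns, not the ID, determine where to look in the VCF.
regions = [tabix_region(r["variant_chr"], r["variant_pos"]) for r in eqtls]
```

Repeated variants map to the same temporary ID, and the original `variant_id` stays in each row so results can be reported under the user's own identifiers. The same mapping approach would apply to gene IDs for change 3.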