Atkinson-Lab / Tractor

Scripts for implementing the Tractor pipeline
MIT License

Question regarding VCFs and issue with the R script #36

Open EfraMP opened 3 months ago

EfraMP commented 3 months ago

I'm trying to perform a 3-way model.

Is the R script limited to a maximum number of samples/features? After running

Rscript path/Tractor/scripts/run_tractor.R  \
    --hapdose path/chr21/chr21.annotated.LiftOver.dose \
    --phe path/phe.txt \
    --method logistic --out sumstats.tsv

I get the following output:

Tractor Script Version: 1.1.0 
Loading required package: optparse
Running Tractor... 
Error in data[[1]] : subscript out of bounds
Calls: RunTractor
Execution halted

After a quick look, it's clear that the issue is with the hapcount/dosage files, but I'm unsure exactly what's going on.

Separately, I would like to use PLINK to test alternative models using the VCFs produced by extract_tracts.py. According to your published paper, I would need to run three separate GWAS and then perform a meta-analysis, although I'm unsure about one thing.

As far as I understand, in this case the complete model would be

$$\text{logit}(p) = \beta_0 + \beta_{LA1} LA_1 + \beta_{LA2} LA_2 + \beta_{G0} G_0 + \beta_{G1} G_1 + \beta_{G2} G_2$$
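
For concreteness, I picture fitting something like this per SNP (a rough sketch with made-up column names, not your run_tractor.R code):

    # Hypothetical per-SNP fit of the 3-way joint model above.
    # y: binary phenotype; la1, la2: haplotype counts for ancestries 1 and 2
    # (ancestry 0 is the reference category); g0, g1, g2: allele dosages carried
    # on each ancestral background. snp_df is an assumed data.frame.
    fit <- glm(y ~ la1 + la2 + g0 + g1 + g2,
               family = binomial(link = "logit"),
               data = snp_df)
    summary(fit)$coefficients   # betas, SEs, and p-values per term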

If, say, I wanted to test the deconvolved VCF from ancestry 0, would you recommend using the haplotype counts from the other two ancestries as covariates? Why or why not? In your wiki you ignore the counts. Why? It also intrigues me that you are not using principal components as covariates.

Thanks.

kscott-1 commented 3 months ago

I won't weigh in on the statistics questions you've addressed to the authors. However, for your issues running run_tractor.R, it is likely that you are encountering file/column naming problems. I suggest you check out the code in the script.

Basically, the script hard-codes a search for the file names output by extract_tracts.py (lines 60-62):

hapFiles = Sys.glob(paste0(prefix, ".*.hapcount.txt"))
doseFiles = Sys.glob(paste0(prefix, ".*.dosage.txt"))
inFiles = c(hapFiles, doseFiles)

['prefix.anc0.hapcount.txt', 'prefix.anc0.dosage.txt', ... , 'prefix.ancN.hapcount.txt', 'prefix.ancN.dosage.txt'] becomes the list of files the program uses to build the matrices that are passed to the lm/glm at each SNP.
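
If you want to check whether those globs are even finding your files, something like this quick sketch (not the script's actual flow) should tell you:

    library(data.table)

    prefix <- "path/chr21/chr21.annotated.LiftOver.dose"   # your --hapdose prefix

    hapFiles  <- Sys.glob(paste0(prefix, ".*.hapcount.txt"))
    doseFiles <- Sys.glob(paste0(prefix, ".*.dosage.txt"))

    # If either glob comes back empty (wrong prefix, renamed or gzipped files),
    # indexing the resulting empty list is the kind of thing that throws
    # "subscript out of bounds".
    stopifnot(length(hapFiles) > 0, length(doseFiles) > 0)

    hapList  <- lapply(hapFiles,  fread)   # one hapcount table per ancestry
    doseList <- lapply(doseFiles, fread)   # one dosage table per ancestry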

There are, in my opinion at least, other issues with the R script that I am trying to solve, and I have a couple of PRs open for some of this. In particular:

- The phenotype file only accepts numeric variables (a problem for sex, even if it is coded as 0/1 or 1/2).
- "IID" and "y" are hard-coded as the required variable names.
- The phenotype file cannot have 'unused' columns.
- The samples in the hapdose files must match the phenotype samples exactly (so there is no easy way to apply, say, different age cutoffs).

So yeah, just be careful and look through the code to understand exactly what you're running.
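
For what it's worth, a rough pre-flight check along those lines could look like this (check_pheno is just a hypothetical helper I'm sketching, not part of Tractor; hapdose_samples would be the sample IDs from your hapcount/dosage headers):

    # Hypothetical sanity check for a Tractor-style phenotype file, reflecting
    # the constraints above: required "IID" and "y" columns, numeric covariates,
    # and samples that match the hapcount/dosage files.
    check_pheno <- function(phe_path, hapdose_samples) {
      phe <- read.table(phe_path, header = TRUE, stringsAsFactors = FALSE)
      stopifnot(all(c("IID", "y") %in% colnames(phe)))     # hard-coded names
      num_ok <- sapply(phe[setdiff(colnames(phe), "IID")], is.numeric)
      if (!all(num_ok))
        stop("Non-numeric phenotype/covariate columns: ",
             paste(names(num_ok)[!num_ok], collapse = ", "))
      missing <- setdiff(phe$IID, hapdose_samples)
      if (length(missing) > 0)
        stop("Phenotype samples absent from hapcount/dosage files: ",
             paste(head(missing), collapse = ", "))
      invisible(phe)
    }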

I hope that helps.

EfraMP commented 3 months ago

Yes, I know. I looked into the code before opening the issue, but I'm still having issues with it.

I've tried both absolute and relative paths, and it's just not working.

Indeed, the issue was the file naming. It's somewhat counterintuitive that extract_tracts.py can compress its output and then, in the next step, run_tractor.R is unable to read those files. I just saw that you had opened a pull request for this.

Maybe

  # trailing * also matches compressed output, e.g. .hapcount.txt.gz
  hapFiles = Sys.glob(paste0(prefix, ".*.hapcount.txt*"))
  doseFiles = Sys.glob(paste0(prefix, ".*.dosage.txt*"))
  inFiles = c(hapFiles, doseFiles)

would be an easier approach??
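
(If the outputs are gzipped, the trailing * should also pick up the .txt.gz files, and assuming the tables are read with something like data.table::fread, gzipped files can usually be read directly, since fread only needs the R.utils package for .gz. That's an assumption about how the script reads the files, though, and something else might still hard-code the plain .txt suffix.)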

eatkinson commented 1 month ago

Hi!

Sorry about the delay in responding. Agreed about checking that your files can be matched up properly; that could be the issue. Other common problems folks have had with their input data that you should double-check:

- Your VCF file and your msp file cover the same regions and come from the same input VCF that was painted.
- The naming convention is identical for chromosomes, variants, and samples between the two.
- Your data is phased, that is, all genotypes use | rather than / (a quick check is sketched below).

The model should not be limited by the number of variants/samples/covariates, though of course it will take longer to run the bigger your dataset is.
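
If it helps, a quick heuristic for spotting unphased genotypes is something like this sketch (it assumes an uncompressed VCF and only roughly pattern-matches the genotype separators, so treat it as a first pass):

    # Rough check for unphased genotypes in a VCF (sketch only).
    vcf <- readLines("chr21.vcf", n = 50000)          # assumed file name; inspect the first chunk
    gt  <- vcf[!startsWith(vcf, "#")]                 # drop header lines
    unphased <- grepl("[0-9.]/[0-9.]", gt)            # e.g. 0/1 instead of 0|1
    cat(sum(unphased), "of", length(gt), "inspected variant lines look unphased\n")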

The scripts were recently freshened up to include some new flags, so you may want to do another pull in case we've accounted for your edge case since this post.

Per your meta-analysis question: yes, you could read the deconvolved genotypes into PLINK, GWAS them there with standard models, and then meta-analyze the results. The caveat is that PLINK will need to fill in the missing values from the deconvolved tracts. You can do this in several ways, none of them ideal though some are better than others (e.g. filling in with the reference allele vs. duplicating the allele), and all will introduce some bias. This is why we strongly recommend the joint model rather than this strategy.
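
If you do go that route despite the caveat, the standard inverse-variance-weighted fixed-effects combination of the per-ancestry results for a given SNP looks roughly like this (a sketch; beta and se are the effect sizes and standard errors from your three GWAS runs):

    # Inverse-variance-weighted (fixed-effects) meta-analysis of one SNP's
    # effect estimates from the per-ancestry GWAS runs.
    ivw_meta <- function(beta, se) {
      w         <- 1 / se^2                     # inverse-variance weights
      beta_meta <- sum(w * beta) / sum(w)
      se_meta   <- sqrt(1 / sum(w))
      z         <- beta_meta / se_meta
      c(beta = beta_meta, se = se_meta, z = z, p = 2 * pnorm(-abs(z)))
    }

    # e.g. ivw_meta(beta = c(0.12, 0.08, 0.15), se = c(0.05, 0.07, 0.06))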

Per covariates: a metric of global ancestry should be included in the model. Including the global proportions directly is preferred, but PCs can be used as a proxy. We do include PCs in all results in the Tractor manuscript. A demonstration of how to include PCs in Hail can be seen in the Hail tutorial, or you can include them in your covariates file for the R implementation.
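
If you want to use global proportions directly, one rough way to approximate them per sample from the hapcount files is sketched below (it assumes the first few columns are variant metadata and the remaining columns are one 0/1/2 hapcount per sample, so adjust to your actual layout):

    # Approximate each sample's global proportion of one ancestry as the mean
    # local-ancestry haplotype count across variants, divided by 2.
    library(data.table)
    hap <- fread("prefix.anc0.hapcount.txt")          # one ancestry's hapcount file (name assumed)
    sample_cols <- names(hap)[-(1:5)]                 # assumed: first 5 columns are variant metadata
    global_anc0 <- colMeans(as.matrix(hap[, ..sample_cols])) / 2
    # global_anc0 is a named vector (one value per sample) for the covariates file.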

Not sure I fully understood your hapcounts question, but you need N-1 terms for ancestral dosage, irrespective of genotype, at each spot.

Best, Elizabeth