choishingwan / PRSice

A software package for calculating, applying, evaluating and plotting the results of polygenic risk scores
http://prsice.info
GNU General Public License v3.0

File check up-front #14

Closed choishingwan closed 5 years ago

choishingwan commented 7 years ago

Might want to check all the file inputs at the very beginning (especially the covariate file). It is rather annoying that the program errors out only after clumping and other procedures have already run.

choishingwan commented 7 years ago

Partly completed, except that the phenotype + covariate check still runs after clumping.

Might want to add a flag to the genotype class to indicate whether clumping and sorting have been done. Then we can shift the phenotype + covariate checking up-front.

But then, when there are multiple phenotypes, we might want to check all of them up-front? That'd require more coding.
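
A minimal sketch of the flag idea, in Python for brevity. The `Genotype` class, its attribute names, and the placeholder "clumping" (plain de-duplication) are all hypothetical and do not reflect PRSice's actual code. The point is only that once the object records what has been done, a caller can validate inputs first and still avoid repeating the expensive steps.

```python
class Genotype:
    """Hypothetical stand-in for a genotype container (not PRSice's class)."""

    def __init__(self, snps):
        self.snps = list(snps)
        self.is_sorted = False   # set once sort() has run
        self.is_clumped = False  # set once clump() has run

    def sort(self):
        self.snps.sort()
        self.is_sorted = True

    def clump(self):
        # Placeholder for real clumping: only drops duplicate SNP IDs.
        if not self.is_sorted:
            self.sort()
        self.snps = list(dict.fromkeys(self.snps))
        self.is_clumped = True


def prepare(genotype):
    """Run the expensive steps only if the flags say they are still pending."""
    if not genotype.is_clumped:
        genotype.clump()
    return genotype
```

With such flags, the phenotype + covariate check could run before `prepare()`, and calling `prepare()` again for a second phenotype would be a no-op.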

choishingwan commented 5 years ago

PRSice now reads its input files in the following order:

  1. Check header of covariate file
  2. Base Summary statistics
  3. Target Sample Information
  4. Target SNP information
  5. Reference Sample Information
  6. Reference SNP information
  7. Calculate MAF in target (and do filtering)
  8. Calculate MAF in reference (and do filtering)
  9. Read in region files (GTF, MSigDB, BED if used)
  10. Check header of phenotype file
  11. Clumping
  12. Process phenotype and covariate file
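
The header checks in steps 1 and 10 are cheap because they only need the first line of a file. As a rough illustration (not PRSice's implementation), a header check could look like the sketch below. The function name `check_header` and the column names are assumptions made for the example.

```python
def check_header(header_line, required_columns):
    """Return the header columns, or raise if a required column is absent.

    Hypothetical helper: assumes a whitespace-delimited header line.
    """
    columns = header_line.split()
    missing = [name for name in required_columns if name not in columns]
    if missing:
        raise ValueError("missing column(s): " + ", ".join(missing))
    return columns


# Example with made-up covariate names.
columns = check_header("FID IID Age Sex PC1", ["Age", "Sex"])

try:
    check_header("FID IID Age", ["Age", "Sex"])
    header_ok = True
except ValueError:
    header_ok = False  # fails fast, before any expensive step would run
```

Raising on the first ill-formed header matches the fail-fast behaviour described in this thread, while leaving the full row-by-row processing for later.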

PRSice will terminate at whichever step it finds an ill-formed input file. Note that the phenotype and covariate checks come after clumping because we want to accommodate multiple phenotype inputs, which can lead to different covariate inclusion for each phenotype (e.g. due to phenotype NA). While we could move these checks up front, checking every phenotype-covariate combination would be time consuming and not very practical, so we decided against it.

Users should try their best to ensure their input is correct. If they are uncertain and would like to use PRSice to test it, a good way is a "dry run" of PRSice with the --no-clump option. Alternatively, run PRSice with the --print-snp option: if PRSice then fails in the phenotype and covariate check, users can re-run it without repeating the full clumping by using --extract PRSice.snp (assuming --out PRSice was used in the first pass).
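
To illustrate why the check depends on the phenotype, here is a small sketch with made-up data. The function `usable_samples` and the dictionary layout are assumptions for the example, not PRSice internals. It shows that the set of samples with complete phenotype and covariate data can differ from one phenotype to the next, so a single up-front check cannot cover all of them.

```python
def usable_samples(phenotype, covariates):
    """Samples with a non-missing phenotype and no missing covariate value."""
    return sorted(
        sample
        for sample, value in phenotype.items()
        if value is not None
        and sample in covariates
        and all(v is not None for v in covariates[sample].values())
    )


# Toy data: None marks a missing (NA) value.
covariates = {
    "s1": {"age": 40, "sex": 1},
    "s2": {"age": None, "sex": 0},
    "s3": {"age": 55, "sex": 1},
}
phenotypes = {
    "height": {"s1": 1.70, "s2": 1.62, "s3": None},
    "bmi": {"s1": None, "s2": 24.1, "s3": 27.3},
}

# One sample set per phenotype; each would need its own covariate check.
sample_sets = {
    name: usable_samples(values, covariates)
    for name, values in phenotypes.items()
}
```

In this toy case the two phenotypes end up with disjoint sample sets, so the covariate data that matter differ for each of them.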