atcg / clustOpt

Scripts for the optimization of RADseq clustering thresholds in population genetics
MIT License
9 stars 0 forks source link

Error with vcfMissingness.pl #1

Closed btmartin721 closed 5 years ago

btmartin721 commented 5 years ago

Hello,

I am having an issue with vcfMissingness.pl. Specifically, after running for almost exactly 2 hours 40 minutes, I get the below error with one of the R scripts that is called from the perl script. vcfMissingness.pl generates the .012, .012.indv, .012.pos, .missingness files, and minimum/max pairwise missingness values fine, but then gives the below error when it gets to: Parsing "c88.vcf" ...

It says that for the first vcf file then it crashes.

At first, I thought there was an issue with the R script recognizing the path to my VCF files. I was specifying the correct paths to them in my vcflist.txt file (e.g., /home/USER/clustOpt/vcf/c88.vcf) with one file per line. I then tried copying the VCF files to both the clustOpt and clustOpt/resources directories and tried running it again but got the same error. For some reason it's not reading the VCF files correctly. The VCF files were all made with ipyrad v. 0.7.30.

The ipyrad VCF file is a little different than the original pyRAD's. clustOpt is compatible with ipyrad's VCF files, right?

If so, do you have any insight as to what might be wrong here? All the dependencies that are listed in your README.md are installed as far as I know.

Thanks for your time.

Here's the error:

Read 8 items Error in snpgdsVCF2GDS(vcfFiles[i], gdsOut) : LENGTH or similar applied to NULL object Execution halted Error in snpgdsVCF2GDS(vcfFiles[i], gdsOut, verbose = F) : LENGTH or similar applied to NULL object Execution halted Error in snpgdsVCF2GDS(vcfFiles[i], gdsOut, verbose = F) : LENGTH or similar applied to NULL object Execution halted sending incremental file list

atcg commented 5 years ago

Hi Bradley,

That is an error with SNPRelate (not clustOpt). What version of SNPRelate are you using?

btmartin721 commented 5 years ago

Hi. Thank you for your reply. I am using SNPRelate_1.6.6 in R 3.5.1 on CentOS 6.

I installed most of the dependencies, including SNPRelate, throught conda. I am working on a HPC cluster with CentOS 6 and don't have root privileges. I have found that using conda to install R packages is much easier than doing so through R directly. Using install.packages() often leads to dependency errors. However, it's certainly possible that conda may have installed a different version of SNPRelate than what clustOpt was tested with and/or that it created issues.

-Bradley

atcg commented 5 years ago

This appears to be a known issue of SNPRelate v1.6. The author says it was fixed with the Dec 2017 release (v1.12.1) (see the release notes here: https://github.com/zhengxwen/SNPRelate/releases/tag/v1.12.1).

I recommend upgrading SNPRelate and seeing if the problem persists.

btmartin721 commented 5 years ago

Hi, I tried using another version of SNPRelate: SNPRelate_1.16.0

This time I installed it and all dependencies through R directly, and not through conda. After running clustOpt again, it got farther but I received a different error:

Error in is.matrix(dist) | inherits(dist, "snpgdsDissClass") | inherits(dist, : There is no SNP!

Just above the error, this line was printed to STDOUT: Excluding 43,829 SNPs on non-autosomes

That's how many SNPs are in the whole alignment, so for some reason it filtered out all of them as non-autosomal. Do you have any idea why it would have done that? I'm trying to determine if it's an issue with my dataset, if it's dependency-related, or something else.

Thanks for your time.

-Bradley

Detailed STDOUT:

VCF Format ==> SNP GDS Format Method: exacting biallelic SNPs Number of samples: 250 Parsing "/home/btm002/clustOpt/original_vcf/c70.vcf" ... import 43829 variants.

atcg commented 5 years ago

Ah yes, this is another little peculiarity of SNPRelate. Some versions assume you are working with human chromosomes and will by default ignore scaffolds not named 1-22. You can change this by adding autosome.only=F to the offending SNPRelate calls.

Specifically, line 30 in ibdSlope.R should be changed to read:

ibsMat <- snpgdsIBS(genofile, num.thread=2, verbose=F, autosome.only=F)

And line 141 in resources/missingnessHeatMaps.R should be changed to read:

ibsInter <- snpgdsHCluster(snpgdsIBS(gdsInter, num.thread=2, autosome.only=F))

btmartin721 commented 5 years ago

Thank you. I will run it again with the updated arguments.

btmartin721 commented 5 years ago

Got it to work. Thanks for all your help. A few notes:

  1. After your last comment, I added autosome.only=F and it got further. Another error occurred, but it was a data-related issue. I realized that a few of my individuals had almost all missing data (N=3 to N=15 good SNPs) and it didn't like that. So I removed individuals with >90% missing data, re-ran it, and it worked fine.

  2. In vcfToPCAvarExplained.R I also had to add autosome.only=F to the snpgdsPCA method call.

And voila! I really like your program. Thanks for writing/publishing it.

Best,

-Bradley