emmanuelparadis / pegas

Population and Evolutionary Genetics Analysis System
GNU General Public License v2.0
28 stars 10 forks source link

Problems with LDScan #44

Closed nschan closed 4 years ago

nschan commented 4 years ago

Hi,

I am currently trying to use the pegas LDScan function. The data i am interested in come from the Arabipopsis 1001 genomes (1001genomes.org) initiative and can be downloaded in vcf format, I am getting the following error (or variations, depending on the data):

Error in dim(tmp) <- c(nh, nloc) : dims [product 802] do not match the length of object [797]

Here is the code I am using:

gts <-"742,763,772,854,997,1006,1063,1552,1757,1852,2171,2278,4807,4884,5210,5651,5718,5741,5768,5784,5800,5831,5890,5950,6012,6016,6040,6070,6071,6088,6099,6101,6106,6107,6123,6125,6134,6216,6220,6258,6276,6284,6296,6396,6830,6898,6903,6909,6911,6918,6924,6927,6944,6951,6960,6971,6976,6982,6984,6986,6997,7071,7081,7092,7103,7160,7162,7163,7244,7248,7268,7282,7305,7322,7327,7328,7344,7346,7372,7411,7475,7517,7530,8171,8237,8238,8240,8246,8249,8259,8284,8306,8326,8334,8376,8386,8424,8426,9070,9095,9106,9114,9115,9131,9323,9363,9369,9380,9391,9399,9433,9450,9507,9508,9511,9513,9514,9518,9519,9521,9529,9530,9534,9540,9541,9549,9554,9556,9560,9568,9584,9587,9589,9597,9602,9611,9612,9613,9627,9628,9631,9637,9638,9656,9668,9673,9678,9683,9685,9692,9693,9694,9699,9700,9708,9726,9730,9731,9733,9736,9743,9744,9769,9774,9780,9782,9783,9792,9794,9796,9798,9799,9806,9807,9811,9814,9816,9820,9832,9837,9845,9846,9847,9848,9849,9860,9861,9868,9871,9873,9882,9886,9895,9900,9904,9941,9949,9951,9960,9968,9981,9987,9990,9993,9995,9997,10002,10027,15591,15592"
reg <- "Chr3:6183403..6183803"

debug_tmp <- tempfile()

paste0("http://tools.1001genomes.org/api/v1/vcfsubset/strains/",
                       gts,
                       "/regions/",
                       reg,
                       "/type/fullgenome/format/vcf") %>% httr::GET(.) %>% httr::content() %>% 
  writeLines(., debug_tmp)

debug_vcf_info <- pegas::VCFloci(debug_tmp)

debug_vcf <-pegas::read.vcf(debug_tmp)

names(debug_vcf) <- paste(debug_vcf_info$CHROM, debug_vcf_info$POS, sep = ".")
pegas::LDscan(debug_vcf)

Do you have any idea how I can solve this?

Best Niklas

emmanuelparadis commented 4 years ago

Hi,

I think the writeLines(., debug_tmp) command truncates the VCF file which then cannot be read correctly by VCFloci and other functions in pegas. I guess you should download the whole file, scan it with VCFloci() and read what you need with read.vcf().

Tell me if that works.

Best,

Emmanuel

Thu, 31 Oct 2019 08:19:19 -0700 nschan notifications@github.com:

Hi,

I am currently trying to use the pegas LDScan function. The data i am interested in come from the Arabipopsis 1001 genomes (1001genomes.org) initiative and can be downloaded in vcf format, I am getting the following error (or variations, depending on the data):

Error in dim(tmp) <- c(nh, nloc) : dims [product 802] do not match the length of object [797]

Here is the code I am using:

gts 
<-"742,763,772,854,997,1006,1063,1552,1757,1852,2171,2278,4807,4884,5210,5651,5718,5741,5768,5784,5800,5831,5890,5950,6012,6016,6040,6070,6071,6088,6099,6101,6106,6107,6123,6125,6134,6216,6220,6258,6276,6284,6296,6396,6830,6898,6903,6909,6911,6918,6924,6927,6944,6951,6960,6971,6976,6982,6984,6986,6997,7071,7081,7092,7103,7160,7162,7163,7244,7248,7268,7282,7305,7322,7327,7328,7344,7346,7372,7411,7475,7517,7530,8171,8237,8238,8240,8246,8249,8259,8284,8306,8326,8334,8376,8386,8424,8426,9070,9095,9106,9114,9115,9131,9323,9363,9369,9380,9391,9399,9433,9450,9507,9508,9511,9513,9514,9518,9519,9521,9529,9530,9534,9540,9541,9549,9554,9556,9560,9568,9584,9587,9589,9597,9602,9611,9612,9613,9627,9628,9631,9637,9638,9656,9668,9673,9678,9683,9685,9692,9693,9694,9699,9700,9708,9726,9730,9731,9733,9736,9743,9744,9769,9774,9780,9782,9783,9792,9794,9796,9798,9799,9806,9807,9811,9814,9816,9820,9832,9837,9845,9846,9847,9848,9849,9860,9861,9868,9871,9873,9882,9886,9895,9900,9904,9941,9949,9951,9960,9968
,9981,9987,9990,9993,9995,9997,10002,10027,15591,15592"
reg <- "Chr3:6183403..6183803"

debug_tmp <- tempfile()

paste0("http://tools.1001genomes.org/api/v1/vcfsubset/strains/",
                      gts,
                      "/regions/",
                      reg,
                      "/type/fullgenome/format/vcf") %>% 
httr::GET(.) %>% httr::content() %>% 
 writeLines(., debug_tmp)

debug_vcf_info <- pegas::VCFloci(debug_tmp)

debug_vcf <-pegas::read.vcf(debug_tmp)

names(debug_vcf) <- paste(debug_vcf_info$CHROM, debug_vcf_info$POS, 
sep = ".")
pegas::LDscan(debug_vcf)

Do you have any idea how I can solve this?

Best Niklas

-- You are receiving this because you are subscribed to this thread. Reply to this email directly or view it on GitHub: https://github.com/emmanuelparadis/pegas/issues/44

Pour nous remonter une erreur de filtrage, veuillez vous rendre ici : https://www.security-mail.net/ndc/reporter/mid/1047.5dbafb78.9d5e8.0/r/Emmanuel.Paradis@ird.fr/c/8699def65ed71598e11f9bfb6d9f228282f2b6fc

nschan commented 4 years ago

Hi Emmanuel,

thanks for answering. I don't think the problem is in writeLines, as it still occurs after downloading the file. In addition, a new issue has appeared: VCFloci() somehow seems to have issues allocating memory, and the R session crashes, this seems related to issue #28 and can be fixed by using absolute paths.

emmanuelparadis commented 4 years ago

Hi Niklas, I think this issue (absolute paths; I guess you are under Windows) has already been reported. Is is solved for you now?

emmanuelparadis commented 4 years ago

I've just tried what you want to do but with "more standard" R commands (I'm under Linux Ubuntu):

gts <-"742,763,772,854,997,1006,1063,1552,1757,1852,2171,2278,4807,4884,5210,5651,5718,5741,5768,5784,5800,5831,5890,5950,6012,6016,6040,6070,6071,6088,6099,6101,6106,6107,6123,6125,6134,6216,6220,6258,6276,6284,6296,6396,6830,6898,6903,6909,6911,6918,6924,6927,6944,6951,6960,6971,6976,6982,6984,6986,6997,7071,7081,7092,7103,7160,7162,7163,7244,7248,7268,7282,7305,7322,7327,7328,7344,7346,7372,7411,7475,7517,7530,8171,8237,8238,8240,8246,8249,8259,8284,8306,8326,8334,8376,8386,8424,8426,9070,9095,9106,9114,9115,9131,9323,9363,9369,9380,9391,9399,9433,9450,9507,9508,9511,9513,9514,9518,9519,9521,9529,9530,9534,9540,9541,9549,9554,9556,9560,9568,9584,9587,9589,9597,9602,9611,9612,9613,9627,9628,9631,9637,9638,9656,9668,9673,9678,9683,9685,9692,9693,9694,9699,9700,9708,9726,9730,9731,9733,9736,9743,9744,9769,9774,9780,9782,9783,9792,9794,9796,9798,9799,9806,9807,9811,9814,9816,9820,9832,9837,9845,9846,9847,9848,9849,9860,9861,9868,9871,9873,9882,9886,9895,9900,9904,9941,9949,9951,9960,9968,9981,9987,9990,9993,9995,9997,10002,10027,15591,15592"
reg <- "Chr3:6183403..6183803"
file <- paste0("http://tools.1001genomes.org/api/v1/vcfsubset/strains/",
                       gts,
                       "/regions/",
                       reg,
                       "/type/fullgenome/format/vcf")
download.file(file, "tmp.vcf")
library(pegas)
info <- VCFloci("tmp.vcf")
x <- read.vcf("tmp.vcf")

This works fine:

R> info
      CHROM     POS ID REF ALT QUAL FILTER            INFO   FORMAT
1         3 6183403  .   A   .   40   PASS DP=20138;AN=400 GT:GQ:DP
2         3 6183404  .   T   .   40   PASS DP=20100;AN=402 GT:GQ:DP
3         3 6183405  .   C   .   40   PASS DP=20061;AN=396 GT:GQ:DP
4         3 6183406  .   T   .   40   PASS DP=20132;AN=402 GT:GQ:DP
....                                                               
.....                                                              
398       3 6183800  .   T   .   40   PASS DP=24243;AN=410 GT:GQ:DP
399       3 6183801  .   C   .   40   PASS DP=24412;AN=414 GT:GQ:DP
400       3 6183802  .   A   .   40   PASS DP=24507;AN=410 GT:GQ:DP
401       3 6183803  .   A   .   40   PASS DP=24468;AN=414 GT:GQ:DP
R> x
Allelic data frame: 210 individuals
                    401 loci

I think this should work for you (assuming you use another OS) as long as you work within your current working directory (i.e., you don't play with paths in the file names).

nschan commented 4 years ago

Hi Emmanuel,

the relative path malloc() issue occurred on Debian Linux, here is the top of sessionInfo()

R version 3.6.1 (2019-07-05) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Debian GNU/Linux 9 (stretch)

Reading the file is not the problematic step, i can read the file just fine using my script, the problem occurs when calling LDscan(), if i use your script and then call pegas::LDscan(x) i get:

Scanning haplotypes... Error in dim(tmp) <- c(nh, nloc) : dims [product 802] do not match the length of object [797]

Best Niklas

emmanuelparadis commented 4 years ago

You're right: the crash was because VCFloci did not call path.expand (whereas read.vcf does). I've fixed that and pushed pegas 0.12-1 here.

For the second issue, I think it's because there are many missing data in the VCF file. You can already see that from the above output of VCFloci and it's clear when looking at the allele and genotype frequencies:

R> summary(x)[1]
$.
$.$genotype
A|A ./. 
200  10