bmansfeld / QTLseqr

QTLseqr is an R package for QTL mapping using NGS Bulk Segregant Analysis
65 stars 42 forks source link

runQTLseqAnalysis error #55

Closed WallyL closed 9 months ago

WallyL commented 1 year ago

Hi Ben,

First, I realize this is a bit out there, but I thought I'd ask anyway and see if you might know why I see these errors.

I have installed QTLseqr and successfully processed the test rice dataset from the vignette. Everything seems to work.

I'm waiting on data for a large BSA-Seq expt; however, in the meantime, I had some old RNA-Seq data generated in horse that I had called SNPs on, so I thought I would just try to run that while waiting for the plant data, e.g. going through mapping, GATK, and VariantsToTable to make sure the front end was ready (that all worked). Obviously, this is not a BSA-Seq dataset, but I had previously found some putative, disease causative SNPs so, what the hell, I was curious if QTLSeq would identify them and point to the correct chromosomes.

There are only 6 individuals in the two groups, CTL and EXP (diseased). I pooled the individual fastqs for each group (~45Gb each) mapped with BWA and ran GATK (ver. 4.3)... per best practices, except I did not run BQSR. After converting VCF to table, I removed 1000's of small contigs and the X chromosome, keeping the 31 autosomes.

The graphics for the read depth, SNP index, Ref freq, etc... all look a bit weird, as I would expect given the input data. However, everything seemed to work okay, until runQTLseqAnalysis died. Below is the code used and following that is the error and one of the 16 warnings (I saw from an earlier issue here that the warnings I get are not likely the problem).

I'm really just curious if these errors are due to the fact that I'm trying to shoehorn a fake BSA-Seq dataset, or are they perhaps due to some other issue(s), e.g. the dataframe... i.e. do you think the filtration and/or run parameters could be tweaked and get this to run?

Thanks!

#Load required libs for pipeline and graphics
library("QTLseqr")
library("ggplot2")

#Import Data
rawData <- ("/home/project/QTLseqR/Ecab_snps_31_chr_mod.table")

#Define/name bulks and a vector for the chromosomes
HighBulk <- "EXP"
LowBulk <- "CTL"
#Make sure the chroms have correct prefix, e.g. "Chr" for next step.  
#The Equine test GATK table only has chromosome #'s, so added "Chr" with sed.  Also, sed NA to 0.
Chroms <- paste0(rep("Chr", 31), 1:31)

#Import data from GATK file- Use VariantsToTable after running GATK to create input.
df <- importFromGATK(file = rawData,highBulk = HighBulk,lowBulk = LowBulk,chromList = Chroms)

#Check the dataframe
head (df)
  CHROM POS REF ALT AD_REF.LOW AD_ALT.LOW DP.LOW GQ.LOW    PL.LOW SNPindex.LOW
1  Chr1 229   C   G          4          6     10     99 120,0,138    0.6000000
2  Chr1 233   G   T          9          3     12     66  66,0,346    0.2500000
3  Chr1 265   T   G          0         14     14     42  495,42,0    1.0000000
4  Chr1 311   T   C          0         10     10     30  386,30,0    1.0000000
5  Chr1 363   G   A         10          3     13     83  83,0,369    0.2307692
6  Chr1 466   A   T         21          5     26     99 115,0,867    0.1923077
  AD_REF.HIGH AD_ALT.HIGH DP.HIGH GQ.HIGH   PL.HIGH SNPindex.HIGH   REF_FRQ
1           9           1      10      11  11,0,333     0.1000000 0.6500000
2          11           0      11      33  0,33,417     0.0000000 0.8695652
3          10           7      17      99 226,0,292     0.4117647 0.3225806
4           9           5      14      99 130,0,281     0.3571429 0.3750000
5           9           0       9      27  0,27,327     0.0000000 0.8636364
6          11           0      11      33  0,33,495     0.0000000 0.8648649
    deltaSNP
1 -0.5000000
2 -0.2500000
3 -0.5882353
4 -0.6428571
5 -0.2307692
6 -0.1923077

#Filter the SNP set (These settings were used in the Rice test example data, may be problematic for this test)
df_filt <-
filterSNPs(
SNPset = df,
refAlleleFreq = 0.20,
minTotalDepth = 100,
maxTotalDepth = 400,
depthDifference = 100,
minSampleDepth = 40,
minGQ = 99,
verbose = TRUE
)

#Run QTLSeq Analysis

df_filt <- runQTLseqAnalysis(df_filt,
windowSize = 1e6,
popStruc = "F2",
bulkSize = c(6, 6),
replications = 10000,
intervals = c(95, 99)
)

Counting SNPs in each window...
Calculating tricube smoothed delta SNP index...
Error in `dplyr::mutate()`:
! Problem while computing `tricubeDeltaSNP = tricubeStat(POS = POS, Stat = deltaSNP, windowSize, ...)`.
ℹ The error occurred in group 1: CHROM = Chr1.
Caused by error in `lfproc()`:
! newsplit: out of vertex space
Run `rlang::last_error()` to see where the error occurred.

Backtrace:
  1. QTLseqr::runQTLseqAnalysis(...)
  8. QTLseqr:::tricubeStat(...)
 10. locfit::locfit(...)
 11. locfit (local) lfproc(...)

There were 16 warnings (use warnings() to see them)
> warnings()
Problem while computing `tricubeDeltaSNP = tricubeStat(POS = POS, Stat = deltaSNP, windowSize, ...)`.
ℹ procv: no points with non-zero weight
ℹ The warning occurred in group 1: CHROM = Chr1.
etc... x 16.
bmansfeld commented 1 year ago

Hey! Sorry for the slow reply. Sounds like you're having fun. Seems the smoothing algo ran out of vertex space ie window size too small. This could also be due to the data. But check this issue here - https://github.com/bmansfeld/QTLseqr/issues/9 and i'm pretty sure I mention this in the vignette too. In any case try to work with larger windows which will probably cause you to miss the causal snps (especially if they are really the only changes between EXP and CTL. Let me know if this at least removes the error. Ben

WallyL commented 1 year ago

Hi Ben,

Thanks for your response. Right after I submitted this question to the forum, I received the "real" data for a plant (peanut) project, so I have dropped working on this test dataset.

The plant sample group sizes are not what I would guess are typical for BSA-Seq, e.g. only ~ 20 individuals per group. The other caveat is that peanut is an allotetraploid that behaves more like an autotetraploid, so perhaps this pipeline won't work, or maybe other tweaks will be necessary. I went ahead tried ran to see if I got anything out the back end. thus far, I have called SNPs on three paired groups with GATK using the recommended GATK settings and employing the "--ploidy 4 " flag, and converted the VCF to table. The good news is I can now get past runQTLseqAnalysis (I get 50 warnings) and runGprimeAnalysis (same warnings), but plotQTLStats fails. I suspect this may be because after df filtering I'm only left with ca. 140 out of 1.3 million initial SNPs. Before I go into too much more detail about filtering and the plot error, should I start another issue on the forum or continue on this thread?

Thanks, Walt


From: Ben Mansfeld @.> Sent: Tuesday, December 27, 2022 10:39 PM To: bmansfeld/QTLseqr @.> Cc: Walt Lorenz @.>; Author @.> Subject: Re: [bmansfeld/QTLseqr] runQTLseqAnalysis error (Issue #55)

[EXTERNAL SENDER - PROCEED CAUTIOUSLY]

Hey! Sorry for the slow reply. Sounds like you're having fun. Seems the smoothing algo ran out of vertex space ie window size too small. This could also be due to the data. But check this issue here - #9https://github.com/bmansfeld/QTLseqr/issues/9 and i should have talked about this in the vignette. In any case try to work with larger windows which will probably cause you to miss the causal snps (especially if they are really the only changes between EXP and CTL. Let me know if this at least removes the error. Ben

— Reply to this email directly, view it on GitHubhttps://github.com/bmansfeld/QTLseqr/issues/55#issuecomment-1366343599, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AG6H3CP22PQIQAC56BOXVDDWPOZANANCNFSM6AAAAAATDVTXAY. You are receiving this because you authored the thread.Message ID: @.***>

bmansfeld commented 1 year ago

yeah the warnings that you are getting have to do with the fact that you have filtered a lot of SNPs so that your windows are essentially empty. When you mean only 20 individuals per group, do you mean before bulking or after? ie were there 200 individuals originally and then 20 were selected for the group? if so this is pretty standard.

From what I've read and heard from others, GATK is kinda funky with it's --polyp option so i'm not sure it is crucial, but i'll defer to any other polyploid expert since I am not that.

btw why did you filter so many SNPs? Ben

WallyL commented 1 year ago

I'm referring to after bulking, i.e. there are only 20-40 individuals pooled in each bulk, depending on the comparison groups I'm looking at.

I have no idea why so many SNPs are being filtered out. One group has 3 million SNPs and I end up with 148. I started out using the parameters that are almost identical to what you had for the rice data. Relaxing the filtering as shown below only yields ca. 1,700 total SNPs and, strangely, the final results are different from what I get with the more stringent filtering. That is, I actually get fewer and different significant SNPs identified with the relaxed filtering than with stringent filtering. Perhaps I am doing something wrong here?

I've also tryed to modify the code to include a ploidy component as was suggested by Hugo using the nbinomial model, but I'm sure on how to implement it and not well versed in R... so far it keeps blowing up on me. Walt

"Relaxed Filtering Settings"

df_filt2 <-
filterSNPs(
SNPset = df,
refAlleleFreq = 0.10, #CHANGE
minTotalDepth = 20, #CHANGE
maxTotalDepth = 400,
depthDifference = 100,
minSampleDepth = 20, #CHANGE
minGQ = 75, #CHANGE
verbose = TRUE
)

############################################################## Filtering by reference allele frequency: 0.1 <= REF_FRQ <= 0.9 ...Filtered 688795 SNPs Filtering by total sample read depth: Total DP >= 20 ...Filtered 313740 SNPs Filtering by total sample read depth: Total DP <= 400 ...Filtered 494 SNPs Filtering by per sample read depth: DP >= 20 ...Filtered 1002152 SNPs Filtering by Genotype Quality: GQ >= 75 ...Filtered 611563 SNPs Filtering by difference between bulks <= 100 ...Filtered 23 SNPs Original SNP number: 2618491, Filtered: 2616767, Remaining: 1724 ##############################################################

"Stringent Filtering"

df_filt <-
filterSNPs(
SNPset = df,
refAlleleFreq = 0.20,
minTotalDepth = 40,
maxTotalDepth = 400,
depthDifference = 100,
minSampleDepth = 20,
minGQ = 99,
verbose = TRUE
)

############################################################## Filtering by reference allele frequency: 0.2 <= REF_FRQ <= 0.8 ...Filtered 490804 SNPs Filtering by total sample read depth: Total DP >= 40 ...Filtered 180687 SNPs Filtering by total sample read depth: Total DP <= 400 ...Filtered 217 SNPs Filtering by per sample read depth: DP >= 20 ...Filtered 36940 SNPs Filtering by Genotype Quality: GQ >= 99 ...Filtered 155385 SNPs Filtering by difference between bulks <= 100 ...Filtered 1 SNPs Original SNP number: 864182, Filtered: 864034, Remaining: 148 ##############################################################