bmansfeld / QTLseqr

QTLseqr is an R package for QTL mapping using NGS Bulk Segregant Analysis
64 stars 42 forks source link

ImportFromGATK refuses to take in non-numerical chromosomes #58

Closed Somatogenomics closed 1 year ago

Somatogenomics commented 1 year ago

Dear @bmansfeld,

I am facing problem in reading in my table with the ImportFromGATK function of QTLseqr. I have a non numeric chromosomes (scaffold level) and I have tried to convert them into numbers. However, I got nothing from the df dataframe after using "sed" to convert the first few prefix into "Chr", as all the chromosomes were removed before filtering. What I am dealing with looks like this CAJHJT010000001.1 up to CAJHJT010000071.1.

I would appreciate any help towards fixing this issue.

Error encountered without conversion with sed:

No such file or directoryError in file(file, "rt") : cannot open the connection

With converted chrs:

Removing the following chromosomes: Chr01.1, Chr02.1, Chr03.1, Chr04.1, Chr05.1, Chr06.1, Chr07.1, Chr08.1, Chr09.1, Chr10.1, Chr11.1, Chr12.1, and so on.

Any help would be highly appreciated.

bmansfeld commented 1 year ago

Hi, thanks for the question. Not sure I follow. Could you please paste the exact command you ran when you got the error? You should be able to just import all the chroms form the file without telling importfromGatk which to choose. The error here suggest the file is missing from the working directory or something. If you share the head of your GATK table file I could see also some other reasons why this is breaking. Hopefully should be a quick fix, Ben

Somatogenomics commented 1 year ago

Thank you for your response.

Here are the commands:

HighBulk <- "wptslFF21" LowBulk <- "wptslFF26EC" file <- "input.table"

Chroms <- paste0(rep("Chr", 6), 1:6)

Import SNP data from file

df <- importFromGATK( file = file, highBulk = HighBulk, lowBulk = LowBulk, chromList = Chroms )

I got the error mentioned from my previous post after the above commands. I have the file in my working directory.

After my question yesterday, I went on to use sed to convert the "CAJHJT0100000" part of the scaffolds into "Chr" and went ahead to use the commands above. However, some chromosomes were removed:

Removing the following chromosomes: Chr01.1, Chr02.1, Chr03.1, Chr04.1, Chr05.1, Chr06.1, Chr07.1, Chr08.1, Chr09.1

I got the above when I ran the ImportFromGATK function.

This subsequently affected the downstream analysis and affected the results I got.

I will try and import without removing anything. To do that, does that mean there is no need of putting the chromList= Chroms option?

Finally, I have attached a snapshot of my data below.

QTLseqr

Thank you for the feedback.

bmansfeld commented 1 year ago

Hi, chromList is a list of only the chromosomes you want to include in the analysis. Since the analysis usually assumes large window sizes of ~20cM or at least 1Mb+ depending on your spp., it is usually recommended to exclude any non chromosome length scaffolds. So chromList should be those you want to keep!

If I understand your code and example correctly I would assume you want your first 6 scaffolds? in that case there is no need to sed anything and everything can be cleaned up in R. You should set something like:

nScaff <- 6
Chroms <- paste0("CAJHJT01", stringr::str_pad(width = 6, pad = "0", string = 1:nScaff), ".1")

Then importFromGATK(chromList = Chroms, ...) will only include scaffolds with the chrom id of "CAJHJT01000001.1" "CAJHJT01000002.1" "CAJHJT01000003.1" "CAJHJT01000004.1" "CAJHJT01000005.1" "CAJHJT01000006.1"... until nScaff.

I really dont recommend running the analysis on 71 small scaffolds that will be very hard to interpret.

Hope that makes sense. Let me know if that works for you Ben

Somatogenomics commented 1 year ago

Thank you for your time.

After using this:

nScaff <- 6 Chroms <- paste0("CAJHJT01", stringr::str_pad(width = 6, pad = "0", string = 1:nScaff), ".1")

I then ran:

df <- importFromGATK( file = file, highBulk = HighBulk, lowBulk = LowBulk, chromList = Chroms )

All the scaffolds were removed:

Removing the following chromosomes: CAJHJT010000001.1, CAJHJT010000002.1, CAJHJT010000003.1, CAJHJT010000004.1, CAJHJT010000005.1, CAJHJT010000006.1, CAJHJT010000007.1, CAJHJT010000008.1, CAJHJT010000009.1, CAJHJT010000010.1, CAJHJT010000011.1, CAJHJT010000012.1, CAJHJT010000013.1, CAJHJT010000014.1, CAJHJT010000015.1, CAJHJT010000016.1, CAJHJT010000018.1, CAJHJT010000020.1, CAJHJT010000021.1, CAJHJT010000022.1, CAJHJT010000023.1, CAJHJT010000024.1, CAJHJT010000025.1, CAJHJT010000026.1, CAJHJT010000027.1, CAJHJT010000028.1, CAJHJT010000029.1, CAJHJT010000030.1, CAJHJT010000031.1, CAJHJT010000032.1, CAJHJT010000033.1, CAJHJT010000034.1, CAJHJT010000035.1, CAJHJT010000036.1, CAJHJT010000037.1, CAJHJT010000038.1, CAJHJT010000039.1, CAJHJT010000040.1, CAJHJT010000041.1, CAJHJT010000042.1, CAJHJT010000043.1,.............................................

dim(df) gives 0, 18. That is, no chromosome was present.

bmansfeld commented 1 year ago

Sorry width should be set to 7 not 6 in str_pad()

Somatogenomics commented 1 year ago

Thank you. That worked. However, I encountered another error while plotting for both the G´ and the QTLseq analyses.

plotQTLStats(SNPset = df_filt, var = "Gprime", plotThreshold = TRUE, q = 0.5) Warning: The q threshold is too low. No threshold line will be drawnError in get_labels(): ! breaks and labels are different lengths Backtrace:

  1. base (local) <fn>(x)
  2. ggplot2:::print.ggplot(x)
  3. ggplot2:::ggplot_gtable.ggplot_built(data)
  4. layout$setup_panel_guides(plot$guides, plot$layers, plot$mapping)
  5. ggplot2 (local) setup_panel_guides(..., self = self) ...
  6. ggplot2:::guide_train.axis(guide, panel_params[[aesthetic]])
  7. scale$get_labels(breaks)
  8. ggplot2 (local) get_labels(..., self = self)
  9. self$scale$get_labels(breaks)
  10. ggplot2 (local) get_labels(..., self = self)
  11. plotQTLStats(SNPset = df_filt, var = "deltaSNP", plotIntervals = TRUE) Error in get_labels(): ! breaks and labels are different lengths Backtrace:

  12. base (local) <fn>(x)
  13. ggplot2:::print.ggplot(x)
  14. ggplot2:::ggplot_gtable.ggplot_built(data)
  15. layout$setup_panel_guides(plot$guides, plot$layers, plot$mapping)
  16. ggplot2 (local) setup_panel_guides(..., self = self) ...
  17. ggplot2:::guide_train.axis(guide, panel_params[[aesthetic]])
  18. scale$get_labels(breaks)
  19. ggplot2 (local) get_labels(..., self = self)
  20. self$scale$get_labels(breaks)
  21. ggplot2 (local) get_labels(..., self = self)
Somatogenomics commented 1 year ago

@bmansfeld ,

Thank you for your time.

I extracted only one scaffold instead of the six above, I am still getting the same error as above.

the codes

HighBulk <- "wptslFF21" LowBulk <- "wptslFF26EC" file <- "input.table"

scaffold_ID <- "CAJHJT010000005.1" Chroms<- scaffold_ID

Import SNP data from file

df <- importFromGATK( file = file, highBulk = HighBulk, lowBulk = LowBulk, chromList = Chroms )

The code above successfully removed all other scaffolds, leaving the only designated one behind.

Filter SNPs based on some criteria

df_filt <- filterSNPs( SNPset = df, refAlleleFreq = 0.20, minTotalDepth = 100, maxTotalDepth = 300, minSampleDepth = 30, minGQ = 99 )

This left me with some snps.

Run G' analysis

df_filt <- runGprimeAnalysis( SNPset = df_filt, windowSize = 1e6, outlierFilter = "deltaSNP")

and this:

Run QTLseq analysis

df_filt <- runQTLseqAnalysis( SNPset = df_filt, windowSize = 1e6, popStruc = "F2", bulkSize = c(25, 25), replications = 10000, intervals = c(95, 99)

)

The plotting was where the problem comes up.

plotQTLStats(SNPset = df_filt, var = "Gprime", plotThreshold = TRUE, q = 0.8) plotQTLStats(SNPset = df_filt, var = "deltaSNP", plotIntervals = TRUE)

And the error

plotQTLStats(SNPset = df_filt, var = "Gprime", plotThreshold = TRUE, q = 0.8) Error in get_labels(): ! breaks and labels are different lengths Backtrace:

  1. base (local) <fn>(x)
  2. ggplot2:::print.ggplot(x)
  3. ggplot2:::ggplot_gtable.ggplot_built(data)
  4. layout$setup_panel_guides(plot$guides, plot$layers, plot$mapping)
  5. ggplot2 (local) setup_panel_guides(..., self = self) ...
    1. ggplot2:::guide_train.axis(guide, panel_params[[aesthetic]])
    2. scale$get_labels(breaks)
    3. ggplot2 (local) get_labels(..., self = self)
    4. self$scale$get_labels(breaks)
    5. ggplot2 (local) get_labels(..., self = self)

      plotQTLStats(SNPset = df_filt, var = "deltaSNP", plotIntervals = TRUE) Error in get_labels(): ! breaks and labels are different lengths Backtrace:

  6. base (local) <fn>(x)
  7. ggplot2:::print.ggplot(x)
  8. ggplot2:::ggplot_gtable.ggplot_built(data)
  9. layout$setup_panel_guides(plot$guides, plot$layers, plot$mapping)
  10. ggplot2 (local) setup_panel_guides(..., self = self) ...
    1. ggplot2:::guide_train.axis(guide, panel_params[[aesthetic]])
    2. scale$get_labels(breaks)
    3. ggplot2 (local) get_labels(..., self = self)
    4. self$scale$get_labels(breaks)
    5. ggplot2 (local) get_labels(..., self = self)

Thank you for your anticipated help.

bmansfeld commented 1 year ago

Hi, It's really hard to understand what the problem is with out the data set. First thing i notice though is that you are setting q = 0.8 which is essentially the exact opposite of what you want. you need a q of 0.1 at the highest.

Not sure what else is wrong with the dataframe that is causing ggplot to glitch here. If you wouldn't mind sharing just that one chromosome worth of data I could take a look.

If I were you I would try manually plotting the results of the analyses in your own ggplot script. All the data you need are in the df_filt dataframe after you run the respective analyses. Let me know if there is anything else I can help with.