databio / GenomicDistributions

Calculate and plot distributions of genomic ranges
http://code.databio.org/GenomicDistributions
Other
25 stars 10 forks source link

Trying to use calcChromBinsRef for a custom genome fails #205

Closed aminakur closed 1 year ago

aminakur commented 1 year ago

I am trying to use calcChromBinsRef to calculate the distribution of features in my custom reference genome. I first loaded the ref genome using: azucena <- read.fasta("~/Downloads/azucena.fna")

Then:

> x = calcChromBinsRef(granges, "azucena")
Error in getChromSizes(refAssembly) : 
  chromSizes_azucena not available in GenomicDistributions package or GenomicDistributionsData package, please use getChromSizesFromFasta() to get chromosome sizes.

I calculated the chromosome sizes using getChromSizesFromFasta:

chrom_sizes = getChromSizesFromFasta("~/Downloads/azucena.fna")

But received the same error message.

Next, I tried calcChromBinsRef with chrom_sizes:

> x = calcChromBinsRef(granges, chromSizes_azucena)
Error in .validateInputs(list(refAssembly = "character", query = c("GRanges",  : 
  refAssembly must be a character.  Got: integer

So how should I use my reference genome exactly?

kkupkova commented 1 year ago

Hi!

In your case you need to use calcChromBins function. The functions with "ref" suffix are available only for human and mouse.

The process is following: 1) calculate the chromosome sizes: chrom_sizes = getChromSizesFromFasta("~/Downloads/azucena.fna")

2) calculate equally sized bins across chromosomes (you can change the number or size of the bins): bins = getGenomeBins(chrom_sizes)

3) pass the bins to the calcChromBins function: x = calcChromBins(granges, bins)

To plot the distribution: plotChromBins(x)

Please let me know if you run into more issues or if you need assistance with some other functions.

Best, Kristyna

aminakur commented 1 year ago

Thank you! This worked. I have two more questions:

  1. How can I change the size of the bins? I understand that I can change the number using binCount parameter.
  2. Is it possible to extract bins (coordinates) with the highest/lowest counts of query features? Or to extract a table with the bins coordinates and their respective counts of features?
kkupkova commented 1 year ago

Glad that helped! 1) Sorry, that was my mistake. Current version of GenomicDistributions allows only binCount parameter. But you can always check the size of the bins straight by looking at the bins object and adjust the binCount accordingly to fit your desired bin width.

2) You can extract that from you calculated x object. The calcChromBins function returns a table, where the first three columns are bin coordinates, then there are two columns (regionID, withingGroupID) which just allow positioning of the bins for plotting, then there is column N, which has the information about the number of regions falling into a given bin (if you are running the function on multiple datasets, there is also an additional column name, which distinguishes the input queries).

To get the bins with the highest number of regions, you can sort the table as following and the bin with the highest number will be in the first row of the outputted table:

library(tidyverse)

highestSort = x %>% 
  arrange(desc(N))

To arrange from lowest to highest:

LowestSort = x %>% 
  arrange(N)
kkupkova commented 1 year ago

Fyi, you might also be interested in following vignette that was part of a Bioconductor 2022 workshop, which contains more specific instructions (in the second part of the workshop), how to work with custom genomes and how to edit plots.

BioC 2022 workshop vignette

aminakur commented 1 year ago

Thank you!