Bioconductor / GenomeInfoDb

Utilities for manipulating chromosome names, including modifying them to follow a particular naming style
https://bioconductor.org/packages/GenomeInfoDb
30 stars 13 forks source link

Error on setting `seqlevelsStyle(gal) <- "RefSeq"` #105

Open rcastelo opened 9 months ago

rcastelo commented 9 months ago

hi,

setting seqlevelsStyle() to RefSeq works for, e.g., TxDb objects, but it gives an error with GenomicAlignment objects. Here is a minimal repex:

library(GenomicAlignments)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)

txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
seqlevelsStyle(txdb) <- "RefSeq"          ## this seems to work fine
Warning message:
In (function (seqlevels, genome, new_style)  :
  cannot switch some hg38's seqlevels from UCSC to RefSeq style
head(seqnames(seqinfo(txdb)))
[1] "NC_000001.11" "NC_000002.12" "NC_000003.12" "NC_000004.12" "NC_000005.10"
[6] "NC_000006.12"

bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools",mustWork=TRUE)
gal1 <- readGAlignments(bamfile)
seqlevelsStyle(gal1) <- "RefSeq"           ## this prompts an error
Error in mapSeqlevels(x_seqlevels, value, drop = FALSE) : 
  supplied seqname style "RefSeq" is not supported

I guess if it is supported for TxDb objects, it should be also supported for GenomicAlignment objects.

hpages commented 9 months ago

Hi Robert,

It doesn't work here for this particular GAlignments object because its genome is unknown:

> seqinfo(gal1)
Seqinfo object with 2 sequences from an unspecified genome:
  seqnames seqlengths isCircular genome
  seq1           1575         NA   <NA>
  seq2           1584         NA   <NA>

seqlevelsStyle(txdb) <- "RefSeq" needs to know the current genome in order to be able to "translate" the chomosome names.

So it doesn't really have much to do with the fact that the object is a TxDb or GAlignments, you would get the same error with a TxDb object that has an unspecified genome.

For example with a simple GRanges object:

gr  <- GRanges(c("chr4:11-299", "chr22_KQ759761v1_alt:25-50", "chrX_MU273397v1_alt:2500-9999"))
gr
# GRanges object with 3 ranges and 0 metadata columns:
#                   seqnames    ranges strand
#                      <Rle> <IRanges>  <Rle>
#   [1]                 chr4    11-299      *
#   [2] chr22_KQ759761v1_alt     25-50      *
#   [3]  chrX_MU273397v1_alt 2500-9999      *
#   -------
#   seqinfo: 3 sequences from an unspecified genome; no seqlengths

seqlevelsStyle(gr) <- "RefSeq" 
# Error in mapSeqlevels(x_seqlevels, value, drop = FALSE) : 
#   supplied seqname style "RefSeq" is not supported

genome(gr) <- "hg38"
seqinfo(gr)
# Seqinfo object with 3 sequences from hg38 genome; no seqlengths:
#   seqnames             seqlengths isCircular genome
#   chr4                         NA         NA   hg38
#   chr22_KQ759761v1_alt         NA         NA   hg38
#   chrX_MU273397v1_alt          NA         NA   hg38

seqlevelsStyle(gr) <- "RefSeq" 
seqinfo(gr)
# Seqinfo object with 3 sequences from GRCh38.p14 genome; no seqlengths:
#   seqnames       seqlengths isCircular     genome
#   NC_000004.12           NA         NA GRCh38.p14
#   NW_015148968.1         NA         NA GRCh38.p14
#   NW_025791820.1         NA         NA GRCh38.p14

Anyways, the error message is admittedly not very helpful here. I'll try to improve that...

rcastelo commented 8 months ago

I see, thanks Hervé. Looking at the code of the package of GenomicAlignments, the line that builds the Seqinfo object here:

seqinfo <- Seqinfo(seqnames=names(seqlengths), seqlengths=seqlengths)

does not attempt to fill the genome metadata column, probably because that's not something straightforward to do from a BAM file (googling for it, I've only found this Biostars post that says that one would need to figure out the genome version from sequence names and lengths). So, my understanding is that currently there is no way for readGAlignments() to return an object with a value in the genome metadata column.

The package VariantAnnotation, for instance, deals with a similar problem with VCF files, by defining a class called VCFHeader, with a reader method that attempts to figure out that genome version, and provides a seqinfo() getter:

library(VariantAnnotation)

hdr <- scanVcfHeader(fl)
hdr
class: VCFHeader 
samples(3): NA00001 NA00002 NA00003
meta(8): fileDate fileformat ... SAMPLE PEDIGREE
fixed(1): FILTER
info(6): NS DP ... DB H2
geno(4): GT GQ DP HQ
seqinfo(hdr)
Seqinfo object with 1 sequence from B36 genome:
  seqnames seqlengths isCircular genome
  20         62435964         NA    B36

Wouldn't make sense for the package GenomicAlignments to provide an analogous functionality? (knowing that probably the header of a VCF file has more specific information about the reference genome that the header of BAM file, but at least the user could provide that genome value through an argument to the readGAlignments() function)

hpages commented 8 months ago

Looking at the code of the package of GenomicAlignments, the line that builds the Seqinfo object etc...

Indeed. I just changed this in GenomicAlignments 1.39.4 so now readGAlignments(file, ...) returns a GAlignments object with seqinfo(BamFile(file)) on it.

probably because that's not something straightforward to do from a BAM file

According to the SAM specs, the genome assembly can be specified via the AS tag in each @SQ header line of the file. I don't know how often that tag is actually present but it seems that the BAM files from the 1000 genomes project have it. For example, in this file, the @SQ lines contain AS:NCBI37:

library(Rsamtools)
bamfile <- BamFile("HG00096.chrom20.ILLUMINA.bwa.GBR.low_coverage.20120522.bam")
hdr <- scanBamHeader(bamfile)

cat(head(sapply(hdr$text[names(hdr$text) == "@SQ"], paste, collapse=" ")), sep="\n")
# SN:1 LN:249250621 M5:1b22b98cdeb4a9304cb5d48026a85128 UR:ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz        AS:NCBI37       SP:Human
# SN:2 LN:243199373 M5:a0d9851da00400dec1098a9255ac712e UR:ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz        AS:NCBI37       SP:Human
# SN:3 LN:198022430 M5:fdfd811849cc2fadebc929bb925902e5 UR:ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz        AS:NCBI37       SP:Human
# SN:4 LN:191154276 M5:23dccd106897542ad87d2765d28a19a1 UR:ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz        AS:NCBI37       SP:Human
# SN:5 LN:180915260 M5:0740173db9ffd264d728f32784845cd7 UR:ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz        AS:NCBI37       SP:Human
# SN:6 LN:171115067 M5:1d3a93a248d92a729ee764823acbbc6b UR:ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz        AS:NCBI37       SP:Human

However, there are at least two problems with this:

at least the user could provide that genome value through an argument to the readGAlignments() function

Sure, but note that it's easy enough to set the genome on the returned object. So adding this argument to readGAlignments() won't add that much value.

rcastelo commented 8 months ago

hi Hervé, thanks a lot for the thorough analysis of the underlying problem, for the fix, and for starting the feature request on Rsamtools. Indeed, being able to do genome(gal) <- "GRCh37" is sufficient and there's no need to add a genome argument to the function readGAlignments().