Open rcastelo opened 9 months ago
Hi Robert,
It doesn't work here for this particular GAlignments object because its genome is unknown:
> seqinfo(gal1)
Seqinfo object with 2 sequences from an unspecified genome:
seqnames seqlengths isCircular genome
seq1 1575 NA <NA>
seq2 1584 NA <NA>
seqlevelsStyle(txdb) <- "RefSeq"
needs to know the current genome in order to be able to "translate" the chomosome names.
So it doesn't really have much to do with the fact that the object is a TxDb or GAlignments, you would get the same error with a TxDb object that has an unspecified genome.
For example with a simple GRanges object:
gr <- GRanges(c("chr4:11-299", "chr22_KQ759761v1_alt:25-50", "chrX_MU273397v1_alt:2500-9999"))
gr
# GRanges object with 3 ranges and 0 metadata columns:
# seqnames ranges strand
# <Rle> <IRanges> <Rle>
# [1] chr4 11-299 *
# [2] chr22_KQ759761v1_alt 25-50 *
# [3] chrX_MU273397v1_alt 2500-9999 *
# -------
# seqinfo: 3 sequences from an unspecified genome; no seqlengths
seqlevelsStyle(gr) <- "RefSeq"
# Error in mapSeqlevels(x_seqlevels, value, drop = FALSE) :
# supplied seqname style "RefSeq" is not supported
genome(gr) <- "hg38"
seqinfo(gr)
# Seqinfo object with 3 sequences from hg38 genome; no seqlengths:
# seqnames seqlengths isCircular genome
# chr4 NA NA hg38
# chr22_KQ759761v1_alt NA NA hg38
# chrX_MU273397v1_alt NA NA hg38
seqlevelsStyle(gr) <- "RefSeq"
seqinfo(gr)
# Seqinfo object with 3 sequences from GRCh38.p14 genome; no seqlengths:
# seqnames seqlengths isCircular genome
# NC_000004.12 NA NA GRCh38.p14
# NW_015148968.1 NA NA GRCh38.p14
# NW_025791820.1 NA NA GRCh38.p14
Anyways, the error message is admittedly not very helpful here. I'll try to improve that...
I see, thanks Hervé. Looking at the code of the package of GenomicAlignments, the line that builds the Seqinfo
object here:
seqinfo <- Seqinfo(seqnames=names(seqlengths), seqlengths=seqlengths)
does not attempt to fill the genome
metadata column, probably because that's not something straightforward to do from a BAM file (googling for it, I've only found this Biostars post that says that one would need to figure out the genome version from sequence names and lengths). So, my understanding is that currently there is no way for readGAlignments()
to return an object with a value in the genome
metadata column.
The package VariantAnnotation
, for instance, deals with a similar problem with VCF files, by defining a class called VCFHeader
, with a reader method that attempts to figure out that genome version, and provides a seqinfo()
getter:
library(VariantAnnotation)
hdr <- scanVcfHeader(fl)
hdr
class: VCFHeader
samples(3): NA00001 NA00002 NA00003
meta(8): fileDate fileformat ... SAMPLE PEDIGREE
fixed(1): FILTER
info(6): NS DP ... DB H2
geno(4): GT GQ DP HQ
seqinfo(hdr)
Seqinfo object with 1 sequence from B36 genome:
seqnames seqlengths isCircular genome
20 62435964 NA B36
Wouldn't make sense for the package GenomicAlignments
to provide an analogous functionality? (knowing that probably the header of a VCF file has more specific information about the reference genome that the header of BAM file, but at least the user could provide that genome
value through an argument to the readGAlignments()
function)
Looking at the code of the package of GenomicAlignments, the line that builds the Seqinfo object etc...
Indeed. I just changed this in GenomicAlignments 1.39.4 so now readGAlignments(file, ...)
returns a GAlignments object with seqinfo(BamFile(file))
on it.
probably because that's not something straightforward to do from a BAM file
According to the SAM specs, the genome assembly can be specified via the AS
tag in each @SQ
header line of the file. I don't know how often that tag is actually present but it seems that the BAM files from the 1000 genomes project have it. For example, in this file, the @SQ
lines contain AS:NCBI37
:
library(Rsamtools)
bamfile <- BamFile("HG00096.chrom20.ILLUMINA.bwa.GBR.low_coverage.20120522.bam")
hdr <- scanBamHeader(bamfile)
cat(head(sapply(hdr$text[names(hdr$text) == "@SQ"], paste, collapse=" ")), sep="\n")
# SN:1 LN:249250621 M5:1b22b98cdeb4a9304cb5d48026a85128 UR:ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz AS:NCBI37 SP:Human
# SN:2 LN:243199373 M5:a0d9851da00400dec1098a9255ac712e UR:ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz AS:NCBI37 SP:Human
# SN:3 LN:198022430 M5:fdfd811849cc2fadebc929bb925902e5 UR:ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz AS:NCBI37 SP:Human
# SN:4 LN:191154276 M5:23dccd106897542ad87d2765d28a19a1 UR:ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz AS:NCBI37 SP:Human
# SN:5 LN:180915260 M5:0740173db9ffd264d728f32784845cd7 UR:ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz AS:NCBI37 SP:Human
# SN:6 LN:171115067 M5:1d3a93a248d92a729ee764823acbbc6b UR:ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz AS:NCBI37 SP:Human
However, there are at least two problems with this:
The first one is that, unfortunately, seqinfo(bamfile)
does not pick up the genome:
seqinfo(bamfile)
# Seqinfo object with 86 sequences from an unspecified genome:
# seqnames seqlengths isCircular genome
# 1 249250621 <NA> <NA>
# 2 243199373 <NA> <NA>
# 3 198022430 <NA> <NA>
# 4 191154276 <NA> <NA>
# 5 180915260 <NA> <NA>
# ... ... ... ...
# GL000194.1 191469 <NA> <NA>
# GL000225.1 211173 <NA> <NA>
# GL000192.1 547496 <NA> <NA>
# NC_007605 171823 <NA> <NA>
# hs37d5 35477943 <NA> <NA>
So that's something that would need to be fixed in Rsamtools. Then, once this is addressed, readGAlignments()
will pick it up too (thanks to the change I made in GenomicAlignments 1.39.4).
The second one is specific to this particular BAM file from the 1000 genomes project (others are probably affected but I didn't check that). The problem with this file it that they used a non-official name for the assembly (NCBI37
is not a valid assembly name, see https://www.ncbi.nlm.nih.gov/assembly/?term=NCBI37). The official name for this assembly is GRCh37
. This means that, even if seqinfo(bamfile)
gets fixed in Rsamtools, then seqlevelsStyle(gal) <- "RefSeq"
won't work out-of-the-box on the GAlignments object returned by readGAlignments()
. To make this work, one will first need to manually fix the object with genome(gal) <- "GRCh37"
.
at least the user could provide that
genome
value through an argument to thereadGAlignments()
function
Sure, but note that it's easy enough to set the genome
on the returned object. So adding this argument to readGAlignments()
won't add that much value.
hi Hervé, thanks a lot for the thorough analysis of the underlying problem, for the fix, and for starting the feature request on Rsamtools. Indeed, being able to do genome(gal) <- "GRCh37"
is sufficient and there's no need to add a genome argument to the function readGAlignments()
.
hi,
setting
seqlevelsStyle()
toRefSeq
works for, e.g.,TxDb
objects, but it gives an error withGenomicAlignment
objects. Here is a minimal repex:I guess if it is supported for
TxDb
objects, it should be also supported forGenomicAlignment
objects.