Closed loukesio closed 2 years ago
Hi Loukas, @loukesio
It's probably better to ask on the support.bioconductor.org website.
Have you looked at the DNAStringSet
object from RSamtools
?
suppressPackageStartupMessages(library(Rsamtools))
bamfile <- "~/Downloads/OneDrive-2022-03-09/test_4_seqs_sorted.bam"
bai <- "~/Downloads/OneDrive-2022-03-09/test_4_seqs_sorted.bam.bai"
bam <- BamFile(file = bamfile, index = bai, yieldSize = 20)
scanBam(bam, param = ScanBamParam(what = scanBamWhat()))
#> [[1]]
#> [[1]]$qname
#> [1] "NS500351:447:HKHV5AFX2:1:11209:10247:15355"
#> [2] "NS500351:447:HKHV5AFX2:1:21208:20038:9365"
#> [3] "NS500351:447:HKHV5AFX2:3:11407:16829:9372"
#> [4] "NS500351:447:HKHV5AFX2:4:11503:24882:5291"
#> [5] "NS500351:447:HKHV5AFX2:4:21605:5592:11298"
#>
#> [[1]]$flag
#> [1] 16 16 16 16 16
#>
#> [[1]]$rname
#> [1] Reference_barcodes Reference_barcodes Reference_barcodes Reference_barcodes
#> [5] Reference_barcodes
#> Levels: Reference_barcodes
#>
#> [[1]]$strand
#> [1] - - - - -
#> Levels: + - *
#>
#> [[1]]$pos
#> [1] 1 1 1 1 1
#>
#> [[1]]$qwidth
#> [1] 292 269 276 288 277
#>
#> [[1]]$mapq
#> [1] 254 254 254 254 254
#>
#> [[1]]$cigar
#> [1] "119I16M1D155M2I" "95I172M2I" "104I172M" "116I172M"
#> [5] "104I172M1I"
#>
#> [[1]]$mrnm
#> [1] <NA> <NA> <NA> <NA> <NA>
#> Levels: Reference_barcodes
#>
#> [[1]]$mpos
#> [1] NA NA NA NA NA
#>
#> [[1]]$isize
#> [1] 0 0 0 0 0
#>
#> [[1]]$seq
#> DNAStringSet object of length 5:
#> width seq
#> [1] 292 GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG...GTATTAGCTTACGACGCTACACCGATGCAACCA
#> [2] 269 GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG...GTATTAGCTTACGACGCTACACCGATCCTTGAT
#> [3] 276 GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG...TCGTATTAGCTTACGACGCTACACCACGGCACT
#> [4] 288 GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG...TCGTATTAGCTTACGACGCTACACCTGAGGCCT
#> [5] 277 GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG...CGTATTAGCTTACGACGCTACACCTGAGTTCTT
#>
#> [[1]]$qual
#> PhredQuality object of length 5:
#> width seq
#> [1] 292 ...
#> [2] 269 ...
#> [3] 276 ...
#> [4] 288 ...
#> [5] 277 ...
seqs <- scanBam(bam, param = ScanBamParam(what = "seq"))[[1]][[1]]
seqs
#> DNAStringSet object of length 5:
#> width seq
#> [1] 292 GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG...GTATTAGCTTACGACGCTACACCGATGCAACCA
#> [2] 269 GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG...GTATTAGCTTACGACGCTACACCGATCCTTGAT
#> [3] 276 GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG...TCGTATTAGCTTACGACGCTACACCACGGCACT
#> [4] 288 GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG...TCGTATTAGCTTACGACGCTACACCTGAGGCCT
#> [5] 277 GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG...CGTATTAGCTTACGACGCTACACCTGAGTTCTT
Created on 2022-03-09 by the reprex package (v2.0.1)
I'm not a biologist so I don't really understand your question.
FWIW, you can use, for example,
vmatchPattern("TACGGTTATATTGACAGACCGAGGG", seqs)
to get the locations of this pattern found in the sequences and
subseq(seqs, 54, 78)
to get positions in the observed data.
CC: @hpages Hervé is the expert and maintainer of Biostrings and he can definitely answer the question if there's enough information.
Best, Marcel
Agree with Marcel that this belongs on the support site (and that Hervé is the expert!), but I did
> library(GenomicAlignments)
>
> ## region of interest
> roi = GRanges("Reference_barcodes:54-78")
> param = ScanBamParam(which = roi, what = "seq")
>
> ## read sequences overlapping region of interest
> aln = readGAlignments("~/Downloads/test_4_seqs_sorted.bam", param = param)
> aln
GAlignments object with 5 alignments and 2 metadata columns:
seqnames strand cigar qwidth start end
<Rle> <Rle> <character> <integer> <integer> <integer>
[1] Reference_barcodes - 119I16M1D155M2I 292 1 172
[2] Reference_barcodes - 95I172M2I 269 1 172
[3] Reference_barcodes - 104I172M 276 1 172
[4] Reference_barcodes - 116I172M 288 1 172
[5] Reference_barcodes - 104I172M1I 277 1 172
width njunc | seq qname
<integer> <integer> | <DNAStringSet> <character>
[1] 172 0 | GGGGGGGGGG...GATGCAACCA NS500351:447:HKHV5AF..
[2] 172 0 | GGGGGGGGGG...GATCCTTGAT NS500351:447:HKHV5AF..
[3] 172 0 | GGGGGGGGGG...CCACGGCACT NS500351:447:HKHV5AF..
[4] 172 0 | GGGGGGGGGG...CCTGAGGCCT NS500351:447:HKHV5AF..
[5] 172 0 | GGGGGGGGGG...CTGAGTTCTT NS500351:447:HKHV5AF..
-------
seqinfo: 1 sequence from an unspecified genome
> names(aln) = mcols(aln)$qname
>
> ## 'map' region of interest to cigar; the 'ranges' represent coordinates on the read sequences
> map = mapToAlignments(roi, aln)
> map
GRanges object with 5 ranges and 2 metadata columns:
seqnames ranges strand | xHits alignmentsHits
<Rle> <IRanges> <Rle> | <integer> <integer>
[1] NS500351:447:HKHV5AF.. 172-196 * | 1 1
[2] NS500351:447:HKHV5AF.. 149-173 * | 1 2
[3] NS500351:447:HKHV5AF.. 158-182 * | 1 3
[4] NS500351:447:HKHV5AF.. 170-194 * | 1 4
[5] NS500351:447:HKHV5AF.. 158-182 * | 1 5
-------
seqinfo: 5 sequences from an unspecified genome; no seqlengths
>
> ## extract the subsequences of the aligned read
> subseq(mcols(aln)$seq, start(map), end(map))
DNAStringSet object of length 5:
width seq
[1] 25 TACGGTTATATTGACAGACCGAGGG
[2] 25 TACGGTTATATTGACAGACCGAGGG
[3] 25 TACGGTTATATTGACAGACCGAGGG
[4] 25 TAATTCGAAACCATAAGGCTCGCAA
[5] 25 TAATTCGAAACCATAAGGCTCGCAA
These almost but do not quite match your expectation, so I have done something wrong...
Thank you Martin!
What is which
in this example? Is it roi
with a different name?
reprex::reprex
may have caught that as well :)
Not if I'd edited the reproducible example after posting into github! which == roi
; I tried to update
FWIW I get the same thing as Martin:
library(GenomicAlignments)
stack <- stackStringsFromBam("test_4_seqs_sorted.bam", param=GRanges("Reference_barcodes:54-78"))
stack
# DNAStringSet object of length 5:
# width seq
# [1] 25 TACGGTTATATTGACAGACCGAGGG
# [2] 25 TACGGTTATATTGACAGACCGAGGG
# [3] 25 TACGGTTATATTGACAGACCGAGGG
# [4] 25 TAATTCGAAACCATAAGGCTCGCAA
# [5] 25 TAATTCGAAACCATAAGGCTCGCAA
H.
Thank you all! You are great! I think it works perfectly well. wow @hpages your solution is very very neat!!!!! is there a way to keep the qname as of the sequences as well? Then I ll be 100% able to verify my assumptions. Sorry if I am asking much.
Yes, call stackStringsFromBam()
with use.names=TRUE
. See ?stackStringsFromBam
for more information.
thank you @hpages I appreciate. I learnt a lot
My pleasure. Please ask on our support site for further questions about this. Thanks!
Dear all,
I have a bam file from which I want to extract the exact part of DNA sequences that map within the following region 54-78 to my reference .
I have tried so far unsuccessfully bedtools, samtools, custom scripts and I was wondering if there is a quick way to do it with RSamtools.
Here is a link to my data which is tiny, only 4 sequences, https://1drv.ms/u/s!ArOzUuixE-mggfskZPX_Y0cpyQH4RA?e=hAa0W3 and this is the desired result that I want to take out.
I just want to extract the sequences that map in the positions 54-78 in the genome and my data looks like this TACGGTTATATTGACAGACCGAGGG TAATTCGAAACCATAAGGCTCGCAA TACGGTTATATTGACAGACCGAGGG TAATTCGAAACCATAAGGCTCGCAA TACGGTTATATTGACAGACCGAGGG
I have managed to parse the file with Rsamtools and make it to a data.frame