Is there a way to extract part of the sequence that maps to a specific area in the reference!

loukesio commented 2 years ago

Dear all,

I have a bam file from which I want to extract the exact part of DNA sequences that map within the following region 54-78 to my reference .

I have tried so far unsuccessfully bedtools, samtools, custom scripts and I was wondering if there is a quick way to do it with RSamtools.

Here is a link to my data which is tiny, only 4 sequences, https://1drv.ms/u/s!ArOzUuixE-mggfskZPX_Y0cpyQH4RA?e=hAa0W3 and this is the desired result that I want to take out.

I just want to extract the sequences that map in the positions 54-78 in the genome and my data looks like this TACGGTTATATTGACAGACCGAGGG TAATTCGAAACCATAAGGCTCGCAA TACGGTTATATTGACAGACCGAGGG TAATTCGAAACCATAAGGCTCGCAA TACGGTTATATTGACAGACCGAGGG

I have managed to parse the file with Rsamtools and make it to a data.frame

bam <- scanBam("test_4_seqs_sorted.bam")

#names of the BAM fields
names(bam[[1]])

#distribution of BAM flags
table(bam[[1]]$flag)

.unlist <- function (x){
  ## do.call(c, ...) coerces factor to integer, which is undesired
  x1 <- x[[1L]]
  if (is.factor(x1)){
    structure(unlist(x), class = "factor", levels = levels(x1))
  } else {
    do.call(c, x)
  }
}

#store names of BAM fields
bam_field <- names(bam[[1]])

#go through each BAM field and unlist
list <- lapply(bam_field, function(y) .unlist(lapply(bam, "[[", y)))

#store as data frame
bam_df <- do.call("DataFrame", list)
names(bam_df) <- bam_field

bam_df

LiNk-NY commented 2 years ago

Hi Loukas, @loukesio

It's probably better to ask on the support.bioconductor.org website. Have you looked at the DNAStringSet object from RSamtools?

suppressPackageStartupMessages(library(Rsamtools))

bamfile <- "~/Downloads/OneDrive-2022-03-09/test_4_seqs_sorted.bam"
bai <- "~/Downloads/OneDrive-2022-03-09/test_4_seqs_sorted.bam.bai"

bam <- BamFile(file = bamfile, index = bai, yieldSize = 20)

scanBam(bam, param = ScanBamParam(what = scanBamWhat()))
#> [[1]]
#> [[1]]$qname
#> [1] "NS500351:447:HKHV5AFX2:1:11209:10247:15355"
#> [2] "NS500351:447:HKHV5AFX2:1:21208:20038:9365" 
#> [3] "NS500351:447:HKHV5AFX2:3:11407:16829:9372" 
#> [4] "NS500351:447:HKHV5AFX2:4:11503:24882:5291" 
#> [5] "NS500351:447:HKHV5AFX2:4:21605:5592:11298" 
#> 
#> [[1]]$flag
#> [1] 16 16 16 16 16
#> 
#> [[1]]$rname
#> [1] Reference_barcodes Reference_barcodes Reference_barcodes Reference_barcodes
#> [5] Reference_barcodes
#> Levels: Reference_barcodes
#> 
#> [[1]]$strand
#> [1] - - - - -
#> Levels: + - *
#> 
#> [[1]]$pos
#> [1] 1 1 1 1 1
#> 
#> [[1]]$qwidth
#> [1] 292 269 276 288 277
#> 
#> [[1]]$mapq
#> [1] 254 254 254 254 254
#> 
#> [[1]]$cigar
#> [1] "119I16M1D155M2I" "95I172M2I"       "104I172M"        "116I172M"       
#> [5] "104I172M1I"     
#> 
#> [[1]]$mrnm
#> [1] <NA> <NA> <NA> <NA> <NA>
#> Levels: Reference_barcodes
#> 
#> [[1]]$mpos
#> [1] NA NA NA NA NA
#> 
#> [[1]]$isize
#> [1] 0 0 0 0 0
#> 
#> [[1]]$seq
#> DNAStringSet object of length 5:
#>     width seq
#> [1]   292 GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG...GTATTAGCTTACGACGCTACACCGATGCAACCA
#> [2]   269 GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG...GTATTAGCTTACGACGCTACACCGATCCTTGAT
#> [3]   276 GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG...TCGTATTAGCTTACGACGCTACACCACGGCACT
#> [4]   288 GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG...TCGTATTAGCTTACGACGCTACACCTGAGGCCT
#> [5]   277 GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG...CGTATTAGCTTACGACGCTACACCTGAGTTCTT
#> 
#> [[1]]$qual
#> PhredQuality object of length 5:
#>     width seq
#> [1]   292                                   ...                                 
#> [2]   269                                   ...                                 
#> [3]   276                                   ...                                 
#> [4]   288                                   ...                                 
#> [5]   277                                   ...
seqs <- scanBam(bam, param = ScanBamParam(what = "seq"))[[1]][[1]]
seqs
#> DNAStringSet object of length 5:
#>     width seq
#> [1]   292 GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG...GTATTAGCTTACGACGCTACACCGATGCAACCA
#> [2]   269 GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG...GTATTAGCTTACGACGCTACACCGATCCTTGAT
#> [3]   276 GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG...TCGTATTAGCTTACGACGCTACACCACGGCACT
#> [4]   288 GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG...TCGTATTAGCTTACGACGCTACACCTGAGGCCT
#> [5]   277 GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG...CGTATTAGCTTACGACGCTACACCTGAGTTCTT

^{Created on 2022-03-09 by the reprex package (v2.0.1)}

I'm not a biologist so I don't really understand your question.

FWIW, you can use, for example,

vmatchPattern("TACGGTTATATTGACAGACCGAGGG", seqs)

to get the locations of this pattern found in the sequences and

subseq(seqs, 54, 78)

to get positions in the observed data.

CC: @hpages Hervé is the expert and maintainer of Biostrings and he can definitely answer the question if there's enough information.

Best, Marcel

mtmorgan commented 2 years ago

Agree with Marcel that this belongs on the support site (and that Hervé is the expert!), but I did

> library(GenomicAlignments)
> 
> ## region of interest
> roi = GRanges("Reference_barcodes:54-78")
> param = ScanBamParam(which = roi, what = "seq")
>
> ## read sequences overlapping region of interest
> aln = readGAlignments("~/Downloads/test_4_seqs_sorted.bam", param = param)
> aln
GAlignments object with 5 alignments and 2 metadata columns:
                seqnames strand           cigar    qwidth     start       end
                   <Rle>  <Rle>     <character> <integer> <integer> <integer>
  [1] Reference_barcodes      - 119I16M1D155M2I       292         1       172
  [2] Reference_barcodes      -       95I172M2I       269         1       172
  [3] Reference_barcodes      -        104I172M       276         1       172
  [4] Reference_barcodes      -        116I172M       288         1       172
  [5] Reference_barcodes      -      104I172M1I       277         1       172
          width     njunc |                     seq                  qname
      <integer> <integer> |          <DNAStringSet>            <character>
  [1]       172         0 | GGGGGGGGGG...GATGCAACCA NS500351:447:HKHV5AF..
  [2]       172         0 | GGGGGGGGGG...GATCCTTGAT NS500351:447:HKHV5AF..
  [3]       172         0 | GGGGGGGGGG...CCACGGCACT NS500351:447:HKHV5AF..
  [4]       172         0 | GGGGGGGGGG...CCTGAGGCCT NS500351:447:HKHV5AF..
  [5]       172         0 | GGGGGGGGGG...CTGAGTTCTT NS500351:447:HKHV5AF..
  -------
  seqinfo: 1 sequence from an unspecified genome
> names(aln) = mcols(aln)$qname
>
> ## 'map' region of interest to cigar; the 'ranges' represent coordinates on the read sequences
> map = mapToAlignments(roi, aln)
> map
GRanges object with 5 ranges and 2 metadata columns:
                    seqnames    ranges strand |     xHits alignmentsHits
                       <Rle> <IRanges>  <Rle> | <integer>      <integer>
  [1] NS500351:447:HKHV5AF..   172-196      * |         1              1
  [2] NS500351:447:HKHV5AF..   149-173      * |         1              2
  [3] NS500351:447:HKHV5AF..   158-182      * |         1              3
  [4] NS500351:447:HKHV5AF..   170-194      * |         1              4
  [5] NS500351:447:HKHV5AF..   158-182      * |         1              5
  -------
  seqinfo: 5 sequences from an unspecified genome; no seqlengths
>
> ## extract the subsequences of the aligned read
> subseq(mcols(aln)$seq, start(map), end(map))
DNAStringSet object of length 5:
    width seq
[1]    25 TACGGTTATATTGACAGACCGAGGG
[2]    25 TACGGTTATATTGACAGACCGAGGG
[3]    25 TACGGTTATATTGACAGACCGAGGG
[4]    25 TAATTCGAAACCATAAGGCTCGCAA
[5]    25 TAATTCGAAACCATAAGGCTCGCAA

These almost but do not quite match your expectation, so I have done something wrong...

LiNk-NY commented 2 years ago

Thank you Martin!

What is which in this example? Is it roi with a different name? reprex::reprex may have caught that as well :)

mtmorgan commented 2 years ago

Not if I'd edited the reproducible example after posting into github! which == roi; I tried to update

hpages commented 2 years ago

FWIW I get the same thing as Martin:

library(GenomicAlignments)
stack <- stackStringsFromBam("test_4_seqs_sorted.bam", param=GRanges("Reference_barcodes:54-78"))
stack
# DNAStringSet object of length 5:
#     width seq
# [1]    25 TACGGTTATATTGACAGACCGAGGG
# [2]    25 TACGGTTATATTGACAGACCGAGGG
# [3]    25 TACGGTTATATTGACAGACCGAGGG
# [4]    25 TAATTCGAAACCATAAGGCTCGCAA
# [5]    25 TAATTCGAAACCATAAGGCTCGCAA

H.

loukesio commented 2 years ago

Thank you all! You are great! I think it works perfectly well. wow @hpages your solution is very very neat!!!!! is there a way to keep the qname as of the sequences as well? Then I ll be 100% able to verify my assumptions. Sorry if I am asking much.

hpages commented 2 years ago

Yes, call stackStringsFromBam() with use.names=TRUE. See ?stackStringsFromBam for more information.

loukesio commented 2 years ago

thank you @hpages I appreciate. I learnt a lot

hpages commented 2 years ago

My pleasure. Please ask on our support site for further questions about this. Thanks!

Bioconductor / Rsamtools

Is there a way to extract part of the sequence that maps to a specific area in the reference! #31