HenrikBengtsson / aroma.seq

🔬 R package: aroma.seq: High-Throughput Sequence Analysis using the Aroma Framework
https://github.com/HenrikBengtsson/aroma.seq
0 stars 1 forks source link

BamDataFileSet should report on number of unique BAM header target lengths finger prints #7

Closed HenrikBengtsson closed 9 years ago

HenrikBengtsson commented 9 years ago

When printing a BamDataFileSet object, it should report on number of unique BAM header target lengths finger prints in the set, e.g.

> library("aroma.seq")
> bams <- BamDataSet$byName("LG3-test", organism="HomoSapiens")
> ns <- lapply(bams, FUN=getTargetLengths)
> str(ns)
List of 9
 $ A06929_C019VACXX_6       : Named int [1:25] 247249719 135374737
134452384 132349534 114142980 ...
  ..- attr(*, "names")= chr [1:25] "1" "10" ...
 $ A06930_C019VACXX_7       : Named int [1:25] 247249719 135374737
134452384 132349534 114142980 ...
  ..- attr(*, "names")= chr [1:25] "1" "10" ...
 $ A08297_C08J9ACXX_2       : Named int [1:25] 247249719 135374737
134452384 132349534 114142980 ...
  ..- attr(*, "names")= chr [1:25] "1" "10" ...
 $ A15615_C0KVRACXX_4       : Named int [1:25] 247249719 135374737
134452384 132349534 114142980 ...
  ..- attr(*, "names")= chr [1:25] "1" "10" ...
 $ A15619_C0KWJACXX_8       : Named int [1:25] 247249719 135374737
134452384 132349534 114142980 ...
  ..- attr(*, "names")= chr [1:25] "1" "10" ...
 $ IX2393_C3ADYACXX_3_GATCAG: Named int [1:84] 249250621 243199373
198022430 191154276 180915260 ...
  ..- attr(*, "names")= chr [1:84] "1" "2" ...
 $ IX2399_C3EPTACXX_3_TAGCTT: Named int [1:84] 249250621 243199373
198022430 191154276 180915260 ...
  ..- attr(*, "names")= chr [1:84] "1" "2" ...
 $ Z00123                   : Named int [1:93] 249250621 135534747
135006516 40103 133851895 ...
  ..- attr(*, "names")= chr [1:93] "1" "10" ...
 $ Z00125                   : Named int [1:93] 249250621 135534747
135006516 40103 133851895 ...
  ..- attr(*, "names")= chr [1:93] "1" "10" ...

In the above example, there are 3 different sets of target length vectors.

Among other things, this will help warn the user that the BAM files might have been aligned with different genome references.

HenrikBengtsson commented 9 years ago
> library("aroma.seq")
> bams <- BamDataSet$byName("LG3-test", organism="HomoSapiens")
> ns <- lapply(bams, FUN=getTargetLengths)
> uns <- unique(ns)
> str(uns)
List of 3
 $ : Named int [1:25] 247249719 135374737 134452384 132349534 114142980 106368585 100338915 88827254 78774742 76117153 ...
  ..- attr(*, "names")= chr [1:25] "1" "10" "11" "12" ...
 $ : Named int [1:84] 249250621 243199373 198022430 191154276 180915260 171115067 159138663 146364022 141213431 135534747 ...
  ..- attr(*, "names")= chr [1:84] "1" "2" "3" "4" ...
 $ : Named int [1:93] 249250621 135534747 135006516 40103 133851895 115169878 107349540 102531392 90354753 81195210 ...
  ..- attr(*, "names")= chr [1:93] "1" "10" "11" "11_gl000202_random" ...

> types <- match(ns, table=uns)
> types
 [1] 1 1 1 1 1 2 2 3 3 3
> table(types)
types
1 2 3
5 2 3

So, there are n=5 of Type 1, n=2 of Type 2, and n=3 of Type 3.

HenrikBengtsson commented 9 years ago
> library("aroma.seq")
> bams <- BamDataSet$byName("LG3-test", organism="HomoSapiens")
> bams
BamDataSet:
Name: LG3-test
Tags:
Full name: LG3-test
Subpath: HomoSapiens
Number of files: 10
Names: A06929_C019VACXX_6, A06930_C019VACXX_7, A08297_C08J9ACXX_2, ..., Z00125 [10]
Path (to the first file): bamData/LG3-test/HomoSapiens
Total file size: 168.65 GB (181083190174 bytes)
RAM: 0.01MB
Number of unique target sequence sets: 3 [WARNING: unsafe to process if > 1]
Warning message:
In as.character.BamDataSet(x) :
  Processing a BAM data set that consists of (3) BAM files which have been mapped (10)
  different types of target sequences (different FASTA reference files) is unsafe and will
  likely lead to problems that are hard to troubleshoot: bamData/LG3-test/HomoSapiens
HenrikBengtsson commented 9 years ago

Numbers were swapped, now it outputs:

In as.character.BamDataSet(x) :
  Processing a BAM data set that consists of (10) BAM files which have been mapped (3)
  different types of target sequences (different FASTA reference files) is unsafe and will
  likely lead to problems that are hard to troubleshoot: bamData/LG3-test/HomoSapiens