HenrikBengtsson / aroma.seq

🔬 R package: aroma.seq: High-Throughput Sequence Analysis using the Aroma Framework
https://github.com/HenrikBengtsson/aroma.seq
0 stars 1 forks source link

Assert that BwaIndexSet is compatible with FastaReferenceFile #33

Closed HenrikBengtsson closed 8 years ago

HenrikBengtsson commented 8 years ago

It may happen that an incorrect pre-built BWA index set is picked up by buildBwaIndexSet(fa), e.g.

> fa
FastaReferenceFile:
Name: hg19
Tags:
Full name: hg19
Pathname: annotationData/organisms/Homo_sapiens/GRCh37,hg19/UCSC/hg19.fa
File size: 2.98 GiB (3199905909 bytes)
RAM: 0.01 MB
Has index file (*.fai): TRUE
Number of sequence contigs: 93
Sequence names: [93] chr1, chr10, chr11, ..., chrY
Sequence lengths (bp): [93] 249,250,621; 135,534,747; 135,006,516; ...; 59,373,566
Total sequence length (bp): 3,137,161,264
Ordering of sequence contigs (scores): 100% lexicographic, 100% canonical, 94.6% mixeddecimal, 92.5% mixedroman, 3.2% length

> is <- buildBwaIndexSet(fa)
> is
BwaIndexSet:
Name: bwa
Tags:
Full name: bwa
Number of files: 5
Names: hg19, hg19, hg19, hg19, hg19 [5]
Path (to the first file):
annotationData/organisms/Homo_sapiens/GRCh37,hg19/UCSC/bwa
Total file size: 5.11 GiB (5490044678 bytes)
RAM: 0.01MB
Index prefix: annotationData/organisms/Homo_sapiens/GRCh37,hg19/UCSC/bwa/hg19,no_chr
Organism: UCSC
Complete: TRUE

Note the "no_chr" tag. Indeed, the sequence names for the latter has no "chr" prefix;

cclc01{henrik}: tail hg19,no_chr.ann
0 Un_gl000247 (null)
2922402428 36422 0
0 Un_gl000248 (null)
2922438850 39786 0
0 Un_gl000249 (null)
2922478636 38502 0
0 X (null)
2922517138 155270560 23
0 Y (null)
3077787698 59373566 18

Thus, buildIndexSet(fa) should assert than the result is compatible with the FASTA file, i.e. compare sequence names and sequence lengths. For this we need to implement a parse for the *.ann file. We can then also output more info when printing the index set, e.g. (prototype):

> is
BwaIndexSet:
Name: bwa
Tags:
Full name: bwa
Number of files: 5
Names: hg19, hg19, hg19, hg19, hg19 [5]
Path (to the first file):
annotationData/organisms/Homo_sapiens/GRCh37,hg19/UCSC/bwa
Total file size: 5.11 GiB (5490044678 bytes)
RAM: 0.01MB
Index prefix: annotationData/organisms/Homo_sapiens/GRCh37,hg19/UCSC/bwa/hg19,no_chr
Organism: UCSC
Complete: TRUE
Number of sequence contigs: 93
Sequence names: [93] 1, 10, 11, ..., Y
Sequence lengths (bp): [93] 249,250,621; 135,534,747; 135,006,516; ...; 59,373,566
Total sequence length (bp): 3,137,161,264
Ordering of sequence contigs (scores): 100% lexicographic, 100% canonical, 94.6% mixeddecimal, 92.5% mixedroman, 3.2% length

just as for the FASTA file.

HenrikBengtsson commented 8 years ago

Another problem is that if none of the sequence names exists the isCompatibleWithBySeqNames() reports TRUE. This is because is:

    names <- getSeqNames(this, unique = unique)
    namesO <- getSeqNames(other, unique = unique)
    idxs <- match(namesO, names)
    res <- all(diff(idxs) > 0, na.rm = TRUE)
    if (!res) {
        attr(res, "reason") <- "The ordering of sequence names does not match."
    }

all idxs are NA, which makes diff(idxs) > 0 all NA, which makes res == TRUE. Fix this.

HenrikBengtsson commented 8 years ago

Added assertion of compatibility before returning BWA index set. Now we at least get:

> is <- buildBwaIndexSet(fa)
[2016-01-06 17:17:34] Exception: None of the sequence names matches.

  at #06. isCompatibleWithBySeqNames.SequenceContigsInterface(this, other,
              ...)
          - isCompatibleWithBySeqNames.SequenceContigsInterface() is in environment 'aroma.seq'

  at #05. isCompatibleWithBySeqNames(this, other, ...)
          - isCompatibleWithBySeqNames() is in environment 'aroma.seq'

  at #04. isCompatibleWith.FastaReferenceFile(this, res, mustWork = TRUE,
              verbose = less(verbose, 50))
          - isCompatibleWith.FastaReferenceFile() is in environment 'aroma.seq'

  at #03. isCompatibleWith(this, res, mustWork = TRUE, verbose = less(verbose,
              50))
          - isCompatibleWith() is in environment 'aroma.core'

  at #02. buildBwaIndexSet.FastaReferenceFile(fa)
          - buildBwaIndexSet.FastaReferenceFile() is in environment 'aroma.seq'

  at #01. buildBwaIndexSet(fa)
          - buildBwaIndexSet() is in environment 'aroma.seq'

Error: None of the sequence names matches.

Next step is to have it identify the other BWA index set, iff it exists, or otherwise generate a compatible one.