HenrikBengtsson / aroma.seq

🔬 R package: aroma.seq: High-Throughput Sequence Analysis using the Aroma Framework
https://github.com/HenrikBengtsson/aroma.seq

Add protection for non-matching target name formats, e.g. 'chrX' vs 'X' #2

Open HenrikBengtsson opened 9 years ago

HenrikBengtsson commented 9 years ago

It's such a common mistake to run a pipeline where different annotation files (e.g. FASTA and GC content) use different conventions for naming targets, e.g. the FASTA, and hence the aligned BAM files, use 1, 2, 3, ..., X, whereas a GC content data file contains chr1, chr2, ..., chrX.

TO DO: Add protection/validation against using non-matching files such that an informative exception is thrown before starting.
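
A minimal sketch of what such a check could look like, in base R. The function name `assertMatchingTargetNames()`, its arguments, and the example name vectors are hypothetical and not part of aroma.seq; it only assumes that the target names of each file are already available as character vectors, and it only looks for the 'chr' prefix difference:

```r
# Hypothetical helper, not part of aroma.seq: stop early if two sets of
# target (sequence) names do not match.
assertMatchingTargetNames <- function(namesA, namesB, labelA = "A", labelB = "B") {
  # Same set of names (in any order): nothing to complain about here
  if (setequal(namesA, namesB)) return(invisible(TRUE))

  onlyA <- setdiff(namesA, namesB)
  onlyB <- setdiff(namesB, namesA)
  msg <- sprintf(
    "Non-matching target names. Only in %s: [%s]. Only in %s: [%s].",
    labelA, paste(onlyA, collapse = ", "),
    labelB, paste(onlyB, collapse = ", ")
  )

  # Assumption: the only convention difference checked for is a leading "chr"
  if (setequal(sub("^chr", "", namesA), sub("^chr", "", namesB))) {
    msg <- paste(msg, "The names agree once a leading 'chr' is dropped,",
                 "so the files likely use different naming conventions.")
  }
  stop(msg, call. = FALSE)
}

# Illustrative names only, mimicking the FASTA vs GC content case above
fastaNames <- c("1", "2", "3", "X")
gcNames <- c("chr1", "chr2", "chr3", "chrX")
res <- tryCatch(
  assertMatchingTargetNames(fastaNames, gcNames, "FASTA", "GC content"),
  error = function(e) conditionMessage(e)
)
```

Here `res` holds the error message instead of aborting the session; in a pipeline the call would be made without `tryCatch()` so that the run stops before any processing starts.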

Also, there are tools that assume the same order of chromosome names (e.g. sequenza), but not all annotation files use the same ordering, e.g. some order lexicographically, some alphanumerically, and yet others by chromosome length.
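
To illustrate how the orderings can disagree, here is a rough base R sketch. Both `naturalOrder()` and `hasSameTargetOrder()` are hypothetical helpers, not aroma.seq functions, and "alphanumeric" is assumed here to mean numeric names sorted by value followed by the non-numeric ones:

```r
# Hypothetical helper: "alphanumeric" ordering of chromosome names, i.e.
# numeric names by value first, then the remaining names alphabetically.
naturalOrder <- function(names) {
  core <- sub("^chr", "", names)
  num <- suppressWarnings(as.integer(core))  # non-numeric names become NA
  key <- ifelse(is.na(num), Inf, num)        # push non-numeric names last
  order(key, core)
}

# Hypothetical helper: TRUE if the names shared by both vectors appear in
# the same relative order in each of them.
hasSameTargetOrder <- function(namesA, namesB) {
  common <- intersect(namesA, namesB)
  identical(namesA[namesA %in% common], namesB[namesB %in% common])
}

lexicographic <- c("1", "10", "2", "X")
alphanumeric <- lexicographic[naturalOrder(lexicographic)]
sameOrder <- hasSameTargetOrder(lexicographic, alphanumeric)
```

With these inputs `alphanumeric` places "2" before "10", so `sameOrder` is `FALSE` even though both vectors contain exactly the same names.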

HenrikBengtsson commented 9 years ago

So, there was already isCompatibleWith(), which utilizes the internal isCompatibleWithBySeqNames(), e.g.

> isCompatibleWith(fa, bam)
[1] TRUE

HenrikBengtsson commented 9 years ago

Now

> isCompatibleWith(bam, fa)
[1] TRUE

also works.

HenrikBengtsson commented 9 years ago

Added Issue #9 to output information on how targets/sequences are sorted.

HenrikBengtsson commented 9 years ago

Asserting that reads and reference use compatible targets/seqs is very helpful. I found that, for instance, GATK is doing a similar thing, e.g. "ERROR MESSAGE: Input files reads and reference have incompatible contigs: Order of contigs differences, which is unsafe." (https://www.biostars.org/p/8212/)