HenrikBengtsson / aroma.seq

🔬 R package: aroma.seq: High-Throughput Sequence Analysis using the Aroma Framework
https://github.com/HenrikBengtsson/aroma.seq

Make FastaReferenceFile, BamDataFile, ... more informative on sequences/targets #9

Closed. HenrikBengtsson closed this issue 9 years ago

HenrikBengtsson commented 9 years ago

With some automatic forensic summaries of the FASTA sequences (their names and lengths), one can provide more informative output, which is useful for troubleshooting, backtracing, etc., e.g.

FastaReferenceFile:
Name: hg19
Tags: no_chr
Full name: hg19,no_chr
Pathname: annotationData/organisms/HomoSapiens/UCSC/hg19/hg19,no_chr.fa
File size: 2.98 GB (3199905630 bytes)
RAM: 0.01 MB
Total sequence length: 3,137,161,264
Number of sequences: 93
Sequence names: [93] 1, 10, 11, ..., Y
Sequence order (on file): by_lexicographic: TRUE, by_mixedsort: FALSE, by_length: FALSE

The same applies to SAM, BAM, and all other types of data that have sequences/chromosomes/targets.

Using the correct "Sequence order (on file)" can be important, cf. Issue #2.
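
As a rough illustration of what the three "Sequence order (on file)" flags above could mean, here is a minimal base-R sketch. The helper names (`mixed_order()`, `is_lexicographic()`, `is_mixedsort()`, `is_by_length()`) are hypothetical and not part of aroma.seq, the lengths are made-up toy values, and "by length" is assumed to mean longest sequence first.

```r
# Hypothetical helpers for illustration; these are NOT aroma.seq functions.

# Order that puts purely numeric names first (by numeric value), followed by
# the remaining names in C-locale alphabetical order.
mixed_order <- function(x) {
  is_num <- grepl("^[0-9]+$", x)
  num <- rep(0, length(x))
  num[is_num] <- as.numeric(x[is_num])
  order(!is_num, num, x, method = "radix")
}

# TRUE if the ordering permutation 'o' leaves every element where it is
in_order <- function(o) identical(o, seq_along(o))

# method = "radix" sorts character vectors in the C locale
is_lexicographic <- function(x) in_order(order(x, method = "radix"))
is_mixedsort     <- function(x) in_order(mixed_order(x))
# Assumption: "by length" means longest sequence first
is_by_length     <- function(len) in_order(order(len, decreasing = TRUE))

# Toy data: names as in a lexicographically sorted file; lengths are made up
seq_names   <- c("1", "10", "11", "2", "X", "Y")
seq_lengths <- c(249, 135, 134, 243, 155, 59)

flags <- c(
  by_lexicographic = is_lexicographic(seq_names),
  by_mixedsort     = is_mixedsort(seq_names),
  by_length        = is_by_length(seq_lengths)
)
print(flags)
# by_lexicographic: TRUE, by_mixedsort: FALSE, by_length: FALSE
```

A file whose names are lexicographically sorted ("1", "10", "11", "2", ...) fails the mixed-sort check, which is the kind of mismatch such a summary is meant to expose.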

HenrikBengtsson commented 9 years ago

Done.

Examples on BAMs

BamDataFile:
Name: Z00124
Tags:
Full name: Z00124
Pathname: bamData/LG3-test/HomoSapiens/Z00124.bam
File size: 7.32 GB (7863174908 bytes)
RAM: 0.00 MB
Has index file (*.bai): TRUE
Is sorted: TRUE
Number of targets: 93
Total target length: 3.14 Gb (3,137,161,264 bases)
Targets: [93] 1, 10, 11, ..., Y
Ordering of target names (scores): 100% lexicograpic (94.6% mixedsort, 72.8% classical,
 2.2% length)
Number of mapped reads: 140,810,753 (98.9%) out of 142,401,198
Number of unmapped reads:   1,590,445 (1.1%) out of 142,401,198
Generated by: (1) bwa v0.7.12-r1039

BamDataFile:
Name: IX2393_C3ADYACXX_3_GATCAG
Tags:
Full name: IX2393_C3ADYACXX_3_GATCAG
Pathname: bamData/LG3-test/HomoSapiens/IX2393_C3ADYACXX_3_GATCAG.bam
File size: 8.93 GB (9589841168 bytes)
RAM: 0.00 MB
Has index file (*.bai): TRUE
Is sorted: TRUE
Number of targets: 84
Total target length: 3.1 Gb (3,101,804,739 bases)
Targets: [84] 1, 2, 3, ..., GL000192.1
Ordering of target names (scores): 100% classical (69.9% length, 28.9% mixedsort,
 24.1% lexicograpic)
Number of mapped reads: 158,534,993 (96.9%) out of 163,667,372
Number of unmapped reads:   5,132,379 (3.1%) out of 163,667,372
Generated by: (1) 0 v0.5.7

Examples on FASTA

FastaReferenceFile:
Name: Homo_sapiens.GRCh38.dna.chr=1-MT
Tags:
Full name: Homo_sapiens.GRCh38.dna.chr=1-MT
Pathname: annotationData/organisms/HomoSapiens/Homo_sapiens.GRCh38.dna.chr=1-MT.fa
File size: 2.92 GB (3139759277 bytes)
RAM: 0.00 MB
Has index file (*.bai): TRUE
Total sequence length: 3,088,286,401
Number of sequences: 25
Sequence names: [25] 1, 2, 3, ..., MT
Ordering of sequence names (scores): 100% classical (91.7% mixedsort, 75% lexicograpic,
 12.5% length)

Example of GcBaseFile

GcBaseFile:
Name: UCSC
Tags:
Full name: UCSC
Pathname: annotationData/organisms/HomoSapiens/UCSC/hg19/hg19.gc50Base.no_chr.txt.gz
File size: 174.53 MB (183006633 bytes)
RAM: 0.00 MB
Number of sequence contigs: 93
Sequence names: [93] 1, 10, 11, ..., Y
Ordering of sequence contigs (scores): 100% lexicograpic, 94.6% mixedsort, 73.1% classical

Example of GTF

GtfDataFile:
Name: Mus_musculus.GRCm38.79
Tags:
Full name: Mus_musculus.GRCm38.79
Pathname: annotationData/organisms/MusMusculus/Mus_musculus.GRCm38.79.gtf
File size: 638.43 MB (669443027 bytes)
RAM: 0.00 MB
Number of data rows: NA
Columns [NA]: <not reading column names>
Number of text lines: NA
Number of sequence contigs: 1482011
Unique sequence names: [61] 1, 2, X, ..., JH584295.1
Ordering of unique sequence contigs (scores): 75.4% classical, 16.4% mixedsort,
 14.8% lexicograpic

HenrikBengtsson commented 9 years ago

FYI, @drisso: mixed ordering with Roman numerals has been added to the to-do list.

HenrikBengtsson commented 9 years ago

Roman numerals are now supported as well, e.g.

> fa
FastaReferenceFile:
Name: Mus_musculus.GRCm38.dna.chr=1-MT
Tags:
Full name: Mus_musculus.GRCm38.dna.chr=1-MT
Pathname: annotationData/organisms/Mus_musculus/Mus_musculus.GRCm38.dna.chr=1-MT.fa
File size: 2.58 GB (2770964555 bytes)
RAM: 0.00 MB
Has index file (*.bai): TRUE
Number of sequence contigs: 22
Sequence names: [22] 1, 2, 3, ..., MT
Sequence lengths (bp): [22] 195,471,971; 182,113,224; 160,039,680; ...; 16,299
Total sequence length (bp): 2,725,537,669
Ordering of sequence contigs (scores): 100% canonical, 90.9% mixeddecimal,
81.8% lexicographic, 77.3% mixedroman, 9.1% length
> getSeqNames(fa)
 [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14" "15"
[16] "16" "17" "18" "19" "X"  "Y"  "MT"
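
To illustrate what an ordering by Roman numerals involves, here is a minimal base-R sketch built on `utils::as.roman()`. The helper `mixedroman_order()` and the toy chromosome names are hypothetical. This is not how aroma.seq computes its "mixedroman" score, and it does not attempt to reproduce the percentages shown above.

```r
# Hypothetical helper for illustration; this is NOT an aroma.seq function.
# Orders names made up solely of Roman-numeral letters by their numeric value
# and places all other names after them, alphabetically (C locale).
mixedroman_order <- function(x) {
  value <- rep(NA_integer_, length(x))
  looks_roman <- grepl("^[IVXLCDM]+$", x)
  value[looks_roman] <- suppressWarnings(
    as.integer(utils::as.roman(x[looks_roman]))
  )
  # Names that did not parse as Roman numerals are sorted after those that did
  is_roman <- !is.na(value)
  value[!is_roman] <- 0L
  order(!is_roman, value, x, method = "radix")
}

# Toy chromosome names, given in lexicographic order
chr <- c("I", "II", "III", "IV", "IX", "V", "X", "Mito")
print(chr[mixedroman_order(chr)])
# "I" "II" "III" "IV" "V" "IX" "X" "Mito"
```

Note that `as.roman()` reads "X" as 10, so a name such as chromosome "X" is ambiguous under a Roman-numeral ordering.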