HenrikBengtsson / aroma.seq

🔬 R package: aroma.seq: High-Throughput Sequence Analysis using the Aroma Framework
https://github.com/HenrikBengtsson/aroma.seq

FASTA stats: Report on length and MD5 checksum per sequence #38

Open HenrikBengtsson opened 8 years ago

HenrikBengtsson commented 8 years ago

Similar to a FASTA index file, which reports the name and length of each sequence, create an enhanced FASTA summary format that also reports an MD5 checksum per sequence. The checksum should be computed on the nucleotide letters after converting them to uppercase.

This will make it easier to compare FASTA files.
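
To make the proposal concrete, here is a minimal sketch in Python using only the standard library. It is not the aroma.seq or Biostrings implementation. The function name `fasta_stats` is hypothetical, and the sketch assumes that the sequence name is the first whitespace-delimited token of the header line and that line breaks are not part of the sequence.

```python
import hashlib
import io


def fasta_stats(handle):
    """Yield (name, length, md5_hex) for each sequence in a FASTA text stream.

    Assumptions (not taken from the issue): the name is the first
    whitespace-delimited token after '>', and newlines are excluded
    from both the length and the checksum.
    """
    name, length, md5 = None, 0, None
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if name is not None:
                yield name, length, md5.hexdigest()
            parts = line[1:].split()
            name = parts[0] if parts else ""
            length, md5 = 0, hashlib.md5()
        elif name is not None:
            # Checksum is computed on uppercase letters, as proposed above.
            seq = line.upper()
            length += len(seq)
            md5.update(seq.encode("ascii"))
    if name is not None:
        yield name, length, md5.hexdigest()


# Two toy sequences that differ only in letter case and line wrapping.
demo = io.StringIO(">seq1 first\nacgt\nACGT\n>seq2\nACGTACGT\n")
rows = list(fasta_stats(demo))
print("name\tlength\tchecksum")
for name, length, checksum in rows:
    print(f"{name}\t{length}\t{checksum}")
```

Because the checksum is updated line by line, the whole chromosome never has to be held in memory. The two toy sequences get the same checksum, which is the point of normalizing to uppercase before hashing.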

Example

name    length          checksum
chr1    249250621       4b5630ee914e848e8d07221556b0a2fb
chr2    243199373       c01f179e4b57ab8bd9de309e6d576c48
chr3    198022430       11946e7a3ed5e1776e81c0f0ecd383d0
chr4    191154276       234a2a5581872457b9fe1187d1616b13
chr5    180915260       dd4ad37ee474732a009111e3456e7ed7
chr6    171115067       25e6a154090e35101d7678d6f034353a
chr7    159138663       7c7124efff5c7039a1b1e7cba65c5379
chr8    146364022       9d08099943f8627959cfb8ecee0d2f5d
chr9    141213431       8eaca7c9b35d05ab15c9125bc92372fa
chr10   135534747       71db8a6cad03244e6e50f0ad8bc95a65
chr11   135006516       8f3571abef23f6aca0f7b8666a74e7e0
chr12   133851895       fa5a4df7ac0f9782037da890557fd8b8
chr13   115169878       8ae1ac7bdf62dca7c19b427a9153445c
chr14   107349540       06cd248dd1409b804444bd9ad5533d1d
chr15   102531392       e03a89536262b6a0e2beabd90a841c43
chr16   90354753        7eeda5fe3e5d82c2168536f9459170dd
chr17   81195210        20e70f4a08bdc6a54e53ad0a7d1498b6
chr18   78077248        2a099397e2d2dd0f2a2e5a5b4234867d
chr19   59128983        e6e598642c5fbbfb7d922dbfcec86ed8
chr20   63025520        be3c152f6f6bcd5f85f9e4cba49b1e48
chr21   48129895        40755f30599581bfb1186f077db8f580
chr22   51304566        b225fb1495dbb5e5dfb3e327dceb7ab2
chrX    155270560       1b0ed73227e2e7826da63b2b356975e0
chrY    59373566        f92eebebcfea9ebd99a68de2cb409133
chrM    16571           5462ea21bef8d27d5a0ea4da35939549
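
As a sketch of how such summaries could be compared, the Python snippet below matches sequences between two tables by checksum rather than by name. The helper names `read_summary` and `match_sequences` are hypothetical. The first table reuses three rows from the example above. The second table is made up for illustration: it uses different sequence names, and its `MT` row has a placeholder length and checksum.

```python
def read_summary(text):
    """Parse a whitespace-delimited 'name length checksum' table.

    Returns a dict keyed by checksum. Assumption: checksums are unique
    within a file; identical sequences would overwrite each other here.
    """
    by_checksum = {}
    lines = text.strip().splitlines()
    for line in lines[1:]:  # skip the header row
        name, length, checksum = line.split()
        by_checksum[checksum] = (name, int(length))
    return by_checksum


def match_sequences(summary_a, summary_b):
    """Return (shared name pairs, names only in A, names only in B)."""
    a, b = read_summary(summary_a), read_summary(summary_b)
    shared = sorted((a[c][0], b[c][0]) for c in a.keys() & b.keys())
    only_a = sorted(a[c][0] for c in a.keys() - b.keys())
    only_b = sorted(b[c][0] for c in b.keys() - a.keys())
    return shared, only_a, only_b


# Rows copied from the example table above.
summary_a = """
name    length          checksum
chr21   48129895        40755f30599581bfb1186f077db8f580
chr22   51304566        b225fb1495dbb5e5dfb3e327dceb7ab2
chrM    16571           5462ea21bef8d27d5a0ea4da35939549
"""

# Hypothetical second file: other names, and a placeholder MT row.
summary_b = """
name    length          checksum
21      48129895        40755f30599581bfb1186f077db8f580
22      51304566        b225fb1495dbb5e5dfb3e327dceb7ab2
MT      1               00000000000000000000000000000000
"""

shared, only_a, only_b = match_sequences(summary_a, summary_b)
print("same sequence:", shared)
print("only in A:", only_a)
print("only in B:", only_b)
```

Matching on the checksum means that sequences with identical content are paired even when the two files name them differently, while a sequence whose content differs shows up as unmatched on both sides.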
HenrikBengtsson commented 7 years ago

For the record, I brought this up on the Bioconductor support site in March 2016 (https://support.bioconductor.org/p/79456/), where Herve replied that it was a useful idea and that it would make sense to add this to Biostrings.