HenrikBengtsson / aroma.seq

🔬 R package: aroma.seq: High-Throughput Sequence Analysis using the Aroma Framework
https://github.com/HenrikBengtsson/aroma.seq

Add FastaReferenceIndexFile (*.fai) #6

Closed HenrikBengtsson closed 9 years ago

HenrikBengtsson commented 9 years ago

Add FastaReferenceIndexFile class for *.fai FASTA index files. They are short tabular text files, e.g.

$ cat annotationData/organisms/HomoSapiens/Homo_sapiens.GRCh37.73.dna.fa.fai
1       249250621       56      60      61
2       243199373       253404911       60      61
3       198022430       500657663       60      61
4       191154276       701980523       60      61
5       180915260       896320760       60      61
6       171115067       1080251331      60      61
7       159138663       1254218372      60      61
8       146364022       1416009403      60      61
9       141213431       1564812882      60      61
10      135534747       1708379929      60      61
11      135006516       1846173647      60      61
12      133851895       1983430330      60      61
13      115169878       2119513148      60      61
14      107349540       2236602582      60      61
15      102531392       2345741339      60      61
16      90354753        2449981645      60      61
17      81195210        2541842368      60      61
18      78077248        2624390889      60      61
19      59128983        2703769482      60      61
20      63025520        2763884006      60      61
21      48129895        2827960009      60      61
22      51304566        2876892126      60      61
X       155270560       2929051825      60      61
Y       59373566        3086910289      60      61
MT      16569   3147273469      60      61

I don't know of a formal reference for the file format, but the columns appear to be as follows (the column names are mine):

  1. sequence: the name of the sequence
  2. length: the length of the sequence, in bases
  3. fileOffset: the byte offset of the sequence's first base in the FASTA file
  4. lengthPerEntry: the number of bases in each FASTA line
  5. bytesPerEntry: the number of bytes in each FASTA line, including the line terminator (hence 61 versus 60 above)
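
To make the five columns concrete, here is a minimal Python sketch of reading such an index. It is not the aroma.seq implementation (which would be an R class); `FaiRecord` and `parse_fai` are hypothetical names, and the field names simply mirror my column names above. It assumes whitespace-separated fields and sequence names without embedded whitespace.

```python
from typing import List, NamedTuple


class FaiRecord(NamedTuple):
    """One line of a *.fai index (field names are illustrative, not official)."""
    sequence: str          # name of the sequence
    length: int            # number of bases in the sequence
    file_offset: int       # byte offset of the first base in the FASTA file
    bases_per_line: int    # bases on each full FASTA line
    bytes_per_line: int    # bytes on each full FASTA line, incl. line terminator


def parse_fai(text: str) -> List[FaiRecord]:
    """Parse the contents of a *.fai file into a list of records."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split on any whitespace so both tab-separated files and
        # space-expanded listings (as printed above) are accepted.
        name, length, offset, n_bases, n_bytes = line.split()[:5]
        records.append(
            FaiRecord(name, int(length), int(offset), int(n_bases), int(n_bytes))
        )
    return records


if __name__ == "__main__":
    example = "1\t249250621\t56\t60\t61\nMT\t16569\t3147273469\t60\t61\n"
    for rec in parse_fai(example):
        print(rec)
```
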
HenrikBengtsson commented 9 years ago

Another reference is http://gatkforums.broadinstitute.org/discussion/1601/how-can-i-prepare-a-fasta-file-to-use-as-reference, which says: "a text file with one record per line for each of the fasta contigs. Each record is of the: contig, size, location, basesPerLine, bytesPerLine"

HenrikBengtsson commented 9 years ago

Note: *.fai files only work with uncompressed FASTA files.
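
The likely reason is that the index stores byte offsets into the plain-text file, so a reader can seek straight to a base, which is not possible in an ordinary gzip stream. A minimal Python sketch of that offset arithmetic, assuming the column interpretation above (`base_byte_offset` is a hypothetical helper, and the toy FASTA is made up for illustration):

```python
def base_byte_offset(file_offset: int, bases_per_line: int,
                     bytes_per_line: int, position: int) -> int:
    """Byte offset in the FASTA file of a 1-based base position in a sequence.

    Assumes every line of the sequence except possibly the last has exactly
    `bases_per_line` bases and occupies `bytes_per_line` bytes.
    """
    if position < 1:
        raise ValueError("position must be >= 1")
    pos0 = position - 1
    # Number of complete lines before the base, and its column on its line.
    full_lines, within_line = divmod(pos0, bases_per_line)
    return file_offset + full_lines * bytes_per_line + within_line


if __name__ == "__main__":
    # Toy FASTA: 4 bases per line, 5 bytes per line (4 bases + "\n").
    # Its index entry would be: seq1, length 10, offset 6, 4, 5.
    fasta = b">seq1\nACGT\nTTGA\nCC\n"
    offset = base_byte_offset(6, 4, 5, position=6)
    print(offset, fasta[offset:offset + 1])
```

With a real file, one would `seek()` to the returned offset and read from there; the same seek has no meaning in a gzip-compressed copy, where byte positions no longer correspond to positions in the text.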