HenrikBengtsson / aroma.seq

🔬 R package: aroma.seq: High-Throughput Sequence Analysis using the Aroma Framework
https://github.com/HenrikBengtsson/aroma.seq

Is genome assembly important enough to become a first-class citizen? #20

Open HenrikBengtsson opened 9 years ago

HenrikBengtsson commented 9 years ago

When it comes to annotation data, in addition to a formal organism label, should the genome assembly label become a first-class citizen, e.g.

annotationData/organisms/Homo_sapiens/GRCh36/
annotationData/organisms/Homo_sapiens/GRCh37/
annotationData/organisms/Homo_sapiens/GRCh38/
annotationData/organisms/Mus_musculus/GRCm37/
annotationData/organisms/Mus_musculus/GRCm38/

? Then one could look up annotation data as:

fa <- FastaReferenceFile(organism="Homo_sapiens", assembly="GRCh38")

Note that it should be allowed to have tags in assembly directory names, e.g.

annotationData/organisms/Homo_sapiens/GRCh37,hg19/

and still have the above lookup find it. It's only the GRC label that needs to be unique.
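
A rough sketch of how such a tag-tolerant lookup could work. Note that `findAssemblyDir()` is a hypothetical helper written for illustration, not an existing aroma.seq function, and it assumes the assembly label is always the first comma-separated component of the directory name:

```r
# Hypothetical helper (not part of aroma.seq): locate the assembly directory
# for an organism, ignoring any comma-separated tags after the label,
# e.g. "GRCh37" matches the directory "GRCh37,hg19".
findAssemblyDir <- function(organism, assembly,
                            root = "annotationData/organisms") {
  orgPath <- file.path(root, organism)
  dirs <- list.dirs(orgPath, full.names = FALSE, recursive = FALSE)
  # Assumption: the label is everything before the first comma
  labels <- sub(",.*$", "", dirs)
  hits <- dirs[labels == assembly]
  if (length(hits) == 0L) stop("No assembly directory found for: ", assembly)
  if (length(hits) > 1L) stop("Assembly label is not unique: ", assembly)
  file.path(orgPath, hits)
}

# Example on a throwaway directory tree
root <- file.path(tempdir(), "annotationData", "organisms")
dir.create(file.path(root, "Homo_sapiens", "GRCh37,hg19"),
           recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(root, "Homo_sapiens", "GRCh38"),
           recursive = TRUE, showWarnings = FALSE)
path <- findAssemblyDir("Homo_sapiens", "GRCh37", root = root)
print(basename(path))
```

Failing loudly when two directories share the same label enforces the requirement that the GRC label be unique.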

It might be that one has multiple sub-alternatives, e.g.

annotationData/organisms/Homo_sapiens/GRCh37,hg19/Ensembl/71/
annotationData/organisms/Homo_sapiens/GRCh37,hg19/Ensembl/75/

Then the following request is ambiguous (unless one defines some unique ordering and picks the "most recent" one):

gtf <- GtfDataFile(organism="Homo_sapiens", assembly="GRCh37")
# Or equivalently
gtf <- GtfDataFile(organism=organism(fa), assembly=assembly(fa))
# Or short
gtf <- GtfDataFile(organism=fa)
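
One possible way to define the "most recent" ordering is to compare release labels numerically. This is only a sketch: `pickLatestRelease()` is a hypothetical helper, and it assumes that release directories are named by plain numbers, as with the Ensembl releases above:

```r
# Hypothetical helper (not part of aroma.seq): pick the most recent release
# among directory names, assuming the names are plain numbers.
pickLatestRelease <- function(releases) {
  versions <- suppressWarnings(as.numeric(releases))
  if (all(is.na(versions))) stop("No numeric release labels to order")
  # which.max() discards NAs, so non-numeric names are ignored
  releases[which.max(versions)]
}

latest <- pickLatestRelease(c("71", "75"))
print(latest)
```

Comparing numerically rather than as strings matters once releases reach three digits, since "100" sorts before "75" as text.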

To specify a particular Ensembl release, e.g. release 75, one could use:

gtf <- GtfDataFile(organism="Homo_sapiens", assembly="GRCh37", sub=c("Ensembl", "75"))
# Or equivalently
gtf <- GtfDataFile(organism=fa, sub=c("Ensembl", "75"))
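
For illustration, the `sub` argument could simply be appended, component by component, to the assembly directory. The helper `annotationPath()` below is hypothetical and assumes the assembly directory name (including any tags) has already been resolved:

```r
# Hypothetical helper (not part of aroma.seq): build the annotation path
# from organism, resolved assembly directory, and optional sub-components.
annotationPath <- function(organism, assemblyDir, sub = NULL,
                           root = "annotationData/organisms") {
  # c() drops a NULL 'sub', so the same code handles both cases
  do.call(file.path, as.list(c(root, organism, assemblyDir, sub)))
}

p <- annotationPath("Homo_sapiens", "GRCh37,hg19", sub = c("Ensembl", "75"))
print(p)
```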