denalitherapeutics / archs4

An R interface to query and extract data from the ARCHS4 data

Create new gene-level metadata files from full ensembl annotations #1

Closed lianos closed 6 years ago

lianos commented 6 years ago

We store the gene (and soon transcript) level augmented feature information in the same directory that the data is stored in (getOption("archs4.datadir")).

Currently, the gene-level metadata is simply copied from the GenomicsStudyDb package. Those data were generated from the GENCODE-basic annotations, but we probably want to recreate them from the full Ensembl transcript files.

We should be able to create an archs4-specific feature table by first parsing the Ensembl transcript identifiers from the transcript-level HDF5 files, then rolling them up to Ensembl gene IDs with their associated gene symbols, and finally mapping those Ensembl-derived gene symbols to the organism_matrix.h5 gene-level count files.
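As a rough illustration of the roll-up and mapping steps, here is a minimal base-R sketch on toy data. Everything in it is an assumption made for illustration: the identifiers are made up, `tx_info` stands in for whatever annotation is parsed from the transcript-level files, `matrix_symbols` stands in for the gene symbols stored in a gene-level count file, and the column names (`h5idx`, `n_transcripts`) are hypothetical.

```r
# Toy transcript-level annotation (made-up IDs): one row per transcript.
tx_info <- data.frame(
  transcript_id = c("ENST0001", "ENST0002", "ENST0003", "ENST0004"),
  gene_id       = c("ENSG0001", "ENSG0001", "ENSG0002", "ENSG0003"),
  symbol        = c("GENEA", "GENEA", "GENEB", "GENEC"),
  stringsAsFactors = FALSE
)

# Toy stand-in for the gene symbols found in a gene-level count file,
# in the order they appear there.
matrix_symbols <- c("GENEB", "GENEA", "GENED")

# Roll transcripts up to one row per gene, counting transcripts per gene.
gene_info <- unique(tx_info[, c("gene_id", "symbol")])
tx_counts <- table(tx_info$gene_id)
gene_info$n_transcripts <- as.integer(tx_counts[gene_info$gene_id])

# Map the count-file symbols onto the rolled-up gene table.
# Symbols with no match (here "GENED") get NA in the annotation columns.
idx <- match(matrix_symbols, gene_info$symbol)
feature_table <- data.frame(
  h5idx         = seq_along(matrix_symbols),
  symbol        = matrix_symbols,
  gene_id       = gene_info$gene_id[idx],
  n_transcripts = gene_info$n_transcripts[idx],
  stringsAsFactors = FALSE
)

print(feature_table)
```

Note that `match()` only returns the first hit, so a symbol shared by several Ensembl gene IDs would need an explicit tie-breaking rule; the sketch does not address that.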

lianos commented 6 years ago

Let's call this function create_augmented_gene_info and simply run it over a datadir.
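One possible shape for such a function is sketched below. It is only a skeleton under stated assumptions, not the package's implementation: the signature, the `_matrix.h5` file-name pattern used to discover organisms, and the `_gene_info.csv` output name are all hypothetical. The step that actually builds each feature table is deliberately left out.

```r
# Hypothetical skeleton: discovers gene-level count files in a datadir and
# plans one output file per organism. It does not read any HDF5 content.
create_augmented_gene_info <- function(datadir = getOption("archs4.datadir")) {
  if (is.null(datadir) || !dir.exists(datadir)) {
    stop("`datadir` must point to an existing directory")
  }

  # Assumed naming convention: "<organism>_matrix.h5".
  h5_files  <- list.files(datadir, pattern = "_matrix\\.h5$", full.names = TRUE)
  organisms <- sub("_matrix\\.h5$", "", basename(h5_files))

  # Assumed (hypothetical) output name, stored next to the data files.
  out_files <- file.path(datadir, sprintf("%s_gene_info.csv", organisms))

  # Placeholder: building and writing each feature table would happen here.
  data.frame(
    organism    = organisms,
    matrix_file = h5_files,
    output_file = out_files,
    stringsAsFactors = FALSE
  )
}
```

Defaulting `datadir` to `getOption("archs4.datadir")` keeps the output alongside the data, which matches where the augmented feature information is stored.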