How to bring in custom annotations (BSgenome, TxDb)?

mirax87 commented 3 years ago

Hi,

thanks for this interesting tool. I am currently trying to get Ularcirc to run with some of my data.

Unfortunately, the reference genome I used for the alignments doesn't match the UCSC chromosome naming conventions, so I thought of creating my own BSgenome and TxDb. I have already forged the BSgenome; the TxDb is yet to come.

For now, with the BSgenome loaded into the namespace, I tried to find it in the Shiny app under Setup configuration. My custom BSgenome was not listed. I imagine this is due to the missing TxDb (yet to be produced).

My question for you: is it already possible to bring in a custom genome + annotation, and if so, how can I achieve that?

best, -Michael

davhum commented 3 years ago

In theory it should be possible to bring in a custom genome + annotation. However, it requires that an annotation database is available, i.e. Ularcirc first searches for annotation database libraries that are named as follows:

org.<two-letter code>.eg.db

so for humans this is

org.Hs.eg.db

The two-letter code is then used to identify matching genome and transcript databases.

If an annotation database library exists for your organism, then it sounds like you are very close to having all the required items.
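
A minimal sketch of the naming lookup described above, written in Python purely for illustration (Ularcirc itself is an R package). The function names, the regular expression, and above all the rule used to match BSgenome/TxDb package names are assumptions made for this example, not Ularcirc's actual code:

```python
import re

# Assumed pattern for annotation packages: org.<two-letter code>.eg.db
ORG_DB_PATTERN = re.compile(r"^org\.([A-Z][a-z])\.eg\.db$")


def organism_code(package_name):
    """Return the two-letter code from an org.XX.eg.db name, or None."""
    match = ORG_DB_PATTERN.match(package_name)
    return match.group(1) if match else None


def matching_packages(code, installed):
    """Hypothetical matching rule: keep BSgenome/TxDb packages whose
    second dot-separated field starts with the two-letter code."""
    hits = []
    for name in installed:
        parts = name.split(".")
        is_genome_or_txdb = parts[0] in ("BSgenome", "TxDb")
        if is_genome_or_txdb and len(parts) > 1 and parts[1].startswith(code):
            hits.append(name)
    return hits


# Example list of installed package names (illustrative only).
installed = [
    "org.Hs.eg.db",
    "BSgenome.Hsapiens.UCSC.hg38",
    "TxDb.Hsapiens.UCSC.hg38.knownGene",
    "BSgenome.dm6.ensembl",
]

code = organism_code("org.Hs.eg.db")
print(code, matching_packages(code, installed))
```

Under this assumed rule, a package named BSgenome.dm6.ensembl would not be found for the code Dm, because its second field is "dm6". Whether the real lookup behaves the same way would need to be checked against the Ularcirc source.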

mirax87 commented 3 years ago

What about the BSgenome and TxDb? They seem to be mandatory as well. Also, where does the annotation database need to be? It's checking somewhere online, right?

If a local installation of the database is possible, it would be great to have a wrapper where the user provides the genome fasta, the genome annotation file (e.g. gtf), and whatever else might be necessary, to bring in custom annotations suitable for Ularcirc. Would that be feasible?

davhum commented 3 years ago

Agreed, having a wrapper is a good idea - but I am unsure what is involved for some of those files. I have experience in making a TxDb from a gtf, but have not generated a genome or annotation database. You mentioned you had generated a genome file; was that easy to do? I suspect the annotation database is the most involved.

Perhaps another solution to your problem is to convert your alignment coordinates to UCSC coordinates. I could make a wrapper for that. If you could generate a small test dataset, I could write a simple method to convert it to a format that is compatible with existing databases.
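
As a rough illustration of what such a conversion could look like, here is a small Python sketch that renames sequence names to UCSC style. It assumes tab-delimited input with the chromosome name in the first column, and that adding a "chr" prefix (plus a small table of special cases such as MT) is sufficient. Both are assumptions; real assemblies usually need a full lookup table for scaffolds and other non-standard sequences.

```python
def to_ucsc_name(seqname, special=None):
    """Map an Ensembl-style sequence name to a UCSC-style one.

    `special` holds names that do not follow the simple prefix rule.
    The default entry (MT -> chrM) is an assumed example.
    """
    if special is None:
        special = {"MT": "chrM"}
    if seqname in special:
        return special[seqname]
    if seqname.startswith("chr"):
        return seqname  # already UCSC style
    return "chr" + seqname


def convert_line(line):
    """Rename the first (chromosome) column of one tab-delimited record."""
    fields = line.rstrip("\n").split("\t")
    fields[0] = to_ucsc_name(fields[0])
    return "\t".join(fields)


# Illustrative records only, not real data.
for record in ["2L\t100\t200", "MT\t5\t50", "chrX\t1\t10"]:
    print(convert_line(record))
```

Only the names change here; the coordinates themselves are left untouched, which is valid only when both references are the same assembly.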

mirax87 commented 3 years ago

I thought about converting the alignments - or even remapping - but the downstream effects of the conversion would be too costly for me, as I am using more tools for circRNA prediction and quantification (mostly from the CIRI world). Thank you for the offer, though.

Regarding the BSgenome, I think it's not too tricky and believe it can be automated (in a wrapper). The BSgenome package has documentation on how to forge a new one. In brief, you create a sort of dictionary (seed.dcf) with all relevant BSgenome information and compile it with BSgenome::forgeBSgenomeDataPkg. There are also forums and discussions around that can be of help. Here is the BSgenome documentation; look for 'How to forge a BSgenome data package'.

https://bioconductor.org/packages/release/bioc/html/BSgenome.html

This is what the seed.dcf file looks like in my case, but I cannot guarantee that these are the minimum specs:

Package: BSgenome.dm6.ensembl
Title: "dm6 from local repository"
Description: "compatible with snakePipes alignments"
Version: 0.999                                            # random number
organism: Drosophila_melanogaster
common_name: Fruitfly
provider: FlyBase
provider_version: dm6
release_name: dm6
release_date: 2018_03
source_url: <path to fasta directory>
organism_biocview: dm6_ensembl
BSgenomeObjname: dm6_ensembl
seqs_srcdir: <path to fasta directory>
seqfile_name: genome.2bit                                  # genome in 2bit
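
One piece of the wrapper idea could be generating this seed file from user input. The following Python sketch renders the fields shown above as "key: value" lines. The helper names (render_seed, write_seed) are hypothetical, the values are copied from the file above without the inline annotations, and the two path placeholders are left unfilled on purpose. The resulting file would still have to be passed to BSgenome::forgeBSgenomeDataPkg in R.

```python
from pathlib import Path

# Values copied from the seed file above; path placeholders must be
# replaced by the user before forging.
SEED_FIELDS = {
    "Package": "BSgenome.dm6.ensembl",
    "Title": '"dm6 from local repository"',
    "Description": '"compatible with snakePipes alignments"',
    "Version": "0.999",
    "organism": "Drosophila_melanogaster",
    "common_name": "Fruitfly",
    "provider": "FlyBase",
    "provider_version": "dm6",
    "release_name": "dm6",
    "release_date": "2018_03",
    "source_url": "<path to fasta directory>",
    "organism_biocview": "dm6_ensembl",
    "BSgenomeObjname": "dm6_ensembl",
    "seqs_srcdir": "<path to fasta directory>",
    "seqfile_name": "genome.2bit",
}


def render_seed(fields):
    """Render the fields as one 'key: value' line each."""
    for key, value in fields.items():
        if "\n" in str(value):
            raise ValueError(f"field {key!r} must be a single line")
    return "".join(f"{key}: {value}\n" for key, value in fields.items())


def write_seed(fields, path):
    """Write the rendered seed file to `path` and return the path."""
    path = Path(path)
    path.write_text(render_seed(fields))
    return path


print(render_seed(SEED_FIELDS))
```

Keeping the fields in a dictionary makes it straightforward to fill in the organism-specific values from command-line arguments later.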

Genome fasta to 2bit conversion

https://genome.ucsc.edu/goldenPath/help/twoBit.html

VCCRI / Ularcirc
