Annotations file - Githubissues

Ben7124 commented 2 years ago

Hello,

For the human genome, should the required annotations file be the one from Ensembl or RefSeq, or should it come from miRBase or a similar database?

Thank you.

AlexTate commented 2 years ago

Hey @Ben7124, thank you for your question.

miRBase

We offer human HG38 files (annotations and bowtie indexes) for download on the lab's website. The download includes a lightly modified GFF3 file from miRBase v22 that is already compatible with tinyRNA; its differences from the original are explained in the header.

Ensembl

These annotation files will require some editing to make discontinuous feature definitions compatible with tinyRNA (see our requirements for reference annotations). Currently, we require each feature to define its ID with an ID= attribute in column 9. Here are the types (column 3) of features missing an ID=, and their abundance, in GRCh38.106.chr.gff3:

Feature Type	Count
biological_region	180084
exon	1572331
five_prime_UTR	168139
three_prime_UTR	195144

Among these feature types, the last 3 define a Parent= but do not define an ID=. We avoid defaulting the ID to the Parent's because attributes and intervals of discontinuous features are merged with the root Parent before evaluation by your selection rules. Merging intervals of 5'/3' UTRs with your features of interest would likely not make sense for your analysis.
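A tally like the one in the table above can be reproduced for any GFF3 file with a short awk script. This is only a minimal sketch, not necessarily how the counts above were produced; the function name count_missing_ids is hypothetical, and it assumes tab-separated GFF3 with semicolon-delimited column 9 attributes.

```shell
# Hypothetical helper: count features, per type (column 3), whose
# column 9 attributes do not contain an ID= entry.
count_missing_ids() {
  awk -F'\t' '
    /^#/ { next }                                  # skip comments and directives
    NF >= 9 && $9 !~ /(^|;)ID=/ { n[$3]++ }        # ID= must start an attribute
    END { for (t in n) printf "%s\t%d\n", t, n[t] }
  ' "$1" | sort
}

# Usage: count_missing_ids YOUR-FILE.gff3
```

Anchoring the pattern on the start of the column or a preceding semicolon keeps attributes such as exon_id= or transcript_id= from being mistaken for ID=.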

Additionally, if your annotation file contains track lines, they will need to be removed before you can use the file with tinyRNA.
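One way to remove track lines without editing the file by hand is sketched below. It assumes the track lines are UCSC-style lines that begin with the word "track"; the function name strip_track_lines and the output filename are hypothetical.

```shell
# Hypothetical helper: write a copy of a GFF file with track lines removed.
# Assumes track lines start with the word "track" followed by whitespace
# (or nothing), so sequence IDs that merely begin with "track" are kept.
strip_track_lines() {
  grep -Ev '^track([[:space:]]|$)' "$1" > "$2"
}

# Usage: strip_track_lines YOUR-FILE.gff3 YOUR-FILE.no-tracks.gff3
```

Writing to a new file leaves the original annotation untouched, so the two can be compared before the cleaned copy is used.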

RefSeq

As far as I can tell, there shouldn't be any issues with this source.

AlexTate commented 2 years ago

@Ben7124 I'd also like to add that for Ensembl, or any other annotation source that you might want to filter by feature type (GFF column 3), it might prove easier to use tiny-count's type filters rather than editing your source GFF files. The type filters are inclusive, so features will only be considered for selection if their type matches one of the types you define; an empty list allows all types. Type filter values can be assigned in the Run Config under the counter_type_filter key.

You could use the following to obtain a complete list of the unique types defined in your GFF file:

grep -v '^#' YOUR-FILE.gff3 | cut -f3 | sort | uniq

Edit 8/29: if your GFF annotations aren't sorted by type, the output of cut needs to be piped to sort before being passed to uniq, as the command above does.
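If it would also help to see how many features each type has before choosing values for counter_type_filter, a small variant of the command above can report counts. This is a sketch only; the function name list_type_counts is hypothetical.

```shell
# Hypothetical helper: list each feature type (column 3) with its count,
# most abundant first. Lines starting with "#" are skipped.
list_type_counts() {
  grep -v '^#' "$1" | cut -f3 | sort | uniq -c | sort -rn
}

# Usage: list_type_counts YOUR-FILE.gff3
```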

AlexTate commented 2 years ago

Closing issue due to inactivity

MontgomeryLab / tinyRNA

Annotations file #211
