Closed Ben7124 closed 2 years ago
Hey @Ben7124, thank you for your question.
We offer human HG38 files (annotations and bowtie indexes) for download on the lab's website. They include a lightly modified GFF3 file from miRBase v22, which is already compatible with tinyRNA; the differences from the original are explained in its header.
Ensembl annotation files will require some editing to make discontinuous feature definitions compatible with tinyRNA (see our requirements for reference annotations). Currently, we require that each feature defines its ID with an `ID=` attribute in column 9. Here are the types (column 3) of features missing an `ID=` attribute, and their counts, in GRCh38.106.chr.gff3:
Feature Type | Count |
---|---|
biological_region | 180084 |
exon | 1572331 |
five_prime_UTR | 168139 |
three_prime_UTR | 195144 |
Among these feature types, the last three define a `Parent=` attribute but do not define an `ID=` attribute. We avoid defaulting the ID to the Parent's because attributes and intervals of discontinuous features are merged with the root Parent before evaluation by your selection rules. Merging intervals of 5'/3' UTRs with your features of interest would likely not make sense for your analysis.
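To check your own annotation file for the same problem, a tally like the one in the table can be produced with awk. This is only a minimal sketch: `count_missing_id` is a hypothetical helper name, not part of tinyRNA, and it assumes a tab-delimited GFF3 in which column 9 attributes are separated by semicolons.

```shell
# Hypothetical helper (not part of tinyRNA): count features whose column 9
# has no ID= attribute, grouped by feature type (column 3).
# Reads GFF3 from stdin. Lines starting with '#' and lines with fewer
# than 9 tab-separated columns are skipped.
count_missing_id() {
    awk -F'\t' '
        /^#/ || NF < 9 { next }
        # ID= must appear at the start of column 9 or right after a semicolon
        $9 !~ /(^|;)ID=/ { missing[$3]++ }
        END { for (t in missing) printf "%s\t%d\n", t, missing[t] }
    ' | sort
}

# Usage (YOUR-FILE.gff3 is a placeholder):
#   count_missing_id < YOUR-FILE.gff3
```

Each output line is a feature type followed by the number of records of that type that lack an `ID=` attribute.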
Additionally, if your annotation file contains track lines, they will need to be removed before you can use the file with tinyRNA.
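One way to drop track lines is a simple grep filter. The sketch below assumes track lines follow the UCSC convention of starting with the word `track`; `strip_track_lines` is a hypothetical helper name, and the file names are placeholders.

```shell
# Hypothetical helper: remove track lines from a GFF file read on stdin.
# Assumes track lines start with the word "track" (UCSC convention).
strip_track_lines() {
    grep -Ev '^track([[:space:]]|$)'
}

# Usage (file names are placeholders):
#   strip_track_lines < YOUR-FILE.gff3 > YOUR-FILE.no-tracks.gff3
```

Writing to a new file rather than editing in place keeps the original annotation available for comparison.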
As far as I can tell there shouldn't be any issues with this source
@Ben7124 I'd also like to add that for Ensembl or any other annotation source that you might want to filter by feature type (GFF column 3), it may be easier to use tiny-count's type filters than to edit your source GFF files. The type filters are inclusive: a feature is only considered for selection if its type matches one of the types you define, and an empty list allows all types. Type filter values are assigned in the Run Config under the `counter_type_filter` key.
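To preview which records an inclusive type filter would let through, the behavior described above can be mimicked on the command line. This is a rough sketch and not tiny-count's implementation: `preview_type_filter` is a hypothetical helper, and it only reproduces the two rules stated here (keep matching types; with no types given, keep everything).

```shell
# Hypothetical helper mimicking an inclusive type filter (not tiny-count code).
# Arguments: zero or more feature types. Reads GFF3 from stdin.
# With no arguments every feature line is printed; otherwise only
# features whose column 3 equals one of the given types.
preview_type_filter() {
    awk -F'\t' -v types="$*" '
        BEGIN {
            n = split(types, list, " ")
            for (i = 1; i <= n; i++) keep[list[i]] = 1
        }
        /^#/ || NF < 9 { next }
        n == 0 || ($3 in keep)
    '
}

# Usage (the file name and the types shown are placeholders):
#   preview_type_filter miRNA miRNA_primary_transcript < YOUR-FILE.gff3 | wc -l
```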
You could use the following to obtain a complete list of the unique types defined in your GFF file:
grep -v '^#' YOUR-FILE.gff3 | cut -f3 | sort | uniq
Edit 8/29: if your GFF annotations aren't sorted by type, the output of cut needs to be piped to sort before being passed to uniq
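If you also want to know how many records each type has, which can help when deciding what to put in the type filter, the same pipeline can be extended with `uniq -c`. A small sketch, with `type_counts` as a hypothetical helper name:

```shell
# Hypothetical helper: count records per feature type (column 3),
# most abundant first. Reads GFF3 from stdin; '#' lines are skipped.
type_counts() {
    grep -v '^#' | cut -f3 | sort | uniq -c | sort -k1,1nr
}

# Usage (YOUR-FILE.gff3 is a placeholder):
#   type_counts < YOUR-FILE.gff3
```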
Closing issue due to inactivity
Hello,
For the human genome, should the required annotation file come from Ensembl or RefSeq, or from miRBase or a similar database?
Thank you.