Closed by mumichae 3 years ago
Hi, thanks for the question. The labels indicate where the genomic location of the sequence comes from. The value is either 'alignment', meaning we aligned the sequence to the genome ourselves, or 'expert-database', meaning an expert database provided the location of the sequence in the genome. We only align a sequence if none of our expert databases has given us a location for it, because not all of the databases that provide sequences also provide locations. For example, ENA provides many sequences but no locations for them, while other resources like GtRNAdb always provide locations for their sequences.
piRBase is a slightly odd example to look at. We only import a subset of their data because of the number of sequences they have, and while they do provide coordinates, these aren't for an assembly we know about, though they can be converted to one.
I agree the naming isn't really clear, and we should probably have a help page somewhere that explains what these values mean. Hopefully this clarifies things.
Thanks, this was helpful!
The human GTF contains an attribute called 'source', which for each gene has the value 'alignment' or 'expert-database'. I haven't found a clear explanation of what each of these values means, e.g. are 'alignment' genes less curated than 'expert-database' ones?
For example, the piRNA class has 142620 transcripts, but none of them are from an 'expert-database', although they all come from piRBase. Is piRBase not considered an expert database in this case? If so, what determines whether a database/transcript is 'expert' or not?
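For anyone who wants to check the breakdown of 'source' values themselves, here is a minimal Python sketch. It assumes the value is stored in the ninth (attributes) column as a GTF-style `source "alignment";` pair and that the rows of interest have the feature type `transcript`; both are assumptions about the file layout, not something confirmed in this thread. The function names and the example records (IDs, coordinates, the `RNAcentral` value in column 2) are made up for illustration.

```python
import re
from collections import Counter

# Matches GTF-style attribute pairs such as: source "alignment";
ATTRIBUTE_PATTERN = re.compile(r'(\w+) "([^"]*)"')


def parse_attributes(field):
    """Turn a GTF attributes column into a dict of key -> value."""
    return dict(ATTRIBUTE_PATTERN.findall(field))


def count_sources(lines, feature="transcript"):
    """Count the 'source' attribute values over rows of one feature type."""
    counts = Counter()
    for line in lines:
        # Skip blank lines and comment/header lines.
        if not line.strip() or line.startswith("#"):
            continue
        columns = line.rstrip("\n").split("\t")
        # A GTF row has 9 tab-separated columns; column 3 is the feature type.
        if len(columns) < 9 or columns[2] != feature:
            continue
        attributes = parse_attributes(columns[8])
        counts[attributes.get("source", "missing")] += 1
    return counts


# Made-up example records, not real data.
EXAMPLE = [
    "# hypothetical header line",
    "\t".join(["chr1", "RNAcentral", "transcript", "100", "130", ".", "+", ".",
               'transcript_id "EXAMPLE_1"; source "alignment";']),
    "\t".join(["chr1", "RNAcentral", "exon", "100", "130", ".", "+", ".",
               'transcript_id "EXAMPLE_1"; source "alignment";']),
    "\t".join(["chr2", "RNAcentral", "transcript", "500", "572", ".", "-", ".",
               'transcript_id "EXAMPLE_2"; source "expert-database";']),
    "\t".join(["chr3", "RNAcentral", "transcript", "900", "928", ".", "+", ".",
               'transcript_id "EXAMPLE_3"; source "alignment";']),
]

if __name__ == "__main__":
    for value, count in sorted(count_sources(EXAMPLE).items()):
        print(value, count)
```

To run it on a real file, pass an open file handle instead of the example list, e.g. `count_sources(open(path))` with `path` pointing at the downloaded GTF. Rows without a 'source' attribute are counted under `missing` rather than dropped, so a different attribute layout shows up in the output instead of silently giving zero.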