Closed AlexTate closed 2 years ago
Update 10/11:
We are removing the requirement for stranded features. Currently, a feature's strand is mapped from +/- to True/False. Now, if a feature's strand is anything but +/-, it is mapped to None. The GFFValidator produces a warning about this but no longer treats it as a hard error. Per Tai, a strand of None matches the strand selectors "sense", "antisense", and "both". 5'/3' anchored selectors can also evaluate these features, but evaluation does not distinguish between the 5' and 3' ends.
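As a rough illustration of the strand handling described above, here is a minimal Python sketch. The function names `parse_strand` and `strand_matches` are hypothetical and are not taken from the tiny-count codebase, and the sketch assumes that reads always carry a strand (True for +, False for -).

```python
from typing import Optional

VALID_SELECTORS = ("sense", "antisense", "both")


def parse_strand(gff_strand: str) -> Optional[bool]:
    """Map a GFF strand column to True ('+'), False ('-'), or None (anything else)."""
    return {"+": True, "-": False}.get(gff_strand)


def strand_matches(selector: str, feature_strand: Optional[bool], read_strand: bool) -> bool:
    """Evaluate a strand selector for one read/feature pair (hypothetical helper)."""
    if selector not in VALID_SELECTORS:
        raise ValueError(f"Unknown strand selector: {selector!r}")
    # Unstranded features (None) match every strand selector
    if selector == "both" or feature_strand is None:
        return True
    if selector == "sense":
        return read_strand == feature_strand
    return read_strand != feature_strand  # antisense
```

With this sketch, a feature whose strand column is `.` parses to None and matches "sense", "antisense", and "both" alike.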
Update 10/12: introducing `anchored` as an overlap selector. If nothing else, this should help explain the behavior of 5'/3' anchored selectors with unstranded features. Documentation has been updated to reflect the changes described above.
Tested on ram and At data with the Parent attribute only, and with and without strand info.
GFF validation now takes place at the start of end-to-end pipeline runs, and at the start of tiny-count when it is called as a standalone step. This PR also introduces support for unstranded features which are represented internally with the value None, as well as a new overlap selector: `anchored`
Strandedness and an appropriate ID attribute are checked for on a per-feature basis. After parsing all GFF files, the total set of chromosome identifiers is checked against the user's sequence files. In order of priority, sequence files include bowtie indexes, reference genomes, and alignment SAM files. The first two options can state with certainty that there isn't chromosome overlap between GFF and sequence files, and if that is determined to be the case, an error is issued and the script quits. The third option uses a 50,000 line sample from each SAM file as a heuristic, and if it fails to find chromosome overlap, a warning is issued and the script continues with normal execution. A warning is issued for unstranded features before proceeding with counting.
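The chromosome check described above can be sketched roughly as follows. This is an illustration only: the names `sample_sam_chroms` and `check_chrom_overlap` are hypothetical, and the sketch assumes that header lines count toward the 50,000 line sample and that only the RNAME column of alignment records is inspected. The actual validator may differ on both points.

```python
import itertools
from typing import Iterable, Set

SAM_SAMPLE_LINES = 50_000  # sample size stated in the description above


def sample_sam_chroms(sam_lines: Iterable[str], sample_size: int = SAM_SAMPLE_LINES) -> Set[str]:
    """Collect reference names (RNAME, column 3) from the first `sample_size` lines."""
    chroms = set()
    for line in itertools.islice(sam_lines, sample_size):
        if line.startswith("@"):
            continue  # header line: skipped here, but still counted in the sample
        fields = line.rstrip("\n").split("\t")
        if len(fields) > 2 and fields[2] != "*":  # "*" marks an unaligned read
            chroms.add(fields[2])
    return chroms


def check_chrom_overlap(gff_chroms: Set[str], seq_chroms: Set[str], certain: bool) -> str:
    """Return 'ok' if any chromosome is shared, otherwise 'error' or 'warning'.

    `certain` is True for bowtie indexes and reference genomes, and False for
    the SAM sample, which can only serve as a heuristic.
    """
    if gff_chroms & seq_chroms:
        return "ok"
    return "error" if certain else "warning"
```

A caller would pass an open SAM file handle as `sam_lines`, then quit on "error" and continue with a logged message on "warning".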
`Parent` is now accepted as an ID attribute if `ID` and `gene_id` are missing. Features describing entire chromosomes are also skipped; Ensembl supplies these in their gff3 files and, in addition to not being useful as a selection target in tiny-count, they also lack strand information so they were throwing errors.

Closes #235