MontgomeryLab / tinyRNA

tinyRNA provides an all-in-one solution for precision analysis of sRNA-seq data. At the core of tinyRNA is a highly flexible counting utility, tiny-count, that allows for hierarchical assignment of reads to features based on positional information, extent of feature overlap, 5’ nucleotide, length, and strandedness.
GNU General Public License v3.0

tiny-count: GFF validation and reliability improvements #236

Closed AlexTate closed 2 years ago

AlexTate commented 2 years ago

GFF validation now takes place at the start of end-to-end pipeline runs, and at the start of tiny-count when it is called as a standalone step. This PR also introduces support for unstranded features, which are represented internally with the value None, as well as a new overlap selector: anchored.

Strandedness and an appropriate ID attribute are checked on a per-feature basis. After all GFF files are parsed, the full set of chromosome identifiers is checked against the user's sequence files. In order of priority, these are bowtie indexes, reference genomes, and alignment SAM files. The first two can establish with certainty that the GFF and sequence files share no chromosome identifiers; when that is the case, an error is issued and the script quits. The third uses a 50,000-line sample from each SAM file as a heuristic; if it finds no shared chromosome identifiers, a warning is issued and execution continues normally. A warning is also issued for unstranded features before counting proceeds.
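For reviewers, here is a rough sketch of the error-versus-warning behavior described above. It is not the code in this PR: `check_chrom_overlap` and `sample_sam_chroms` are hypothetical names, and counting only alignment records (not header lines) toward the 50,000-line sample is an assumption.

```python
import warnings


def check_chrom_overlap(gff_chroms, seq_chroms, definitive):
    """Return the chromosome identifiers shared by GFF and sequence files.

    definitive=True models bowtie indexes and reference genomes, which list
    every chromosome, so an empty intersection is a hard error.
    definitive=False models a SAM sample, where an empty intersection only
    warrants a warning because the sample may have missed chromosomes.
    """
    shared = set(gff_chroms) & set(seq_chroms)
    if shared:
        return shared

    msg = "GFF files and sequence files share no chromosome identifiers."
    if definitive:
        raise ValueError(msg)
    warnings.warn(msg + " (heuristic check based on a SAM sample)")
    return shared


def sample_sam_chroms(lines, limit=50_000):
    """Collect RNAME values (column 3) from at most `limit` alignment records.

    Assumption: header lines (starting with '@') do not count toward the limit.
    """
    chroms = set()
    seen = 0
    for line in lines:
        if line.startswith("@"):
            continue
        if seen >= limit:
            break
        seen += 1
        cols = line.rstrip("\n").split("\t")
        if len(cols) > 2 and cols[2] != "*":  # '*' marks an unaligned read
            chroms.add(cols[2])
    return chroms
```

With a SAM sample, a mismatch such as `{"I"}` versus `{"chrI"}` produces a warning and an empty set; with a reference genome the same mismatch raises.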

Parent is now accepted as an ID attribute if ID and gene_id are missing. Features describing entire chromosomes are now skipped as well. Ensembl includes these in its GFF3 files; they are not useful as selection targets in tiny-count, and because they lack strand information they were throwing errors.
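A minimal sketch of the ID fallback and the chromosome skip, for illustration only. The function names are hypothetical, the relative order of ID and gene_id is an assumption (the description only states that Parent comes last), and detecting whole-chromosome features by a type column of `chromosome` is an assumption about how the check could be done.

```python
def pick_feature_id(attrs):
    """Return the first usable identifier from a feature's attribute dict.

    Assumed priority: ID, then gene_id, then Parent. Returns None when none
    of the three is present, which a validator could report as an error.
    """
    for key in ("ID", "gene_id", "Parent"):
        value = attrs.get(key)
        if value:
            return value
    return None


def is_whole_chromosome(feature_type):
    """Assumption: whole-chromosome records are identified by their type column."""
    return feature_type == "chromosome"


def usable_features(records):
    """Yield (feature_id, record) pairs, skipping whole-chromosome records.

    Each record is a dict with 'type' and 'attrs' keys (a hypothetical layout).
    """
    for rec in records:
        if is_whole_chromosome(rec["type"]):
            continue
        yield pick_feature_id(rec["attrs"]), rec
```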

Closes #235

AlexTate commented 2 years ago

Update 10/11:

We are removing the requirement for stranded features. Currently, strand is mapped from +/- to True/False. Now, if a feature's strand is anything but +/-, it is mapped to None. The GFFValidator produces a warning about this but no longer treats it as a hard error. Per Tai, a strand type of None matches strand selectors for "sense", "antisense", and "both". The 5'/3' anchored selectors can also evaluate these features, but evaluation does not distinguish between 5' and 3' ends.
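To make the intended semantics concrete, here is a small sketch of the strand handling described in this update. It is an illustration under assumptions, not the PR's implementation: `parse_strand` and `strand_matches` are hypothetical names, and alignments are assumed to carry a True/False strand of their own.

```python
STRAND_MAP = {"+": True, "-": False}
STRAND_SELECTORS = ("sense", "antisense", "both")


def parse_strand(gff_strand):
    """Map a GFF strand column to True ('+'), False ('-'), or None (anything else)."""
    return STRAND_MAP.get(gff_strand)


def strand_matches(selector, feature_strand, alignment_strand):
    """Evaluate a strand selector for one feature/alignment pair.

    An unstranded feature (None) matches every selector.
    """
    if selector not in STRAND_SELECTORS:
        raise ValueError(f"Unknown strand selector: {selector!r}")
    if selector == "both" or feature_strand is None:
        return True
    same_strand = feature_strand == alignment_strand
    return same_strand if selector == "sense" else not same_strand
```

For example, a feature with strand `.` parses to None and then passes "sense", "antisense", and "both" regardless of the alignment's strand.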

AlexTate commented 2 years ago

Update 10/12: introducing anchored as an overlap selector. If nothing else, this should help explain the behavior of 5'/3' anchored selectors with unstranded features. Documentation has been updated to reflect the changes described above.

taimontgomery commented 2 years ago

Tested on ram and At data with the Parent attribute only and without strand info.