Position filtering - Githubissues

csoneson commented 1 month ago

[x] sequence context (done in af66b27676da764d7226f055b3acd2fc91aeb435)
[x] coverage (currently inside the calcReadStats function) (done in 26c7e41)
[x] collapsing/pruning repeated positions (done in fe95936)
[x] remove positions that are NA in all reads in all samples (done in 8ad2d22)
[x] each of these goes into a separate helper function
[x] new function filterPositions() - calls one or more of the helper functions (done in b1e79e5)
[x] call read-NA-filtering at the end (turned on by default) to remove reads that are now all NA in the retained positions (done in 10d37e08b488f3368244fd38ad4081da3afaba8d)
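
For illustration, the last two NA-related steps (dropping positions that are NA everywhere, then dropping reads that are left with only NA values) could be sketched as below. This is plain Python, not the package's R code: the nested-list data layout and the names `non_na_positions`, `subset_positions` and `drop_all_na_reads` are assumptions made for the example.

```python
from typing import Dict, List, Optional

# Assumed layout: one matrix per sample, rows = positions, columns = reads,
# with None standing in for NA.
Matrix = List[List[Optional[float]]]


def non_na_positions(assays: Dict[str, Matrix]) -> List[bool]:
    """Logical keep vector: True where any read in any sample has a value."""
    n_pos = len(next(iter(assays.values())))
    return [
        any(v is not None for mat in assays.values() for v in mat[i])
        for i in range(n_pos)
    ]


def subset_positions(assays: Dict[str, Matrix], keep: List[bool]) -> Dict[str, Matrix]:
    """Keep only the rows (positions) flagged in `keep`, in every sample."""
    return {
        name: [row for row, k in zip(mat, keep) if k]
        for name, mat in assays.items()
    }


def drop_all_na_reads(mat: Matrix) -> Matrix:
    """Drop reads (columns) that are NA at every remaining position."""
    if not mat:
        return mat
    keep_cols = [
        j for j in range(len(mat[0]))
        if any(row[j] is not None for row in mat)
    ]
    return [[row[j] for j in keep_cols] for row in mat]


assays = {
    "s1": [[None, None], [0.9, None], [None, None]],
    "s2": [[None], [None], [0.2]],
}
keep = non_na_positions(assays)
filtered = {
    name: drop_all_na_reads(mat)
    for name, mat in subset_positions(assays, keep).items()
}
```

In this toy input the first position is NA in every read of both samples and is removed; the second read of `s1` then has no values left in the retained positions and is removed as well.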

csoneson commented 1 month ago

Some things to discuss:

Currently, keeping positions that match sequence context TAG also retains (e.g.) NNN. We shouldn't allow this: Ns in the query are fine, but hits where the position's own sequence context contains N should be removed.
Not totally sure if the current names for these filtering functions are the most intuitive.
Should the coverage filter be applied to the sum across samples or to individual samples (e.g. require at least one sample to pass the coverage threshold)?
For .removeAllNAPositions(), do we allow a summary assay (and if so, check for rowSums > 0 instead of the presence of any non-NA values)?
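
To make the coverage question concrete, here is a small sketch of the two alternatives: thresholding the coverage summed across samples versus requiring a minimum number of individual samples to pass. It is plain Python with made-up numbers, not the package's code, and the function names and the `min_nbr_samples` parameter are hypothetical.

```python
from typing import List

# Assumed layout: rows = positions, columns = samples, values = read coverage.
Coverage = List[List[int]]


def keep_by_total_coverage(cov: Coverage, min_cov: int) -> List[bool]:
    """Keep a position if its coverage summed across samples reaches min_cov."""
    return [sum(row) >= min_cov for row in cov]


def keep_by_sample_coverage(
    cov: Coverage, min_cov: int, min_nbr_samples: int = 1
) -> List[bool]:
    """Keep a position if at least min_nbr_samples samples each reach min_cov."""
    return [
        sum(c >= min_cov for c in row) >= min_nbr_samples
        for row in cov
    ]


cov = [
    [3, 3, 3],   # moderate coverage everywhere
    [10, 0, 0],  # covered in one sample only
    [1, 1, 1],   # low coverage everywhere
]
total_keep = keep_by_total_coverage(cov, min_cov=5)
sample_keep = keep_by_sample_coverage(cov, min_cov=5, min_nbr_samples=1)
```

The first position shows where the two rules differ: its summed coverage (9) passes a threshold of 5, but no single sample does. Both functions return a logical keep vector, so the result can be used for subsetting without matching on row names.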

mbstadler commented 1 month ago

Just my two cents on the questions above:

N-containing sequence contexts: As discussed, Ns in rowData(se)$sequence.context may come both from Ns in the reference and from Ns padded onto sequence contexts near the boundaries, where the context runs past the end of the reference sequence. I agree that we probably don't want to allow this.
The names are fine to me. An alternative would be to make them all similar, e.g. .filterPositionsBy..., but I am not sure that would be worth it, as the current names are clear and they are internal functions usually called via filterPositions().
Coverage filter: I don't really know what makes more sense. In order to get rid of false positive calls, the global coverage (as implemented now) will probably work well. By the way, I find the use of addReadsSummary very elegant :-)
remove NA positions: I would maybe rather implement such a function separately under a different name (.removeAllZeroPositions, and a new argument in filterPositions(..., assay.type.zero)), if we need it.
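
A tiny sketch of why a separate all-zero check may be needed for a summary assay: such an assay typically holds counts and no NAs, so a non-NA test would keep every position. The code is plain Python with hypothetical names and toy counts, not the package's implementation.

```python
from typing import List, Optional

# Assumed layout: rows = positions, columns = samples, values = summary counts.
counts: List[List[Optional[int]]] = [[0, 0], [2, 0], [0, 5]]

# Non-NA check: every position has a value, so nothing would be removed.
has_value = [any(v is not None for v in row) for row in counts]

# All-zero check: positions whose counts sum to zero are flagged for removal.
non_zero = [sum(v for v in row if v is not None) > 0 for row in counts]
```

Here `has_value` is true for all three positions, while `non_zero` drops the first one.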

I have a question regarding the coverage filter, at this line: I guess our assays always have row names, but is it safer/required to go via rownames, or would a logical keep also work and be more general?

csoneson commented 1 month ago

Thanks!

> I have a question regarding the coverage filter, at this line: I guess our assays always have row names, but is it safer/required to go via rownames, or would a logical keep also work and be more general?

You're right, we don't need rownames - the two objects (mat and se) should always have the same rows anyway.

> remove NA positions: I would maybe rather implement such a function separately under a different name (.removeAllZeroPositions, and a new argument in filterPositions(..., assay.type.zero)), if we need it.

Also makes sense to me

csoneson commented 1 month ago

Removed the rownames matching for subsetting in 2dac82e
Added min.nbr.samples argument to .filterPositionsByCoverage in 2dac82e
Only allow ambiguities in the pattern to be interpreted as wildcards in sequence context filtering in 58618db
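
As an illustration of the pattern-only wildcard rule, a minimal sketch in plain Python (not the package's R code): IUPAC ambiguity codes in the query pattern expand to sets of bases, while letters in a position's sequence context are taken literally. The function name `matches_context` is hypothetical, and the sketch assumes pattern and context have the same length. It also assumes that an N in the context never matches, not even an N in the pattern.

```python
# IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def matches_context(pattern: str, context: str) -> bool:
    """True if every context letter is a concrete base allowed by the pattern.

    Ambiguity codes are expanded only on the pattern side; an ambiguous
    letter in the context (such as N) is not a concrete base and never matches.
    """
    if len(pattern) != len(context):
        return False
    return all(
        c in IUPAC.get(p, "")
        for p, c in zip(pattern.upper(), context.upper())
    )


hits = {
    ("TAG", "TAG"): matches_context("TAG", "TAG"),
    ("NAG", "TAG"): matches_context("NAG", "TAG"),
    ("TAG", "NNN"): matches_context("TAG", "NNN"),
    ("TAG", "TNG"): matches_context("TAG", "TNG"),
}
```

With this rule, a query of `NAG` still matches a `TAG` context, but an `NNN` context is no longer retained when filtering for `TAG`.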

csoneson commented 2 weeks ago

Included in #22

fmicompbio / footprintR

Position filtering #16