LooseLab / readfish

CLI tool for flexible and fast adaptive sampling on ONT sequencers
https://looselab.github.io/readfish/
GNU General Public License v3.0

Filtering implementation epic #305

Open Adoni5 opened 9 months ago

Adoni5 commented 9 months ago

Feature being added - Filtering!

Filter reads and alignments based on a flexible "mini language" specified in the TOML. Would this go into the config TOML under the respective section for the component being filtered, i.e. `caller_settings`/`mapper_settings`? Applied in the tight `targets` loop.

```toml
# for base calling
filter = [
  "metadata.sequence_length > 0",
  "metadata.sequence_length < 1000",
]

# for alignment
filter = [
  "is_primary",
  "mapq > 40",
  "strand == -1",
]
```

This is parsed into magic enums and classes in `_filter.py`

```python
chunks = read_until_client.get_read_batch(...)
filtered_calls, calls = partition(basecall_filter, basecall(chunks))
filtered_aligns, aligns = partition(alignment_filter, align(calls))

for result in aligns:
    print("boo these alignments are trash")

for filtered_item in filtered_calls + filtered_aligns:
    print("Woohoo we have great success in filtering")
```
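The `partition` helper the pseudocode above assumes could be the standard itertools recipe, which splits an iterable into the items failing a predicate and the items passing it:

```python
from itertools import filterfalse, tee

def partition(pred, iterable):
    """Split iterable into (items failing pred, items passing pred)."""
    failed, passed = tee(iterable)
    return filterfalse(pred, failed), filter(pred, passed)
```

Note both returned values are lazy iterators, so joining them with `+` as in the sketch above would need a `list(...)` or `itertools.chain` in real code.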

I suppose we would store these on the respective plugin classes? `_PluginModule` or something, I forget

Ideas

Issues that need resolving/clarification

### Tasks
- [ ] #304 
- [ ] Describe mini-language
- [ ] Needs mad tests
mattloose commented 9 months ago

VERY footgunny - for example `sequence.metadata.length < 0` or `mapq < 0` and goodbye all reads. How can we safeguard against this? I suggest maybe starting only with PAF, and maybe adding some checks in validation.

For this, I suggest we define acceptable filters for each component and disallow footguns where we can think of them. We will need good documentation for each filter.
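A hedged sketch of the safeguard suggested here: validate the TOML filter strings against a per-component whitelist of allowed attributes at load time, before the sequencing loop runs. The attribute names below are illustrative, not a readfish spec:

```python
# Illustrative whitelist of filterable attributes per component.
ALLOWED_ATTRS = {
    "alignment": {"is_primary", "mapq", "strand"},
    "basecalling": {"metadata.sequence_length", "metadata.mean_qscore"},
}

def validate_filters(component, filters):
    """Return a list of error strings; empty means all filters are allowed."""
    errors = []
    for expr in filters:
        # Assumes the attribute name is the first whitespace-separated token
        attr = expr.split()[0].strip()
        if attr not in ALLOWED_ATTRS.get(component, set()):
            errors.append(f"unknown filter attribute {attr!r} for {component}")
    return errors
```

Running this during TOML validation would let readfish refuse (or warn about) a footgun filter before any reads are unblocked.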

mattloose commented 9 months ago

Where do we add the tracking of filtering status? Do we add it directly to the `Result` object, or straight into the plugin `basecall`/`map_reads` methods (which would mean writing separate implementations for new plugins), with the plugin returning two iterables - one of reads/`Result` instances that passed and one that failed?

I think the idea is that we have a flag on the results object that tracks if the result has failed a filter? We might need to think how we track the failed filters?
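One way the flag idea could look: a boolean on the result object plus the list of filter expressions that failed, so the log can report *why* a read was filtered. Field names here are assumptions for illustration, not the readfish `Result` class:

```python
from dataclasses import dataclass, field

@dataclass
class Result:
    read_id: str
    filtered: bool = False                       # has any filter failed?
    failed_filters: list = field(default_factory=list)  # which ones

    def mark_failed(self, expr):
        """Record that this result failed the given filter expression."""
        self.filtered = True
        self.failed_filters.append(expr)
```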

mattloose commented 9 months ago

What do we do with Results that fail validations - unblock or proceed?

- Fails basecalling filtering
- Fails alignment filtering

This is complex.

In my view each filter line should be independent - i.e if you have:

```toml
# for alignment
filter = [
  "is_primary",
  "mapq > 40",
  "strand == -1",
]
```

failing on any one of these conditions would result in the alignment being nulled. In this state the molecule would be treated as a no_map and sequencing would proceed as specified in the toml for a no_map read.
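A sketch of that per-line independence: an alignment must pass every filter, and failing any one nulls it so the read is handled as a no_map. Here `evaluate` is a hypothetical helper that applies one filter string to an alignment:

```python
def apply_filters(alignment, filters, evaluate):
    """Return the alignment if it passes every filter, else None (no_map)."""
    if all(evaluate(alignment, expr) for expr in filters):
        return alignment
    return None  # treated downstream as a no_map read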

One can also imagine the possibility of wanting to take a specific action when a filter is evaluated to false:

```toml
# for alignment
filter = [
   "mapq > 40:stop_receiving,unblock",
   "strand == -1",
]
```

In this case, a read which had a mapq > 40 would stop_receiving if true and unblock if false. If the strand was +1 it would be treated as no_map.

This is a fairly complex filtering language and would need to be well documented with some use cases.
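A hedged sketch of how that extended `condition:action_if_true,action_if_false` syntax could be parsed. The grammar and the default actions are assumptions drawn from this comment, not an implemented spec:

```python
def parse_action_filter(text):
    """Split a filter string into (condition, action_if_true, action_if_false).

    With no explicit actions, assume the defaults described above: a
    passing read continues and a failing one is treated as no_map.
    """
    if ":" in text:
        condition, actions = text.rsplit(":", 1)
        on_true, on_false = actions.split(",")
    else:
        condition, on_true, on_false = text, "keep", "no_map"
    return condition.strip(), on_true.strip(), on_false.strip()
```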

Note this would also address:

> Do we add a `fails_validation` to the toml/Conditions section, which defaults to proceed? This then relies on the exceeded max chunk behaviour

mattloose commented 9 months ago

How and where do we log this?

Perhaps this should go in the chunk log (as was) with a tag concept à la SAM files?
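An illustrative take on the SAM-style tag idea: append `TAG:TYPE:VALUE` fields to each chunk-log record so filtered reads carry the filters they failed. The tag name and record layout here are made up:

```python
def chunk_log_line(read_id, decision, failed_filters=()):
    """Build a tab-separated chunk-log record with an optional ft:Z: tag."""
    fields = [read_id, decision]
    if failed_filters:
        # Hypothetical "ft" tag holding the failed filter expressions
        fields.append("ft:Z:" + ";".join(failed_filters))
    return "\t".join(fields)
```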

mattloose commented 9 months ago

What will it look like/where will it be placed in the config?

I think this will have to be a property of a region? I can envision scenarios with barcoded samples where you want different filters depending on sample type.

mattloose commented 9 months ago

What will the API between targets and plug-ins look like?

In my view the available filters for a plugin should be defined, and whitelisted, by the plugin itself. Any filter passed to a plugin that is not explicitly defined by the plugin should be ignored, and on validation of a toml the user should be alerted to any such filters.

In this case, a plugin can be valid with no defined filters. Therefore filter availability is optional.
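A sketch of that plugin API: each plugin optionally declares the filters it supports, and anything unrecognised is set aside so validation can warn about it. Class and attribute names are hypothetical:

```python
class Plugin:
    # A plugin is valid with no declared filters; support is optional.
    supported_filters: frozenset = frozenset()

def split_filters(plugin, requested):
    """Split requested filter strings into (accepted, ignored) lists."""
    accepted, ignored = [], []
    for expr in requested:
        # Assumes the attribute name is the first whitespace-separated token
        attr = expr.split()[0]
        (accepted if attr in plugin.supported_filters else ignored).append(expr)
    return accepted, ignored
```

Validation could then log each entry in `ignored` rather than failing outright.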

mattloose commented 9 months ago

How will we ensure that targets doesn’t miss any data?

I don't quite understand this - I think targets will see every result regardless of filtering if we follow the above suggestions.

mattloose commented 9 months ago

I think any read which has been filtered out should be labelled as "filtered" in the chunks log. This is different to an `action_overridden` (or whatever). In the context of action overridden, a decision was made about a specific read which matches the toml but has been changed for a reason outside the toml spec (e.g. first read or duplex).

If a read has been filtered, the behaviour is specified by the toml so readfish has not overridden what the toml asked for - it has done as it should have done.

Adoni5 commented 9 months ago

After chatting with @alexomics on Friday - we came to a couple of conclusions, decided to write them up here after a few days to give me time to forget some of them.

1. The concept of filtering in its initial form will do two things:
   - If a base-called read doesn't meet a filtering threshold, it should be excluded from alignment. This means filtering can increase the speed of the loop.
   - If an alignment fails filtering, it should be excluded from decision making. This is NOT the same as overriding an action.
2. For alignments, the filtering step should be applied to each alignment in the case of multiple alignments. If ANY alignment passes, that alignment moves on.
3. We shall track the filtering status on the `Result` object, with a bool.
4. We perform the filtering outside of the plugin but inside the loop. The `targets.py` workflow would look something like:

       signal -> basecalling(signal) -> filter_basecalls(calls) -> align(filtered_calls) -> filter_alignments(alignments) -> decide(filtered_alignments)

5. No guardrails - too much work and rigidity. If you want to shoot yourself in the foot, go ahead.
6. No region-based filtering at this point.
7. Add a new `filters` table to the config TOML, with `alignment` and `basecalling` sections. This would be accessed as `config.filters` in the code base.
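A hypothetical sketch of what that new table could look like in the TOML - the key names here are assumptions, not a settled spec:

```toml
[filters]
basecalling = [
  "metadata.sequence_length > 200",
]
alignment = [
  "is_primary",
  "mapq > 40",
]
```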

Think that's it for now!

mattloose commented 9 months ago

I agree with most of the above - 1-3 are all good.

4 - I'm not convinced about where filtering should be performed? Could you expand on the reasoning for filtering in targets rather than the plugin?

With respect to 1-3 I broadly agree but I wonder if there should be some suggestions or guidance about foot guns?

Adoni5 commented 9 months ago

> With respect to 1-3 I broadly agree but I wonder if there should be some suggestions or guidance about foot guns?

I think we should make it clear that this is experimental and that no verification on the actual filtering conditions will be performed, and state clearly that if a user doesn't double check their filtering statements, they could ruin their experiment, much in the same way we disclaim using readfish at all!

If we filter in targets rather than the plugin, it means we don't have to:

One thing it would be very useful to do now is to come up with some Key, useful use cases that we can use as tests during development!

That way we don't lose sight of why we are doing this. Do you have any ideas @mattloose, @alexomics ?

I'm thinking:

But I can't think of any others off the top of my head

mattloose commented 9 months ago

> I think we should make it clear that this is experimental and that no verification on the actual filtering conditions will be performed, and state clearly that if a user doesn't double check their filtering statements, they could ruin their experiment, much in the same way we disclaim using readfish at all!

I do agree - but I thought we had considered adding a whitelist for acceptable filters? However - all this is definitely at the user's own risk!

I think your examples are good - I will have a think about more of them....

Adoni5 commented 9 months ago

The reason I don't want to add a whitelist of acceptable filters is that we will end up having to maintain and edit that list. If Nanopore changes the layout of the base-call results returned from dorado v4.3, filtering base-calls will work for v4.2 but not v4.3, and we will then have to manually edit our whitelist. But then do we have two sets, one for v4.3 and one for v4.2?

Whilst letting anyone filter anything does have the added danger of people making mistakes or doing something silly, I think that the amount of work that comes with a manual whitelist simply isn't worth it.

mattloose commented 9 months ago

Suggested filters:

mattloose commented 9 months ago

I think we should consider having filters at the region level - if there is a filter for a region, you ignore the global filter. If there isn't a region filter, you apply the global filter.

This would allow testing of filter types on a single flowcell through different regions - e.g. `mapq > 0`, `> 20`, `> 40`, all on one flowcell.
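The override proposed here can be sketched in a couple of lines: a region's own filters, when present, replace the global filters entirely; otherwise the global set applies. Names are illustrative:

```python
def filters_for_region(region_filters, global_filters):
    """A region's filters, if defined, win outright; else fall back to global."""
    return region_filters if region_filters is not None else global_filters
```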

github-actions[bot] commented 8 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.