Filtering out placements

Robaina / MetaTag

metaTag: functional and taxonomical annotation of metagenomes through phylogenetic tree placement

https://robaina.github.io/MetaTag/

Apache License 2.0

Filtering out placements #77

Closed Robaina closed 2 years ago

Robaina commented 2 years ago

We need a way to deal with queries that are placed in different clusters. Currently, we remove those sequences, but in some cases we lose valuable information, particularly when a query is placed in different clusters that all contain the same "good" function.

In these cases, we would like to preserve the placements. We could keep the placement with the highest LWR (likelihood weight ratio) and, as before, remove queries that were placed in clusters with different functions.
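A minimal sketch of the rule proposed above, not the MetaTag implementation. It assumes placements are available as `(query, cluster, lwr)` tuples and that each cluster carries a single function label; the function name `filter_placements` and both data structures are hypothetical.

```python
from collections import defaultdict


def filter_placements(placements, cluster_function):
    """Keep the highest-LWR placement of each query whose clusters agree on function.

    placements: iterable of (query, cluster, lwr) tuples (assumed layout).
    cluster_function: dict mapping cluster id -> function label (assumed layout).
    Returns a dict mapping each kept query to its best (cluster, lwr) pair.
    """
    hits_by_query = defaultdict(list)
    for query, cluster, lwr in placements:
        hits_by_query[query].append((cluster, lwr))

    kept = {}
    for query, hits in hits_by_query.items():
        functions = {cluster_function[cluster] for cluster, _ in hits}
        if len(functions) > 1:
            # The clusters disagree on function: drop the query, as before.
            continue
        # All clusters share one function: keep the placement with the highest LWR.
        kept[query] = max(hits, key=lambda hit: hit[1])
    return kept


example_placements = [
    ("q1", "A", 0.6), ("q1", "B", 0.3),  # two clusters, same function
    ("q2", "A", 0.5), ("q2", "C", 0.4),  # two clusters, different functions
    ("q3", "C", 0.9),                    # a single cluster
]
example_functions = {"A": "target", "B": "target", "C": "paralog"}
kept = filter_placements(example_placements, example_functions)
```

With this toy input, `q1` is kept with its cluster `A` placement, `q2` is dropped, and `q3` is untouched.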

micronuria commented 2 years ago

Hi. I need to work out how to filter placements using the LWR values, and this issue is related, so I am going to think about an algorithm for a function that handles the LWRs, taking into account all the issues related to them (and see which Gappa functions and options we can use).

So, the problems we need to solve that I have identified so far are:

1. Dealing with queries placed in different clusters that we do not want to remove (what Robaina just said).
2. Dealing with queries with low maximum LWRs.

Any other problem that you remember?
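For the second problem, one possible approach is a plain threshold on each query's best LWR. The sketch below is only an illustration in pure Python: the tuple layout, the function name `drop_low_lwr_queries`, and the default cutoff of 0.8 are all assumptions, and it does not use Gappa.

```python
def drop_low_lwr_queries(placements, min_lwr=0.8):
    """Remove every query whose best placement has an LWR below min_lwr.

    placements: list of (query, cluster, lwr) tuples (assumed layout).
    min_lwr: hypothetical cutoff; a sensible value would need to be tested.
    Returns the placements of the surviving queries, in their original order.
    """
    best_lwr = {}
    for query, _, lwr in placements:
        best_lwr[query] = max(lwr, best_lwr.get(query, 0.0))
    return [p for p in placements if best_lwr[p[0]] >= min_lwr]


lwr_example = [
    ("q1", "A", 0.95), ("q1", "B", 0.05),  # best LWR 0.95: kept
    ("q2", "A", 0.50), ("q2", "C", 0.45),  # best LWR 0.50: dropped
    ("q3", "C", 0.80),                     # best LWR equals the cutoff: kept
]
confident = drop_low_lwr_queries(lwr_example, min_lwr=0.8)
```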

micronuria commented 2 years ago

And by "you" I mean the full team... :stuck_out_tongue_closed_eyes:

Robaina commented 2 years ago

For the record, the current filter for placements in more than one cluster is this one:

https://github.com/Robaina/TRAITS/blob/9b4f8d4580a10a1f8e68b42f758ae03e1689f270/code/phyloplacement/placement.py#L618-L628

It removes all placements that hit more than one cluster.
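To make that behaviour concrete, here is a rough illustration. It is not the code at the linked lines: the function name and the `query -> list of cluster ids` mapping are assumptions made for the example.

```python
def remove_multi_cluster_queries(query_clusters):
    """Drop every query whose placements fall in more than one cluster.

    query_clusters: dict mapping query id -> list of cluster ids hit by its
    placements (assumed layout).
    Returns a dict mapping each surviving query to its single cluster.
    """
    return {
        query: clusters[0]
        for query, clusters in query_clusters.items()
        if len(set(clusters)) == 1
    }


hits = {
    "q1": ["A", "A"],  # both placements in the same cluster: kept
    "q2": ["A", "B"],  # placements in two clusters: removed
    "q3": ["C"],       # single placement: kept
}
single_cluster = remove_multi_cluster_queries(hits)
```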

Robaina commented 2 years ago

And here, some thoughts:

1. From the functional labeling point of view, there are only two clusters: one cluster containing ref seqs with the correct function and another cluster containing ref seqs with another function (e.g., paralogs). Hence, regarding function assignment, only queries that are placed in both these clusters are problematic.
2. From the taxonomical point of view: we don't use clusters to define taxonomy; instead, we rely on tree topology and the taxopaths of the reference sequences. However, if a query hits more than one cluster, it is likely that its placements will be assigned different taxonomies.
3. Do we really want to keep placed queries that hit more than one cluster? This means that these placements have high uncertainty. We could check the fraction of queries that hit more than one cluster for each case and make a decision based on this value (in my case studies the fraction was small).
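The check suggested in the last point could look roughly like this. It is a sketch under the assumption that we have, for each query, the list of clusters its placements hit; the function name and data layout are hypothetical.

```python
def multi_cluster_fraction(query_clusters):
    """Return the fraction of queries placed in more than one cluster.

    query_clusters: dict mapping query id -> list of cluster ids hit by its
    placements (assumed layout).
    """
    if not query_clusters:
        return 0.0
    n_multi = sum(
        1 for clusters in query_clusters.values() if len(set(clusters)) > 1
    )
    return n_multi / len(query_clusters)


cluster_hits = {
    "q1": ["A"],
    "q2": ["A", "B"],  # the only query hitting more than one cluster
    "q3": ["C", "C"],
    "q4": ["B"],
}
fraction = multi_cluster_fraction(cluster_hits)
```

If the fraction stays small across case studies, simply discarding those queries costs little; a large fraction would suggest revisiting the cluster definitions or the reference tree.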

JuanRivasSantisteban commented 2 years ago

Regarding your third thought, I would say that if a query is ambiguous enough that it resembles two topologically distinct clusters in your reference tree, and we have no argument other than the topological one for assigning it to a cluster, then we should discard it.

But it is also possible that the query is not at fault and that this is a consequence of 1) clusters being defined too close to the outermost nodes, i.e., making too many clusters, or 2) there not being enough sequence diversity in your reference tree.

Robaina commented 2 years ago

Alright, so we keep removing queries placed in different clusters for now.