DerKevinRiehl / transposon_classifier_rfsb

Transposon classification tools RFSB, part of TransposonUltimate
GNU General Public License v3.0

Best way to filter unclassified repeats #1

Closed chrisjackson-pellicle closed 2 years ago

chrisjackson-pellicle commented 3 years ago

Hi Kevin - thanks for making this software.

My understanding is that the RFSB classifier will bin every sequence in a provided fasta file into one of the categories in the classification taxonomy you've chosen. That is, it's not possible for a sequence to be 'unclassified'. Is that the case?

If so, does it make sense to use the probabilities in the outputPredictionFile to filter for 'unclassified' sequences? For example, if the probability was < 0.5 for both Class I and Class II (i.e. columns 3 and 11) for a given sequence, to me that intuitively means that the classification confidence is low, and I'd prefer to leave it as 'unclassified'. I'm not sure how the probabilities are calculated, though, so I'm not sure if they can be used in this fashion.

Any advice much appreciated!

Cheers,

Chris

DerKevinRiehl commented 3 years ago

Good morning Chris, thank you very much for your interest in RFSB and for your very good question.

Your understanding of RFSB is correct. RFSB classifies nucleotide sequences that are assumed to be transposons into the hierarchical taxonomic scheme (that's the big assumption). We have also worked on the question of whether a nucleotide sequence (or an excerpt of one) is a transposon at all, but this is a more challenging problem and is not included in RFSB as currently published.

RFSB is based on a "local classifier per node" approach that we call the "binary model structure": at each class of the taxonomy there is a binary classifier answering the question of whether the given sequence belongs to that class or not. At the beginning, RFSB asks the classifiers for "1" and "2", compares their probabilities and chooses the more probable one. Afterwards RFSB asks the subclassifiers (e.g. if "1" was more probable) "1/1", "1/2" and so on. So, exactly as you interpret it, the numbers represent the probabilities.
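
The descent described above can be sketched in a few lines of Python. This is only an illustration of the idea, not RFSB's actual code: the `CHILDREN` map, the function name `greedy_descent`, and the probability values are all hypothetical.

```python
# Hypothetical taxonomy: each node maps to its child classes.
# Labels follow the "1", "1/1", "1/2" naming used in this thread.
CHILDREN = {
    "root": ["1", "2"],
    "1": ["1/1", "1/2"],
    "2": ["2/1", "2/2"],
}

def greedy_descent(probs, children=CHILDREN, start="root"):
    """Walk down the taxonomy, always taking the most probable child."""
    path = []
    node = start
    while node in children:
        # Each child has its own binary classifier; compare their outputs.
        best = max(children[node], key=lambda c: probs.get(c, 0.0))
        path.append(best)
        node = best
    return path

# Made-up per-node probabilities for one sequence.
example = {"1": 0.8, "2": 0.3, "1/1": 0.2, "1/2": 0.6, "2/1": 0.9, "2/2": 0.1}
print(greedy_descent(example))  # ['1', '1/2']
```

Because every node has its own binary classifier, the probabilities at one stage need not sum to 1, and a high value in a branch that was not chosen (here "2/1") never influences the result.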

Internally, we use a threshold of 0.0, meaning we simply choose the most probable of the classifiers at a given decision stage. However, as you suggest, a threshold could be a good way to add this functionality to RFSB. Take your example of Class I and II (the columns you mention are the right ones), and let's say both already have a very low probability (around 10% each). If we now set the threshold to 0.5, then neither of them would be activated and the sequence would remain unclassified.
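
As a rough sketch of that first-stage rule (the function name `first_stage_label` and the way the two probabilities are passed in are assumptions for illustration, not part of RFSB), post-filtering could look like this in Python:

```python
def first_stage_label(p_class1, p_class2, threshold=0.5):
    """Return "1", "2", or "unclassified" from the two first-stage probabilities."""
    # Pick the more probable of Class I and Class II (ties go to "1").
    best_label, best_p = max((("1", p_class1), ("2", p_class2)), key=lambda t: t[1])
    # Accept the choice only if it reaches the threshold.
    return best_label if best_p >= threshold else "unclassified"

print(first_stage_label(0.10, 0.12, threshold=0.5))  # unclassified
print(first_stage_label(0.10, 0.12, threshold=0.0))  # 2
```

With `threshold=0.0` the function always returns a class, which mirrors the current behaviour of simply taking the most probable classifier.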

This approach, however, raises further questions. What if the sequence passes at stage 1 but fails the threshold at stage 2 (level 1/1 or 1/2)? How should we decide then? Should we use a separate threshold at each stage? I think this example makes clear that using a threshold adds complexity.
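
One possible answer, sketched here in Python purely as an illustration, is to keep the deepest label that still passes and stop there. The policy itself, the `CHILDREN` map, the function name `descend_with_thresholds`, and the numbers are all assumptions; nothing here has been benchmarked or taken from RFSB.

```python
# Hypothetical taxonomy, using the labels from this thread.
CHILDREN = {"root": ["1", "2"], "1": ["1/1", "1/2"], "2": ["2/1", "2/2"]}

def descend_with_thresholds(probs, thresholds, children=CHILDREN):
    """Descend while the best child meets the threshold for its stage.

    thresholds is a non-empty list; thresholds[d] applies to decision
    stage d (0 = Class I vs. Class II). Deeper stages reuse the last value.
    Returns the accepted path; an empty list means "unclassified".
    """
    path, node, depth = [], "root", 0
    while node in children:
        best = max(children[node], key=lambda c: probs.get(c, 0.0))
        limit = thresholds[min(depth, len(thresholds) - 1)]
        if probs.get(best, 0.0) < limit:
            break  # stop here and keep the coarser label
        path.append(best)
        node, depth = best, depth + 1
    return path

probs = {"1": 0.8, "2": 0.3, "1/1": 0.2, "1/2": 0.4}
print(descend_with_thresholds(probs, [0.5, 0.5]))  # ['1']
print(descend_with_thresholds(probs, [0.5, 0.3]))  # ['1', '1/2']
print(descend_with_thresholds(probs, [0.9]))       # []
```

Under this policy a sequence that passes stage 1 but fails stage 2 is reported at the coarser level ("1") rather than being forced into a subclass or discarded entirely.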

Another challenge is obtaining a set of sequences that are definitely not transposons, which would be needed for a classifier that decides whether a given sequence contains a transposon at all. This is quite a challenge in itself: imagine only part of a transposon lies inside the given sequence (let's say the left 30%). How would you decide then?

I think setting a threshold at the first stage is a valid criterion for making sure RFSB is confident in its choice. However, I have not benchmarked it and cannot tell you which threshold would be most accurate. Do you think it would be a nice feature to let the user set the threshold as a parameter for the classification (at least for stage 1 decisions) in mode 1 (classify)?

What do you think about this :-) ? Best regards, Kevin