BackofenLab / RNAProt

Modelling RBP binding preferences to predict RBP binding sites
MIT License

Interpretation of prediction results #7

Open dbogdano opened 11 months ago

dbogdano commented 11 months ago

Hello,

Thank you for creating this tool, it's been very easy to get started with by following the documentation. Would you be able to write a bit on how best to interpret and use the prediction results? In particular, how is the window score p-value calculated, and should it be used to filter for significant predictions?

I'm currently running the RNAProt prediction method on some intron sequences using a model trained on the ENCODE RBFOX2 CLIP-seq data. Looking at the resulting .bed and .tsv files, there are a number of identified peaks containing the RBFOX sequence motif, which also fall under other annotations used in the training (intron/exon, secondary structure, etc.), yet only a small fraction of these have a window score p-value below 0.1.

Let me know if I can provide any more information about my use case.

Thanks again, Derek

michauhl commented 11 months ago

Hi Derek,

thanks for your interest in our tool!

Regarding the p-values: these are calculated based on the empirical score distribution found in the positive training set (simply using the resulting ECDF). So a p-value of 0.05 would translate into a site having a score which is equal to or higher than 95% of the positive training sites. Assuming that higher model score == more authentic binding site, filtering by small p-values would be a way to get more authentic sites. I'm guessing you used rnaprot predict with --mode 2 (i.e. sliding window predictions). Here by default, only windows (as well as the peak regions inside these windows) are reported which have a score >= 50% of positive training sites. So the p-value can be a bit misleading here, as sites with p-values > 0.1 can still be confident sites (as all the reported windows still score >= 50% of positive training sites).
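To make the p-value description above concrete, here is a minimal sketch of an empirical p-value taken from the score distribution of the positive training set. It is not RNAProt's code: the function name `empirical_p_value`, the made-up scores, and the choice of `>=` for ties are all assumptions for illustration. The reporting threshold is modelled as the median positive score, following the "score >= 50% of positive training sites" description.

```python
from statistics import median


def empirical_p_value(score, positive_scores):
    """Fraction of positive training scores that are >= the given score.

    Hypothetical helper: a small value means the score is at least as high
    as most of the positive training sites.
    """
    n_at_or_above = sum(1 for s in positive_scores if s >= score)
    return n_at_or_above / len(positive_scores)


# Made-up model scores standing in for the positive training set.
positive_scores = [-0.5, 0.1, 0.4, 0.8, 1.2, 1.5, 1.9, 2.3, 2.8, 3.5]

# Assumed default reporting threshold: the median positive training score.
threshold = median(positive_scores)

# Made-up window scores from a sliding window prediction.
window_scores = [0.2, 1.6, 3.0]

results = []
for s in window_scores:
    p = empirical_p_value(s, positive_scores)
    reported = s >= threshold
    results.append((s, p, reported))
    print(f"score={s:.1f}  p-value={p:.2f}  reported={reported}")
```

In this toy data the window with score 1.6 has a p-value of 0.4, yet it is still reported because it scores above the median of the positive training scores. This is the situation described above: a p-value > 0.1 does not by itself mean the site is a poor candidate.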

As for your predictions, I’m not quite sure yet about the details. Did you use only sequence information to train the model, or additional features too? And what inputs did you use for predictions? Sequences (only works with sequence model), or genomic or transcript regions? The trained model basically determines what type of input to use for predictions, but I understand that this can be confusing (too much going on) :)

Best, Michael

dbogdano commented 11 months ago

Hi Michael,

Thanks for the quick response, your clarification of the p-value certainly helps.

I've been trying different combinations of features for the training model. Generally the more features I add, the higher the number of significant predictions, which makes sense. After running rnaprot predict, when looking at the top window profile plots, there are definitely some windows containing visually strong motifs that aren't passing standard p-value cutoffs. I suppose inclusion of these as significant just requires a more precise set of training features and perhaps better filtering of the training data.

Currently I'm working with bed files for both the training data and prediction data, so I'm using the Genomic features input-type for prediction. The input data for prediction are bed files containing ~100-500 intron regions flanking differentially spliced exons specific to different cell types. Ideally I'd like to predict the sets of RBPs responsible for the cell type specific splicing, based on the intronic sequence content. One feature that I think may help here would be annotations of intron/exon borders, or a within-intron score for proximity to alternatively spliced exons. I assume the transcript input-type feature tra-borders is something similar to this? If so, in your experience how does the tra-borders feature help with predictions?

Thanks again for the help, Derek

michauhl commented 11 months ago

Hi Derek,

yes, in general the training data presented to the model can make a big difference. As for the positives, we usually don't have many options here. But the selection of negatives definitely influences the model's abilities as well. E.g. suppose you add exon-intron annotations and your RBP binds mostly to exonic regions, whereas the negatives (which are by default randomly sampled from gene bodies containing RBPs) are mainly from intronic regions. This will likely result in the model putting too much emphasis on exon binding, and not enough on the sequence specifics of the protein binding. Such a model would not be very useful, especially if predicting only on exonic sequences. This in general is an interesting problem (how to best select negatives), but for RNAProt we did not really delve into this. RNAProt offers some options for negatives selection, e.g. these two:

  --mask-bed str        Additional BED regions file (6-column format) for
                        masking negatives (e.g. all positive RBP CLIP sites)
  --neg-in str          Negative genomic or transcript sites in BED (6-column
                        format) or FASTA format (unique IDs required). Use
                        with --in BED/FASTA. If not set, negatives are
                        generated by shuffling --in sequences (if --in FASTA)
                        or random selection of genomic or transcript sites (if
                        --in BED)

but we did not investigate this any further. Naturally, any bias (either in positives or negatives) introduced by additional genomic features could be picked up by the model. E.g. if the positives are close to exon borders and the feature is enabled, the model might put too much focus on the border and less on the sequence specifics if the negatives do not contain these borders.
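One way to spot this kind of feature bias before training is to compare how often the feature occurs in positives versus negatives. The following is only a rough sketch under assumed inputs: each site is represented by a hypothetical per-position label string ("E" for exon, "I" for intron), and `exon_fraction` is an illustrative helper, not part of RNAProt.

```python
def exon_fraction(label_strings):
    """Fraction of positions labelled 'E' across all sites in the set."""
    total = sum(len(s) for s in label_strings)
    exonic = sum(s.count("E") for s in label_strings)
    return exonic / total if total else 0.0


# Made-up region labels, one string per site (hypothetical encoding).
positives = ["EEEEEEII", "EEEEEEEE", "IIEEEEEE"]
negatives = ["IIIIIIII", "IIIIIIEE", "IIIIIIII"]

pos_frac = exon_fraction(positives)
neg_frac = exon_fraction(negatives)

print(f"exonic fraction in positives: {pos_frac:.2f}")
print(f"exonic fraction in negatives: {neg_frac:.2f}")
print(f"difference: {pos_frac - neg_frac:.2f}")
```

A large difference, as in this toy data, suggests the model could separate the two sets from the exon-intron labels alone. In that case, supplying more comparable negatives (e.g. via --neg-in) may be worth trying.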

Regarding your prediction task: So (if I understood correctly) you want to identify which RBPs (splicing factors) in principle bind to these intronic regions, and then check which ones are expressed in your condition to identify candidates. I assume the sequence-only predictions were not helpful?

Regarding the additional features: The --tra-borders feature only works with transcript regions as input. Enabling this means labelling the exon borders on the spliced transcript sequence. The --eia-ib option together with --eib, on the other hand, would label the intron borders (so two labels, one for the intron 5' end, one for the 3' end). This could help with predictions, assuming that the distance or positioning (5' or 3' end of the intron) of the RBP is somehow informative in the training data. But as mentioned above, the choice of negatives again will likely influence whether something useful is learned or if too much emphasis is put on this additional feature (and not the sequence specifics). Also, it might help to supply the transcripts used for --eib labelling (through --tr-list), since which transcripts are present differs between cell types, and by default the most prominent transcript isoforms for each gene are used (not ideal, but a starting point).
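As a toy illustration of the intron border labelling idea (two labels, one per intron end), here is a small sketch. The label characters "F" (5' end), "T" (3' end) and "I" (intron interior), the border width, and the function `label_intron` are all assumptions made for this example, not necessarily what RNAProt uses internally.

```python
def label_intron(length, border=3):
    """Label an intron of the given length in 5'->3' orientation.

    The first `border` positions get 'F', the last `border` positions get 'T',
    everything in between gets 'I'. For very short introns the border width
    is capped so that the two ends never overlap.
    """
    border = min(border, length // 2)
    return "F" * border + "I" * (length - 2 * border) + "T" * border


print(label_intron(10))  # FFFIIIITTT
print(label_intron(4))   # FFTT
```

With labels like these, a site near the 5' end of an intron and a site near the 3' end look different to the model even if their sequences are similar, which is why the composition of the negatives matters so much here.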

Hope this helps, let me know if anything else is unclear.

Best, Michael :)