Open UCSFJL opened 2 months ago
Hi @UCSFJL,
Sorry for the confusion, it's actually meant to be simple. As you alluded to, this file is only for generating an annotated dataframe after running RNA-SPRINT so that you can easily upload it to Morpheus for visualization. It has nothing to do with the inference step, and there is no requirement on the number or type of columns. You can add whichever annotations you deem important (or whatever you'd like to visualize). The condition and burden columns were just meant as an illustration that you can use both categorical and continuous data.
I am pretty sure you can just do something like below and it should work just fine, but let me know if you run into any problem and I can further assist.
sample dummy_col
sample1 dummy
sample2 dummy
sample3 dummy
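If it helps, a table like the one above can be written with a few lines of Python. This is only a minimal sketch: the tab delimiter, the output file name `meta.txt`, and the helper name `write_meta` are assumptions for illustration, not something SPRINT prescribes.

```python
import csv


def write_meta(path, samples, annotations):
    """Write a sample annotation table: one row per sample, one column per annotation.

    `annotations` maps a column name to a list of values, one per sample.
    """
    columns = list(annotations)
    for name in columns:
        if len(annotations[name]) != len(samples):
            raise ValueError(f"column {name!r} needs one value per sample")
    with open(path, "w", newline="") as handle:
        # Assumed format: tab-delimited text with a header row.
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["sample"] + columns)
        for i, sample in enumerate(samples):
            writer.writerow([sample] + [annotations[name][i] for name in columns])


# Hypothetical usage: three samples with a single placeholder annotation.
write_meta(
    "meta.txt",
    ["sample1", "sample2", "sample3"],
    {"dummy_col": ["dummy", "dummy", "dummy"]},
)
```

More columns (categorical or continuous) can be added by putting more entries in the `annotations` dictionary.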
Best, Frank
Thank you very much for your prompt response! I completed the analysis as you described and the result looks promising! I am having a little difficulty interpreting the output numbers. I noticed that the MDT-predicted RBP activity was given a score between 0 and 1, in contrast to the benchmark activity, where higher RBP activity corresponds to a higher score. How was this data normalized? Is the score linearly normalized, or is it a probability or confidence? For example, does a score of 0.8 indicate twofold activity compared to a score of 0.4? I would really appreciate some insight into how I should interpret the numbers for downstream analysis, and thank you very much for your help!
Hi @UCSFJL,
Sorry, please correct me if I misunderstood, but I am not sure what "the MDT predicted RBP activity was given a score between 0-1" means. For instance, see the following screenshot: the estimate should not be bounded by 0-1, right?
So we are strictly following the MDT implementation from the decoupler-py package; the source code for MDT is here (https://github.com/saezlab/decoupler-py/blob/main/decoupler/method_mdt.py). The idea is that the prior network serves as the X and the PSI value for each splicing event is the Y. After fitting a random forest regression model, the feature importance for each predictor (i.e., each splicing factor) can be inferred from the impurity-based importance measurement (https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html#feature-importance-based-on-mean-decrease-in-impurity), which serves as the "activity" in the final output.
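To make the idea concrete, here is a minimal sketch of fitting a random forest on a prior matrix and reading off impurity-based importances with scikit-learn. The data are made up, the hyperparameters are arbitrary, and this is not the decoupler-py code itself, so please treat it only as an illustration of the mechanism.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Toy prior network X: rows are splicing events, columns are splicing factors.
n_events, n_factors = 300, 5
X = rng.integers(0, 2, size=(n_events, n_factors)).astype(float)

# Toy PSI values Y for one sample, driven mostly by factor 0 (made-up data).
y = 0.6 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(0.0, 0.05, size=n_events)

# Hyperparameters here are arbitrary choices for the sketch.
model = RandomForestRegressor(n_estimators=100, min_samples_leaf=5, random_state=42)
model.fit(X, y)

# Mean-decrease-in-impurity importance, one value per splicing factor.
importances = model.feature_importances_
for factor, value in enumerate(importances):
    print(f"factor_{factor}: {value:.3f}")
```

In scikit-learn, `feature_importances_` values are non-negative and normalized to sum to 1 within a single fit. Whether the values in a given SPRINT output file stay on that scale is not something this sketch establishes.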
So it is a bit hard to say whether they are linearly normalized or not; it is more of a heuristic. In terms of confidence, I don't think any scikit-learn function will provide one (which is not to say we cannot derive one).
The benchmark I presented in the paper is largely based on rank, so I would interpret the values in a more qualitative way: if a splicing factor's activity is consistently higher in tumor than in normal, then I'd believe it has a role in tumorigenesis. I think the absolute values should also be informative, but since I didn't conduct any benchmark on that, I don't have scientific evidence to suggest whether 0.8 is exactly twofold more active than 0.4.
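As a rough illustration of that rank-based, qualitative reading, the sketch below ranks splicing factors within each sample and asks whether a factor ranks better in every tumor sample than in every normal sample. The sample names, factor names (`SF_A` and so on), values, and the strict "every sample" criterion are all hypothetical choices for the example, not the benchmark used in the paper.

```python
def rank_within_sample(scores):
    """Map factor -> rank within one sample (1 = highest activity)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {factor: position + 1 for position, factor in enumerate(ordered)}


# Hypothetical activity estimates: sample -> {splicing factor -> activity}.
activity = {
    "tumor1": {"SF_A": 0.8, "SF_B": 0.3, "SF_C": 0.1},
    "tumor2": {"SF_A": 0.7, "SF_B": 0.2, "SF_C": 0.4},
    "normal1": {"SF_A": 0.2, "SF_B": 0.5, "SF_C": 0.3},
    "normal2": {"SF_A": 0.1, "SF_B": 0.6, "SF_C": 0.4},
}
tumor = ["tumor1", "tumor2"]
normal = ["normal1", "normal2"]


def consistently_higher(factor, high, low):
    """True if `factor` ranks better in every `high` sample than in every `low` sample."""
    ranks = {s: rank_within_sample(activity[s])[factor] for s in high + low}
    # A smaller rank number means higher activity within the sample.
    return max(ranks[s] for s in high) < min(ranks[s] for s in low)


for factor in ["SF_A", "SF_B", "SF_C"]:
    print(factor, consistently_higher(factor, tumor, normal))
```

A softer criterion (for example, comparing median ranks between groups) would be a reasonable alternative when there are many samples.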
Hoping this helps a bit, Frank
That is very interesting, and thank you for the insightful comments regarding the output values. For some reason, the values in my SPRINT outputs are consistently between 0 and 1, and the rows and columns are reversed compared to your example (in my case, RBP:event location as columns and samples as rows). I am not sure whether these differences will influence the analysis, but I would love to eliminate them. Due to issues with downloading AltAnalyze on my institute's HPC, I conducted the analysis with the AltAnalyze GUI downloaded from the AltAnalyze website, with the following snapshot showing the event annotation file in the output. Does it appear similar to your event annotation files? I then used the GUI AltAnalyze output to run the SPRINT analysis on the HPC. I would like to know which step of the analysis went wrong and led to the different values. If possible, do you have a sample event_annotation file I can use to test the setup of my SPRINT pipeline? I also noticed that all values in my prior file are 0, 0.49, or 0.51 (second picture). Please let me know if this is what the prior file should look like, in case there was data corruption during my download. Thank you very much for your help!
Hi @UCSFJL,
The file you showed is the prior network, of shape n_event × n_rbp, but there should be another file named mdt_estimate or mdt_estimate_morpheus, of shape n_rbp × n_sample. I think if you open that one, it will contain the predicted RBP activity in each sample.
Regarding AltAnalyze, I didn't spot any concern in the EventAnnotation file; I think the output is the same between the Docker version and the GUI. Tagging my advisor @nsalomonis in case he has any additional insights.
So in a nutshell, see if you can find the right estimation output, and we can go from there. Happy to share an EventAnnotation file to compare as well (https://www.dropbox.com/scl/fi/gnfyacbso204zcyajfdbb/Hs_RNASeq_top_alt_junctions-PSI_EventAnnotation.txt?rlkey=90vrkidawu7qc7k4dbnzc23tc&st=0ls6pwed&dl=0).
Thank you, Frank
Thank you very much for your pioneering work! I am especially interested in the SPRINT pipeline and I have a question regarding creating the meta file as described in the SPRINT file:
I uploaded RNA-seq BAM data from 3 patients into AltAnalyze and received the EventAnnotation file. However, to run SPRINT, I need to include metadata containing annotations such as condition and burden. How do I create the meta file from the EventAnnotation file or other AltAnalyze outputs? Furthermore, since the example includes both tumor and control annotations, do I need to put RNA-seq files from both tumor and control samples into AltAnalyze? And what does the burden number mean in the example? Would that be the tumor purity score? Please let me know whether this step is optional and only for visualization, what some other annotations might be, and whether you need more information from me. Thank you very much for your help!