frankligy / SNAF

Splicing Neo Antigen Finder (SNAF) is an easy-to-use Python package to identify splicing-derived tumor neoantigens from RNA sequencing data. It further leverages both deep learning and hierarchical Bayesian models to prioritize candidates for experimental validation.
MIT License

Creating meta file within SPRINT pipeline #43

Open UCSFJL opened 2 months ago

UCSFJL commented 2 months ago

Thank you very much for your pioneering work! I am especially interested in the SPRINT pipeline, and I have a question about creating the meta file as described in the SPRINT file:

Screenshot 2024-06-11 at 5 05 41 PM

I uploaded RNA-seq BAM data from 3 patients into AltAnalyze and received the EventAnnotation file. However, to run SPRINT I need to include metadata containing annotations such as condition and burden. How do I create the meta file from EventAnnotation or other AltAnalyze outputs? Furthermore, since the example includes both tumor and control annotations, do I need to run both tumor and control RNA-seq files through AltAnalyze? And what does the burden number in the example mean? Would that be the tumor purity score? Please let me know whether this step is optional and only for visualization, what other annotations might be useful, and whether you need more information from me. Thank you very much for your help!

frankligy commented 2 months ago

Hi @UCSFJL,

Sorry for the confusion; it's actually meant to be simple. First, as you alluded to, this is merely for generating an annotated dataframe after running RNA-SPRINT so that you can easily upload it to Morpheus for visualization. It has nothing to do with the inference step, and there's no requirement on the number or type of columns. You can add whichever annotations you deem important (or whatever you'd like to visualize). The condition and burden columns were just meant as an illustration that both categorical and continuous data work.

I am pretty sure you can just do something like below and it should work just fine, but let me know if you run into any problem and I can further assist.

sample     dummy_col
sample1    dummy
sample2    dummy
sample3    dummy
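
A minimal sketch of writing such a meta file with pandas. The sample names, column names, values, and output filename are all placeholders, and the tab-delimited format is an assumption rather than a documented SNAF requirement.

```python
import pandas as pd

# Hypothetical sample names; use the sample IDs from your own run.
samples = ["sample1", "sample2", "sample3"]

meta = pd.DataFrame(
    {
        "sample": samples,
        # Categorical annotation (placeholder values).
        "condition": ["tumor", "tumor", "control"],
        # Continuous annotation (placeholder values).
        "burden": [12.0, 30.5, 1.2],
    }
)

# Tab-delimited output is an assumption; adjust to what your workflow expects.
meta.to_csv("meta.txt", sep="\t", index=False)
```

Any number of extra columns can be added the same way, one list of values per annotation.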

Best, Frank

UCSFJL commented 2 months ago

Thank you very much for your prompt response! I completed the analysis as you described and the results look promising! I am having a little difficulty interpreting the output numbers. I noticed that the MDT predicted RBP activity was given a score between 0-1, in contrast to their benchmark activity, where higher RBP activity corresponds to a higher score. How was this data normalized? Is the score linearly normalized, or is it a probability or confidence? For example, does a score of 0.8 indicate twofold activity compared to a score of 0.4? I would really appreciate some insight into how I should interpret the numbers for downstream analysis. Thank you very much for your help!

frankligy commented 2 months ago

Hi @UCSFJL,

Sorry, please correct me if I misunderstood, but I am not sure what "the MDT predicted RBP activity was given a score between 0-1" means. For instance, in the following screenshot, the estimate should not be bound by 0-1, right?

Screenshot 2024-06-19 at 12 59 06 PM

So we are strictly following the MDT implementation from the decoupler-py package; the source code for MDT is here (https://github.com/saezlab/decoupler-py/blob/main/decoupler/method_mdt.py). The idea is that the prior network serves as the X and the PSI value for each splicing event is the Y. After fitting a random forest regression model, the feature importance of each predictor (aka splicing factor) can be inferred from the impurity-based importance measurement (https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html#feature-importance-based-on-mean-decrease-in-impurity), which serves as the "activity" in the final output.
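
The idea can be sketched with scikit-learn directly. This is an illustration on synthetic data, not the decoupler-py or SNAF code; the matrix sizes, the toy PSI values, and the forest settings are all assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Toy prior network: rows are splicing events, columns are splicing factors.
n_event, n_rbp = 200, 5
prior = rng.random((n_event, n_rbp))

# Toy PSI values for one sample, driven mostly by factor 0 (synthetic data).
psi = 0.8 * prior[:, 0] + 0.1 * rng.random(n_event)

# Fit a random forest with the prior as X and PSI as Y.
forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(prior, psi)

# Impurity-based importance, one value per splicing factor.
activity = forest.feature_importances_
print(activity)
```

In this sketch the importances are non-negative and sum to 1, because scikit-learn normalizes `feature_importances_` within each fitted model. How that relates to the values in a given SNAF output file is not something this example establishes.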

So it is a bit hard to say whether they are linearly normalized or not; it is more of a heuristic. In terms of confidence, I don't think any scikit-learn implemented function will provide one (which is not to say we cannot derive one).

The benchmark I presented in the paper is largely based on rank, so based on that, I'd interpret the values in a more qualitative way: if a splicing factor's activity is consistently higher in tumor than in normal, then I'd believe it has a role in tumorigenesis. I think the absolute values should also be informative, but since I didn't conduct any benchmark on that, I don't have scientific evidence to suggest whether 0.8 is exactly twofold more active than 0.4.
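
One possible way to do such a rank-based, qualitative comparison is sketched below. The activity values, factor names, and sample grouping are made up for illustration, and this is not the benchmark procedure from the paper.

```python
import pandas as pd

# Hypothetical activity table: rows are splicing factors, columns are samples.
activity = pd.DataFrame(
    {
        "tumor1": [0.50, 0.30, 0.20],
        "tumor2": [0.45, 0.35, 0.20],
        "normal1": [0.20, 0.30, 0.50],
        "normal2": [0.15, 0.30, 0.55],
    },
    index=["SF_A", "SF_B", "SF_C"],
)
groups = {"tumor": ["tumor1", "tumor2"], "normal": ["normal1", "normal2"]}

# Rank factors within each sample (1 = most active in that sample).
ranks = activity.rank(axis=0, ascending=False)

# Mean rank per group; a negative difference means ranked higher in tumor.
mean_rank = pd.DataFrame(
    {name: ranks[cols].mean(axis=1) for name, cols in groups.items()}
)
mean_rank["tumor_minus_normal"] = mean_rank["tumor"] - mean_rank["normal"]
print(mean_rank.sort_values("tumor_minus_normal"))
```

Comparing ranks rather than raw values avoids assuming that the activity scale is linear.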

Hoping this helps a bit, Frank

UCSFJL commented 2 months ago

That is very interesting, and thank you for the insightful comments regarding the output values. For some reason, the values of my SPRINT outputs are consistently between 0-1, and the rows and columns are reversed compared to your example (in my case, RBP:event location as columns and samples as rows). I'm not sure whether these differences will influence the analysis, but I would love to eliminate them. Due to issues with downloading AltAnalyze on my institute's HPC, I ran the analysis with the AltAnalyze GUI downloaded from the AltAnalyze website; the first snapshot below shows the event annotation file from the output. Does it look similar to your event annotation files? I then used the GUI AltAnalyze output to run the SPRINT analysis on the HPC. I would like to know which step of the analysis went wrong and led to the different values. If possible, do you have a sample event_annotation file I can use to test the setup of my SPRINT pipeline? I also noticed that all values in my prior file are 0, 0.49, and 0.51 (second picture). Please let me know whether this is what the prior file should look like, in case there was data corruption during my download. Thank you very much for your help!

Screenshot 2024-06-30 at 5 47 10 PM
Screenshot 2024-06-30 at 9 14 11 PM

frankligy commented 2 months ago

Hi @UCSFJL,

The file you showed is the prior network of shape n_event × n_rbp, but there should be another file named mdt_estimate or mdt_estimate_morpheus of shape n_rbp × n_sample. I think if you open that one, it will contain the predicted RBP activity in each sample.
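
A quick way to check which orientation a table has is sketched below. It assumes a tab-delimited text file with labels in the first row and first column; the helper name `rbp_by_sample`, the RBP labels, and the values are hypothetical, and an in-memory string stands in for the real file.

```python
import io

import pandas as pd


def rbp_by_sample(table: pd.DataFrame, sample_ids: list[str]) -> pd.DataFrame:
    """Return the table with RBPs as rows and samples as columns."""
    if set(sample_ids) <= set(table.columns):
        return table
    if set(sample_ids) <= set(table.index):
        return table.T
    raise ValueError("sample IDs not found in either rows or columns")


# Stand-in for a file stored as samples by RBPs (made-up values).
text = (
    "\tRBP1\tRBP2\tRBP3\n"
    "sample1\t0.1\t0.7\t0.2\n"
    "sample2\t0.4\t0.5\t0.1\n"
)
table = pd.read_csv(io.StringIO(text), sep="\t", index_col=0)

estimate = rbp_by_sample(table, ["sample1", "sample2"])
print(estimate.shape)  # prints (3, 2), i.e. (n_rbp, n_sample)
```

For a real file, replace the `io.StringIO(text)` argument with the path to the output file.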

Regarding AltAnalyze, I didn't spot any concerns in the EventAnnotation file; I think the output is the same between the Docker version and the GUI. Tagging my advisor @nsalomonis in case he has any additional insights.

So in a nutshell, see if you can find the right estimation output, and we can go from there. Happy to share an EventAnnotation file to compare as well (https://www.dropbox.com/scl/fi/gnfyacbso204zcyajfdbb/Hs_RNASeq_top_alt_junctions-PSI_EventAnnotation.txt?rlkey=90vrkidawu7qc7k4dbnzc23tc&st=0ls6pwed&dl=0).

Thank you, Frank