Sage-Bionetworks / NF_LandscapePaper_2019

This repository hosts all the code used to generate analyses and figures for the landscape paper

evaluate the NF-high LVs correlation to expression in recount2 #17

Closed · allaway closed this 4 years ago

allaway commented 4 years ago

Look back to recount2 LV expression matrix to assess which samples are most like the NF tumors, and to evaluate the biology of individual LVs highlighted in NF tumor analysis.

allaway commented 4 years ago

Questions:

- Did Taroni et al. do this in the original publication? (gotta check)
- Is this the right approach, or are there better ways of tying our results to more generalized biological phenomena?

allaway commented 4 years ago

Maybe @jaclyn-taroni and @cgreene would have insight into this issue?

cgreene commented 4 years ago

What we have found helpful in the past (ADAGE/eADAGE paper - denoising autoencoder models but similar overall idea) was to identify key LVs of interest and examine which samples were at the extremes of the LV distribution for those in the broader compendium. Focusing on pulling back recount2 samples that are at the extremes of a few key LVs that are high in NF seems like an interesting and potentially productive effort. I'm not sure that this would generally help you assess which samples are most like the NF tumors (for that you might want to do correlation over the full latent space), but the extremes approach may be more useful.

Doing correlation in the compressed space vs. correlation in the raw gene expression space will give pretty similar results - the difference is whether you want signals that are well modeled by one or a few PLIER LVs, but spread across a large number of genes, to be collapsed. In other contexts, we have found this collapsing to be helpful. I'd expect the differences here to be somewhat subtle, though.
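A minimal sketch of both approaches in R, assuming `recount2_lv` (recount2 samples × LVs) and `nf_lv` (NF tumor samples × LVs) hold sample scores from the same PLIER model; the object names and LV labels here are hypothetical placeholders, not objects from the repo:

```r
# Hypothetical inputs: recount2_lv and nf_lv are samples-x-LVs matrices
# from the same PLIER model; key_lvs is a short list of NF-high LVs.
key_lvs <- c("LV1", "LV2")  # placeholders for the actual LVs of interest
n_top <- 50

# 1) Samples at the extremes of each key LV (eADAGE-style)
extreme_samples <- lapply(key_lvs, function(lv) {
  ord <- order(recount2_lv[, lv], decreasing = TRUE)
  rownames(recount2_lv)[head(ord, n_top)]
})
names(extreme_samples) <- key_lvs

# 2) Correlation over the full latent space, against the mean NF profile
nf_mean <- colMeans(nf_lv)
latent_cor <- apply(recount2_lv, 1, cor, y = nf_mean, method = "spearman")
head(sort(latent_cor, decreasing = TRUE), n_top)
```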

jaclyn-taroni commented 4 years ago

> I'm not sure that this would generally help you assess which samples are most like the NF tumors (for that you might want to do correlation over the full latent space)

Wanted to note that I'm expecting to see lots of zeroes/near-zero values in the full latent space. This is somewhat intuitive -- we wouldn't necessarily expect immune cell signals to be relevant in all cell line experiments that are in recount2. That may influence your plan of attack, and it's why I think

> What we have found helpful in the past (ADAGE/eADAGE paper - denoising autoencoder models but similar overall idea) was to identify key LVs of interest and examine which samples were at the extremes of the LV distribution for those in the broader compendium. Focusing on pulling back recount2 samples that are at the extremes of a few key LVs that are high in NF seems like an interesting and potentially productive effort.

may be a good bet.
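A quick way to sanity-check this expectation, using the same hypothetical `recount2_lv` matrix as in the sketch above (the near-zero threshold is an arbitrary choice):

```r
# Fraction of near-zero entries in the recount2 latent space
mean(abs(recount2_lv) < 1e-3)

# Per-LV sparsity, to see which LVs are active in only a few samples
lv_sparsity <- colMeans(abs(recount2_lv) < 1e-3)
summary(lv_sparsity)
```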

allaway commented 4 years ago

Thank you for the suggestions, @jaclyn-taroni and @cgreene!

> I'm not sure that this would generally help you assess which samples are most like the NF tumors (for that you might want to do correlation over the full latent space)

> Wanted to note that I'm expecting to see lots of zeroes/near-zero values in the full latent space. This is somewhat intuitive -- we wouldn't necessarily expect immune cell signals to be relevant in all cell line experiments that are in recount2.

It sounds like you are both in agreement, so this will be my first plan, since we already have a short list of these:

> key LVs of interest

I'll look at global correlation as well, but will deprioritize that for now.

Thanks again!

allaway commented 4 years ago

We looked at this in 12-Interesting-LVs-in-recount2

cgreene commented 4 years ago

Now that I've had the chance to look at https://sage-bionetworks.github.io/NF_LandscapePaper_2019/results/12-Interesting-LVs-in-recount2.html I have some potential suggestions.

There are quite a lot of significant LVs between MPNST and pNF tumors - perhaps more than I would have naively suspected.

The results in the correlation table are interesting. It would probably be helpful to summarize them in some way to understand which properties of samples are enriched among the highly correlated set for both tumor types.

You could calculate the tf-idf for words in the top, say, 1% of samples by correlation for each cancer type relative to the full set (probably after filtering out English-language stopwords). That might help to convert the large table into a short, digestible summary based on the content in the table.
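A rough sketch of this suggestion in R with tidytext (the package used later in this thread); the data frame and column names (`recount2_meta`, `description`, `group`) are hypothetical stand-ins for the actual metadata:

```r
library(dplyr)
library(tidytext)  # provides unnest_tokens(), stop_words, bind_tf_idf()

# Hypothetical input: recount2_meta has one row per sample with a free-text
# `description` column; `group` marks whether a sample falls in the top 1%
# by correlation for a given tumor type, or in the background set.
tfidf_summary <- recount2_meta %>%
  unnest_tokens(word, description) %>%        # one row per word per sample
  anti_join(stop_words, by = "word") %>%      # drop English stopwords
  count(group, word, sort = TRUE) %>%
  bind_tf_idf(word, group, n) %>%             # tf-idf of each word per group
  arrange(desc(tf_idf))

head(tfidf_summary, 20)  # most distinctive words for each group
```

Here each top-1% set is treated as one "document" for the idf calculation, which is one reasonable reading of the suggestion above.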

allaway commented 4 years ago

Based on your comment, I was initially thinking I should look at the tf-idf of terms in just the tumor-type-correlated descriptions and compare the high tf-idf values to the tf-idf of the full recount2 'corpus.' The issue I'm running into is tokenizing the recount2 descriptions - I'm hitting a memory limit. Maybe this is too much information to tokenize, and I should just do the 1% of samples and not worry about the comparison.

Another slight difficulty: when aggregating the correlation across the full tumor type, the correlation to any particular recount2 sample drops considerably (max ~0.5 for an individual sample, but only ~0.2 max on average across all samples of a given tumor type). So maybe I should consider a single sample at a time to get better-correlated samples, and then calculate mean tf-idf across the tumor types...

cgreene commented 4 years ago

Are you using the sklearn implementation? I didn't realize that it would run out of memory - the set doesn't seem that large. How large is the recount2 sample description corpus? Here's a Stack Overflow thread on Python implementations: https://stackoverflow.com/questions/25145552/tfidf-for-large-dataset

allaway commented 4 years ago

Nope! I'm using tidytext, which I am guessing is not as efficient as sklearn. It's actually not the tf-idf calculation giving me issues, but tokenization of the dataset into a tidy one-row-per-word format.

The set is not that large - ~50k samples, where each has a fairly short description string. The sample data for the tidytext manual is six entire Jane Austen novels, which I am guessing is probably larger than the recount2 metadata dataset.

So there are two likely possibilities: (1) it's too much data for the unnest_tokens function to handle (unlikely, I think), or (2) I'm doing something wrong.

Either way, it sounds like sklearn might be better than tidytext (...and gives different results... https://stats.stackexchange.com/questions/379663/can-cosine-similarity-be-used-to-measure-similarity-between-words).

I'll take another stab at the R approach but if I continue to run into issues I'll try out sklearn. Thanks!
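One possible workaround for the tokenization memory issue is to split the metadata into chunks before calling unnest_tokens and bind the results back together. This is only a sketch: the object names are hypothetical, and the assumption that the memory spike comes from tokenizing everything at once is unverified.

```r
library(dplyr)
library(tidytext)
library(purrr)

# Hypothetical input: recount2_meta with ~50k rows and a `description` column.
# Tokenize 5,000 samples at a time to keep intermediate objects small.
chunks <- split(recount2_meta, ceiling(seq_len(nrow(recount2_meta)) / 5000))
tokens <- map_dfr(chunks, ~ unnest_tokens(.x, word, description))
```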

allaway commented 4 years ago

I think we will revisit this later.