broadinstitute / wot

A software package for analyzing snapshots of developmental processes
https://broadinstitute.github.io/wot/
BSD 3-Clause "New" or "Revised" License
140 stars 34 forks source link

How are the gene scores calculated? #51

Open yeroslaviz opened 5 years ago

yeroslaviz commented 5 years ago

In the new version you calculate the cell sets over gene scores. How are these scores being calculated?

geoffschieb commented 5 years ago

This is described in the Methods S1 supplemental section of our paper.

On Tue, May 7, 2019 at 6:38 AM frymor notifications@github.com wrote:

In the new version you calculate the cell sets over gene scores. How are these being calculated?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/broadinstitute/wot/issues/51, or mute the thread https://github.com/notifications/unsubscribe-auth/ACJCQYQ3PH636J2NZUB6ITLPUGA6BANCNFSM4HLI5QPQ .

yeroslaviz commented 5 years ago

thanks, but where? I have the Methods S1 part, but I can't find, where the gene scores calculations is explained.

geoffschieb commented 5 years ago

Dear Frymor,

We have just released WOT 1.0 along with a tutorial: https://broadinstitute.github.io/wot/tutorial/

The gene scores calculations are explained in Notebook 1 and the STAR Methods of our paper (page e5, section titled "Creating gene signatures and cell sets").

Best, Geoff

On Thu, May 9, 2019 at 5:36 AM frymor notifications@github.com wrote:

thanks, but where? I have the Methods S1 part, but I can't see, where the gene scores calculations is explained.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/broadinstitute/wot/issues/51#issuecomment-490832748, or mute the thread https://github.com/notifications/unsubscribe-auth/ACJCQYXRRMDGMH6EARQ7HIDPUPWDBANCNFSM4HLI5QPQ .

yeroslaviz commented 5 years ago

This looks great and i will definitely test it. But on the fly i have seen, that in the examples you have on the web site, the cell-sets are still being calculated based on the matrix file and not the gene scores. Is that correct?

geoffschieb commented 5 years ago

Actually the cell sets are just loaded there ... not even computed!

In Schiebinger et al 2019 we compute cell sets "by hand" through a mix of clustering and using gene signature scores.

Clustering: we divide the whole dataset into 65 clusters and merge clusters manually to create the IPS, Stromal and Neural cell sets. But this doesn't help us find all cell sets.

The Trophoblast and Epithelial cell sets are not well identified through clustering. They are created through thresholding gene signature scores. This is a very simple computation: 1) compute gene signature scores 2) take top x% to get a cell set.

The MET cell set is defined in an entirely different way! These are defined using the ancestor distributions of iPSC, Trophoblast and Neural.

I'm happy to discuss any of this further.

Best, Geoff

On Sat, May 11, 2019 at 1:37 PM frymor notifications@github.com wrote:

This looks great and i will definitely test it. But on the fly i have seen, that in the examples you have on the web site, the cell-sets are still being calculated based on the matrix file and not the gene scores. Is that correct?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/broadinstitute/wot/issues/51#issuecomment-491530453, or mute the thread https://github.com/notifications/unsubscribe-auth/ACJCQYWH25E4LGPLJPCDNF3PU374JANCNFSM4HLI5QPQ .

yeroslaviz commented 5 years ago

wot v. 1.0.0 is not reachable. when doing either

git clone https://github.com/broadinstitute/wot.git
cd wot
pip install -e .

or when tryin to install it via pip install wot

Both command give me only version 0.5.7

Actually the cell sets are just loaded there ... not even computed!

In your example of how to create cell-sets you still use the older command wot cells_by_gene_set --matrix matrix.txt --gene_sets gene_sets.gmt --out cell_sets.gmt --format gmt --quantile 0.99.

joshua-gould commented 5 years ago

It was released today.

On Mon, May 13, 2019 at 3:18 AM frymor notifications@github.com wrote:

wot v. 1.0.0 is not reachable. when doing either

git clone https://github.com/broadinstitute/wot.git cd wot pip install -e .

or when tryin to install it via pip install wot

Both command give me only version 0.5.7

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/broadinstitute/wot/issues/51#issuecomment-491704347, or mute the thread https://github.com/notifications/unsubscribe-auth/ABH6TH4DT5BQTK35BFVOSUTPVEI3JANCNFSM4HLI5QPQ .

yeroslaviz commented 5 years ago

I would still like to know if you can help with this issue. I have a calculated the gene-score using the gene_set_scores command and got for each of my 18 clusters a txt file with the corresponding cell ID and mean_z-score (mean, mean_rank) value.

This was followed by the cell-set calculation using the command

wot cells_by_gene_set  --score Output/p2_geneScores_Cluster10.txt --score Output/p2_geneScores_Cluster11.txt --score Output/p2_geneScores_Cluster12.txt --score ... Output/p2_geneScores_Cluster7.txt --score Output/p2_geneScores_Cluster8.txt --score Output/p2_geneScores_Cluster9.txt --out Output/p2_original_cell_sets.gmt

listing all 18 gene score files. I have tried this for all three options, the mean_z-score, the mean and the mean_rank. In all three I get the gmt file with 18 clusters, but all the clusters are exactly the same length, but each with a list different cells (cell IDs).

I was wondering if this makes any sense. Can it be, that all cell sets are of the same length?

thanks, Assa

P.S. 1 If you think it might be helpful, I can share some of the files with you

P.S. 2 I'm really sorry about bombarding you with questions/problems. I just think the method can be very useful and we do want to use it for our data set (and publication). But we need to try to better understand it and free it from the bugs.

geoffschieb commented 5 years ago

Yes it makes sense that all the cell sets would be the same length when you run the command this way.

The command selects the top x percent of cells according to the score. You are using the same value of x for each score. What you need to do differently is to manually look at histograms of these scores and identify a doffeeent cutoff for each score. Then run the command 12 times (once for each score) with a different value of the cutoff.

On Fri, May 31, 2019 at 6:30 AM frymor notifications@github.com wrote:

I would still like to know if you can help with this issue. I have a calculated the gene-score using the gene_set_scores command and got for each of my 18 clusters a txt file with the corresponding cell ID and mean_z-score (mean, mean_rank) value.

This was followed by the cell-set calculation using the command

wot cells_by_gene_set --score Output/p2_geneScores_Cluster10.txt --score Output/p2_geneScores_Cluster11.txt --score Output/p2_geneScores_Cluster12.txt --score ... Output/p2_geneScores_Cluster7.txt --score Output/p2_geneScores_Cluster8.txt --score Output/p2_geneScores_Cluster9.txt --out Output/p2_original_cell_sets.gmt

listing all 18 gene score files. I have tried this for all three options, the mean_z-score, the mean and the mean_rank. In all three I get the gmt file with 18 clusters, but all the clusters are exactly the same length, but each with a list different cells (cell IDs).

I was wondering if this makes any sense. Can it be, that all cell sets are of the same length?

thanks, Assa

P.S. 1 If you think it might be helpful, I can share some of the files with you

P.S. 2 I'm really sorry about bombarding you with questions/problems. I just think the method can be very useful and we do want to use it for our data set (and publication). But we need to try to better understand it and free it from the bugs.

— You are receiving this because you commented.

Reply to this email directly, view it on GitHub https://github.com/broadinstitute/wot/issues/51?email_source=notifications&email_token=ACJCQYVSYTYDUAEOAD7JCBDPYD42VA5CNFSM4HLI5QP2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODWU3M5I#issuecomment-497661557, or mute the thread https://github.com/notifications/unsubscribe-auth/ACJCQYWPPNRYTJWTPUNRQ7DPYD42VANCNFSM4HLI5QPQ .

yeroslaviz commented 5 years ago

This is good to know, thanks. I thought it calculates it separately for each file. Now I understand why I get for each cluster a separate file :-)

But I still don't get how to "know" what a good cutoff means under the circumstances. below are the histograms of two of the clusters. As you can see they look completely different.

What would be for example a good threshold for these two cases?

Screenshot 2019-05-31 15 36 42

What I mean is how would one choose a cutoff which makes sense? (I couldn't find any mention to it in the paper or methods.)

I can see, that there is not good method to automate this step, but it is still confusing. I've tried to understand how you calculated the gene scores for you example data set, but it seems that it is not corresponding to the gene score values (at least it seems so to me). How did on go on about setting this cutoff?

thanks Assa

geoffschieb commented 5 years ago

The cluster 2 histogram is bimodal, so a cutoff of 8000 might be good. Not so sure about cluster 1. You can try visualizing the cells above different thresholds in the fle and see which levels give coherent sets.

On Fri, May 31, 2019 at 8:49 AM frymor notifications@github.com wrote:

This is good to know, thanks. I thought it calculates it separately for each file. Now I understand why I get for each cluster a separate file :-)

But I still don't get how to "know" what a good cutoff means under the circumstances. below are the histograms of two of the clusters. As you can see they look completely different.

What would be a good threshold for these two cases and how would one choose a cutoff?

[image: Screenshot 2019-05-31 14 43 11] https://user-images.githubusercontent.com/8927555/58706510-96dd1780-83b2-11e9-884c-1ab275535bd1.png

— You are receiving this because you commented.

Reply to this email directly, view it on GitHub https://github.com/broadinstitute/wot/issues/51?email_source=notifications&email_token=ACJCQYTBDQV67PZE3X7J4T3PYENE5A5CNFSM4HLI5QP2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODWVD7AY#issuecomment-497696643, or mute the thread https://github.com/notifications/unsubscribe-auth/ACJCQYSEQP5C7GEWRTY2DUDPYENE5ANCNFSM4HLI5QPQ .

yeroslaviz commented 5 years ago

Thanks for the answer. But I would really like to know, is why you chose 8000 in cluster2 and not sure about cluster1. cluster1 can also be seen (in a way) as bimodal. What does the value 8000 here means? I don't have this kind of high values in the data the cluster1 df looks like

id      Cluster1_score
p2_cortex_143_AAACCTGAGGCTAGGT  0.44846073
p2_cortex_143_AAACGGGCAATCGGTT  0.47955644
p2_cortex_143_AAAGTAGCATAAAGGT  0.3165601
p2_cortex_143_AAATGCCGTGTGTGCC  0.14038791
p2_cortex_143_AACCGCGTCGTTGACA  0.26206216
p2_cortex_143_AACGTTGCATTCACTT  0.36493918

I would like to understand how you choose these values, so that I won't need to upload the histogram each time I have a dataset and ask for your opinion. :-)

geoffschieb commented 5 years ago

Now I'm confused about your histograms because the values all look small in the table you just sent but the x-axis for the histogram has values of like

  1. The histogram for cluster 2 looked bimodal with a minimum around 8000 on the x axis.

Best, Geoff

On Tue, Jun 4, 2019 at 3:08 AM frymor notifications@github.com wrote:

Thanks for the answer. But I would really like to know, is why you chose 8000 in cluster2 and not sure about cluster1. cluster1 can also be seen (in a way) as bimodal. What does the value 8000 here means? I don't have this kind of high values in the data the cluster1 df looks like

id Cluster1_score p2_cortex_143_AAACCTGAGGCTAGGT 0.44846073 p2_cortex_143_AAACGGGCAATCGGTT 0.47955644 p2_cortex_143_AAAGTAGCATAAAGGT 0.3165601 p2_cortex_143_AAATGCCGTGTGTGCC 0.14038791 p2_cortex_143_AACCGCGTCGTTGACA 0.26206216 p2_cortex_143_AACGTTGCATTCACTT 0.36493918

I would like to understand how you choose these values, so that I won't need to upload the histogram each time I have a dataset and ask for your opinion. :-)

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/broadinstitute/wot/issues/51?email_source=notifications&email_token=ACJCQYRLD7RSWZCC3IUNII3PYYIGJA5CNFSM4HLI5QP2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODW3UG7Y#issuecomment-498549631, or mute the thread https://github.com/notifications/unsubscribe-auth/ACJCQYRUVCCJFT5DZZSBUITPYYIGJANCNFSM4HLI5QPQ .

yeroslaviz commented 5 years ago

The x-axis now looks also similar to the table above. I have ran the gene-score calculations with all three methods mean_z_score, mean_rankand mean. In the third option the values are high. Below are the last two of the three options. On the right hand side are the mean-calculated gene scores. This is the explanation t=for the differences. Sorry about that.

Screenshot 2019-06-05 14 01 32

But regarding choosing the threshold - Do i understand it correctly, that i should look for a (local) minimum on the histogram of the distributed scores for each of the clusters/gene-sets.

But even If I choose the value 8000 for the mean-calculated or the 0.1 for the mean_z_score-calculated values, what do I do with it?

Where Do I input this value in the wot cells_by_gene_set command. Is this the quantile parameter? Do I need to calculate at what quantile the value 8000 stands in the list of scores?

geoffschieb commented 5 years ago

Yes, that's right: you need to calculate at what quantile the value stands in the lists of scores. Then supply the quantile parameter to the command.

On Wed, Jun 5, 2019 at 8:09 AM frymor notifications@github.com wrote:

Well the x-axis now looks also similar to the table above. I have ran the gene-score calculations with all three methods mean_z_score, mean_rankand mean. In the third option the values are high. Below are the last two of the three options. On the right hand side are the mean-calculated gene scores. This is the explanation t=for the differences. Sorry about that.

[image: Screenshot 2019-06-05 14 01 32] https://user-images.githubusercontent.com/8927555/58954807-75a56e00-879a-11e9-9fd7-ea80a4031c85.png

But regarding choosing the threshold - Do i understand it correctly, that i should look for a (local) minimum on the histogram of the distributed scores for each of the clusters/gene-sets.

But even If I choose the value 8000 for the mean-calculated or the 0.1 for the mean_z_score-calculated values, what do I do with it?

Where Do I input this value in the wot cells_by_gene_set command. Is this the quantile parameter? Do I need to calculate at what quantile the value 8000 stands in the list of scores?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/broadinstitute/wot/issues/51?email_source=notifications&email_token=ACJCQYT6AZS7MDUCQFHC33DPY6UJHA5CNFSM4HLI5QP2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODW7P7PQ#issuecomment-499056574, or mute the thread https://github.com/notifications/unsubscribe-auth/ACJCQYXCVS4YRBJX6SYS3DLPY6UJHANCNFSM4HLI5QPQ .

joshua-gould commented 5 years ago

Another option is to apply filters in Excel to generate your cell sets

On Wed, Jun 5, 2019 at 9:41 AM geoffschieb notifications@github.com wrote:

Yes, that's right: you need to calculate at what quantile the value stands in the lists of scores. Then supply the quantile parameter to the command.

On Wed, Jun 5, 2019 at 8:09 AM frymor notifications@github.com wrote:

Well the x-axis now looks also similar to the table above. I have ran the gene-score calculations with all three methods mean_z_score, mean_rankand mean. In the third option the values are high. Below are the last two of the three options. On the right hand side are the mean-calculated gene scores. This is the explanation t=for the differences. Sorry about that.

[image: Screenshot 2019-06-05 14 01 32] < https://user-images.githubusercontent.com/8927555/58954807-75a56e00-879a-11e9-9fd7-ea80a4031c85.png

But regarding choosing the threshold - Do i understand it correctly, that i should look for a (local) minimum on the histogram of the distributed scores for each of the clusters/gene-sets.

But even If I choose the value 8000 for the mean-calculated or the 0.1 for the mean_z_score-calculated values, what do I do with it?

Where Do I input this value in the wot cells_by_gene_set command. Is this the quantile parameter? Do I need to calculate at what quantile the value 8000 stands in the list of scores?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub < https://github.com/broadinstitute/wot/issues/51?email_source=notifications&email_token=ACJCQYT6AZS7MDUCQFHC33DPY6UJHA5CNFSM4HLI5QP2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODW7P7PQ#issuecomment-499056574 , or mute the thread < https://github.com/notifications/unsubscribe-auth/ACJCQYXCVS4YRBJX6SYS3DLPY6UJHANCNFSM4HLI5QPQ

.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/broadinstitute/wot/issues/51?email_source=notifications&email_token=ABH6TH2V73EUX76QICCWYY3PY6673A5CNFSM4HLI5QP2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODW7XRSQ#issuecomment-499087562, or mute the thread https://github.com/notifications/unsubscribe-auth/ABH6TH2AE2AALZ6SLUQTFS3PY6673ANCNFSM4HLI5QPQ .

yeroslaviz commented 5 years ago

Yes, that's right: you need to calculate at what quantile the value stands in the lists of scores. Then supply the quantile parameter to the command.

Thanks for the reply. But I must admit that this is not really intuitive. As one can see above, the histogram is not always so clear (cluster 1 above). Can you explain to me, what the reason is for taking the minimum? This can be automated, if it is just that.

Another option is to apply filters in Excel to generate your cell sets

What would this filter be like than? sorting the scores in a decreasing order and choose the 0.99 quantile?

yeroslaviz commented 5 years ago

In the newer version (1.0.4) you've modified the script so that it creates one output file for the gene scores in a big table. This is a much better and more efficient way of handling the data. thanks for that.

My problem though is now, that my problem above returns. If I have multiple data sets, how can I calculate separate cell-sets for each of them.

The command selects the top x percent of cells according to the score. You are using the same value of x for each score. What you need to do differently is to manually look at histograms of these scores and identify a doffeeent cutoff for each score. Then run the command 12 times (once for each score) with a different value of the cutoff.

But now this is not possible, as they are all in one table.

Any ideas how to do that?

joshua-gould commented 5 years ago

You can use the command line argument filter to select column names to include.

On Mon, Jun 17, 2019 at 10:03 AM frymor notifications@github.com wrote:

In the newer version (1.0.4) you've modified the script so that it creates one output file for the gene scores in a big table. This is a much better and more efficient way of handling the data. thanks for that.

My problem though is now, that my problem above returns. If I have multiple data sets, how can I calculate separate cell-sets for each of them.

The command selects the top x percent of cells according to the score. You are using the same value of x for each score. What you need to do differently is to manually look at histograms of these scores and identify a doffeeent cutoff for each score. Then run the command 12 times (once for each score) with a different value of the cutoff.

But now this is not possible, as they are all in one table.

Any ideas how to do that?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

yeroslaviz commented 5 years ago

Hello again, sorry to keep insisting on it, but I would really like to understand more about the method of converting the gene-sets into cell-sets.

For better analysis I am using your gene-set file with my expression file ( put aside whether or not it makes biologically sense). I have created the gene-scores using this command:

wot gene_set_scores --matrix p2_matrix.transposed.h5ad --method mean_z_score --gene_sets_Ex.gmx --out gene_Scores_EX

and got a gene score file

id  MEF.identity_score  Pluripotency_score  Cell.cycle_score ...
p2_cortex_143_AAACCTGAGGCTAGGT  0.09748856  -0.15128791 -0.09660525 ...
p2_cortex_143_AAACGGGCAATCGGTT  -0.024704605    -0.16561882 -0.21638143 ...
p2_cortex_143_AAAGTAGCATAAAGGT  0.07285158  -0.03847813 -0.1316462 ...
...

with these scores I would like to calculate my cell sets. As you mentioned above it, to be able to find the best value for the quantile, it would be a good idea to create a histogram of the scores (s. below) and look for the local minimum.

Screenshot 2019-07-17 13 59 07

Calculating the quantiles for e.g. ER.stress scroes (right histogram) gives me the following values:

      80%       85%       90%       95%       99%
0.1202166 0.1641223 0.2210341 0.3048465 0.4497572

I guess in this case I can take the .99 value for a quantile, as it can be considered as a local minima, but this would be more difficult in a histogram as i have posted previously, where one has a bi-modal behavior. Is there a "general rule" one can rely on to choose the "correct" value?