Strategy for selections of genes

jfnavarro commented 3 years ago

Hi Alma,

The current strategy (if I am not mistaken) is to fit the models from the SC data using either all or top N genes. When the top N genes option is used, the top N genes in the ST data will be selected prior the intersection with the SC genes to estimate the proportions. I think this could lead to unwanted scenarios where some of the top N SC genes are not present in the top N ST genes. There is no right or wrong but I believe it may not be necessary to apply a top N filter to the ST data if it has been applied to the SC data. Or perhaps this is intended? It may be then a good idea to make it optional (top N filter for ST data)? On a side note, as you said in the docs, a highly variable genes option would be nice to have. Are you willing to take PRs?

Also, in relation to this, I believe stereoscope does not intersect the ST and SC genes prior fitting of the SC models because the SC models may be used in more than on ST dataset (among other reasons) but it may be a nice feature to have a flag to optionally allow the intersection of ST and SC genes prior fitting the SC models. Again, there is no right or wrong here.

almaan commented 3 years ago

Hi @jfnavarro, so happy to hear from you!

as always you raise some very good points, that I will try answer to the best of my ability.

The current recommended strategy is actually to use a custom gene list based on the most highly variable genes, the tutorial reads: NOTE : In the original manuscript we used the top 5000 (w.r.t. expression levels) genes to show that stereoscope produces good results without any need for pre-processing. However, for optimal results we would recommend a more sophisticated selection of genes; one suggestion is to use the function highly_variable_genes from scanpy's pre-processing module (see documentation). Once a set of genes has been extracted, these can be provided (as a .txt file) to stereoscope using the -gl flag. In the next update, this will be introduced as an option in the run module.

This approach has given slightly more "crisp" results compared to only basing the selection on the top expressed genes in the single cell data. Especially when working with very "fine" cell types defined by only a select set of marker genes. I'm having some plans to include this as an option, where one does not have to specify a custom list but can use select the flavor of gene selection as either "topn" or "hvg".

However, if one is not using a custom gene list, the intended approach was indeed to fit the model to the top N genes in the SC data. The genes used for the proportion estimate process later on will be the intersection between these top N genes and all the genes present in the ST data, so if certain genes present in the SC data are absent in the ST data, they will not be included in the analysis. I decided to use this strategy because: (1) the fitted SC parameters can be reused and applied on different ST data sets, so I didn't want the ST data to dictate which genes were select in case one later on uses a different ST dataset, and (2) since the cell types are defined in the SC data it seems more apt to let the SC data dictate which genes that are to be used.

Now, I think I see your point and the concerns you raise, assuming that your are referring to the code: st_data = D.make_st_dataset(args.st_cnt, topn_genes = args.topn_genes, min_counts = args.min_st_counts, min_spots = args.min_spots, filter_genes = args.filter_genes, transpose = args.st_transpose, )

in the run.py file? Where the command topn_genes=args.topn_genes will apply a gene selection step before intersecting with the genes used in the single cell data. That's very well spotted and something that should be changed. I'm happy to take a PR for this but could also fix it myself.

Regarding having a flag for ST and SC intersection, that's an excellent suggestion that I'm more than happy to take a PR for as well. In general, I'm very open to any PR, it's great if people want to contribute.

Hope all is good, and thanks for the input!

jfnavarro commented 3 years ago

Hi @jfnavarro, so happy to hear from you!

as always you raise some very good points, that I will try answer to the best of my ability.

The current recommended strategy is actually to use a custom gene list based on the most highly variable genes, the tutorial reads: NOTE : In the original manuscript we used the top 5000 (w.r.t. expression levels) genes to show that stereoscope produces good results without any need for pre-processing. However, for optimal results we would recommend a more sophisticated selection of genes; one suggestion is to use the function highly_variable_genes from scanpy's pre-processing module (see documentation). Once a set of genes has been extracted, these can be provided (as a .txt file) to stereoscope using the -gl flag. In the next update, this will be introduced as an option in the run module.

This approach has given slightly more "crisp" results compared to only basing the selection on the top expressed genes in the single cell data. Especially when working with very "fine" cell types defined by only a select set of marker genes. I'm having some plans to include this as an option, where one does not have to specify a custom list but can use select the flavor of gene selection as either "topn" or "hvg".

Yes :) That is what I tried to say in my comment. I also believe the highly variable genes option should be preferred.

However, if one is not using a custom gene list, the intended approach was indeed to fit the model to the top N genes in the SC data. The genes used for the proportion estimate process later on will be the intersection between these top N genes and all the genes present in the ST data, so if certain genes present in the SC data are absent in the ST data, they will not be included in the analysis. I decided to use this strategy because: (1) the fitted SC parameters can be reused and applied on different ST data sets, so I didn't want the ST data to dictate which genes were select in case one later on uses a different ST dataset, and (2) since the cell types are defined in the SC data it seems more apt to let the SC data dictate which genes that are to be used.

Yes, those were the reasons that I thought were behind. Although, a flag letting the user decide on the intersection strategy (pre or post) may be handy.

Now, I think I see your point and the concerns you raise, assuming that your are referring to the code: st_data = D.make_st_dataset(args.st_cnt, topn_genes = args.topn_genes, min_counts = args.min_st_counts, min_spots = args.min_spots, filter_genes = args.filter_genes, transpose = args.st_transpose, )

in the run.py file? Where the command topn_genes=args.topn_genes will apply a gene selection step before intersecting with the genes used in the single cell data. That's very well spotted and something that should be changed. I'm happy to take a PR for this but could also fix it myself.

Yes, exactly!

Regarding having a flag for ST and SC intersection, that's an excellent suggestion that I'm more than happy to take a PR for as well. In general, I'm very open to any PR, it's great if people want to contribute.

Hope all is good, and thanks for the input!

Thanks for your reply! Hope all is well for you too! I may create a PR with these changes if I find the time soon but good to hear that you are planning on doing them yourself too.

Keep in touch!

jfnavarro commented 3 years ago

so I went ahead a did the changes myself (see PR). I did some tests and they completed fine.

johnyaku commented 2 years ago

Interesting discussion! I'm also attracted to the idea of providing a --gene_list of highly variable genes. I see that scanpy.pp.highly_variable_genes() offers three 'flavors' for selecting genes: 'seurat' (default), 'seurat_v3' and 'cell_ranger'. Do you have a favorite flavor? Do you recommend running with the default parameters, or do you have a suggested configuration? In particular, if your favorite flavor is 'seurat_v3' do you have a recommendation for n_top_genes?

almaan / stereoscope

Strategy for selections of genes #24