Some suggestions about OV

lisch7 commented 10 months ago

Hi,

Firstly, I'd like to commend the OV project for its contributions to scRNA analysis. I have a few suggestions that could potentially enhance its utility:

Flexibility in ov.pp.preprocess: This function integrates several key processing steps. However, some steps like robust gene identification and gene filtering are mandatory. It might be beneficial to offer more control here. For instance, adding a parameter such as robust_gene=True, threshold=0.05 could provide users with the option to toggle this feature. Similarly, the mandatory use of sc.pp.normalize_total(..., exclude_highly_expressed=True...) could be made optional with a control parameter.
Expanding regress Functionality: Currently, the regress function seems limited to specific parameters like mito_perc and nUMIs. It would be advantageous to allow regression on other variables as per user requirements.
Concerns with regress_and_scale: In the current implementation, I'm wondering if replacing adata = sc.pp.regress_out(adata, ['mito_perc', 'nUMIs']) with adata_mock = sc.pp.scale(adata_mock) at line 471 might be more appropriate. This change could potentially improve the function's performance or accuracy.

I believe these enhancements could make OV even more flexible and user-friendly for diverse scRNA analysis scenarios. Looking forward to your thoughts on this.

lisch7 commented 10 months ago

Moreover, why not enroll Celltypist and BBKNN into OV? I think some of your previous excellent strategies for batch removal and annotation on Wechat Official Accounts can also be integrated into the OV process.

Starlitnightly commented 10 months ago

Thank you for your suggestion, here is the response to your suggestion:

1. more parameters are indeed optional in preprocess, it is worth noting that we auto-calculate the high variable genes, but instead of auto-filtering, we store the raw values in adata.layers['counts']. So you can use adata.layers['counts'] to get the raw values if you need to.
1. any regression is not implemented in omicverse, this is due to the fact that the regression will interpolate the dropout phenomenon, but is this interpolation really correct? Much of the literature expresses concern about this
1. instead of modifying the original data, we stored the scaled values in adata.layers['scaled'] for the calculation of pca only.
1. for Celltypist, it's a good auto annotation algorithm, but considering its installation dependency may conflict with omicverse installation, we suggest that users can use it on their own, and we may add it in a future release.
1. For BBKNN, this de-batching algorithm is so obsolete that its performance is so poor that it is not acceptable to anyone. We have compared the results of different de-batching algorithms in detail in the tutorial on batch correction, where the optimal algorithm without GPU is harmony and the optimal algorithm with GPU is scVI.

Thanks again for your suggestions. Zehua

lisch7 commented 10 months ago

Thanks for your kind reply, but it seems that I didn't express what I meant exactly. I appreciate the opportunity to clarify my suggestions:

Flexibility in ov.pp.preprocess. I noticed that in the preprocessing step (ov.pp.preprocess), certain parameters are set by default, such as in the snippet from lines 372-379 in _preprocess.py:
```
sc.pp.normalize_total(
adata, 
target_sum=target_sum,
exclude_highly_expressed=True,
max_fraction=0.2,
)
```
Here, exclude_highly_expressed=True is automatically applied. In scanpy, the default of exclude_highly_expressed is False (https://scanpy.readthedocs.io/en/stable/generated/scanpy.pp.normalize_total.html#scanpy.pp.normalize_total). It may be more appropriate to set it to True based on your experience, but I think it may be more appropriate to stay consistent with the classic tutorial. Thus, my suggestion is to allow users to adjust this setting, perhaps through an additional parameter in ov.pp.preprocess. This could provide more control over the preprocessing based on specific data characteristics.
Expanding regress Functionality. In the regress function (lines 437-452 in _preprocess.py), the parameters mito_perc and nUMIs are fixed targets for regression. I propose enhancing this function's flexibility by allowing users to specify which parameters to regress. This flexibility could be crucial for analyses where other variables might be more relevant.
Concerns with regress_and_scale. Regarding the regress_and_scale function, particularly at line 471 in _preprocess.py, I wonder if the code adata_mock = scale(adata_mock) could be modified to adata_mock = sc.pp.scale(adata_mock). As someone relatively new to Python, I'm not sure if this change would be more appropriate or efficient, and would appreciate your insight on this.

Looking forward to your thoughts on this.

Starlitnightly commented 10 months ago

Thanks for your kind reply, but it seems that I didn't express what I meant exactly. I appreciate the opportunity to clarify my suggestions:

Flexibility in ov.pp.preprocess. I noticed that in the preprocessing step (ov.pp.preprocess), certain parameters are set by default, such as in the snippet from lines 372-379 in _preprocess.py:
sc.pp.normalize_total(
    adata, 
    target_sum=target_sum,
    exclude_highly_expressed=True,
    max_fraction=0.2,
)
Here, exclude_highly_expressed=True is automatically applied. In scanpy, the default of exclude_highly_expressed is False (https://scanpy.readthedocs.io/en/stable/generated/scanpy.pp.normalize_total.html#scanpy.pp.normalize_total). It may be more appropriate to set it to True based on your experience, but I think it may be more appropriate to stay consistent with the classic tutorial. Thus, my suggestion is to allow users to adjust this setting, perhaps through an additional parameter in ov.pp.preprocess. This could provide more control over the preprocessing based on specific data characteristics.

Expanding regress Functionality. In the regress function (lines 437-452 in _preprocess.py), the parameters mito_perc and nUMIs are fixed targets for regression. I propose enhancing this function's flexibility by allowing users to specify which parameters to regress. This flexibility could be crucial for analyses where other variables might be more relevant.

Concerns with regress_and_scale. Regarding the regress_and_scale function, particularly at line 471 in _preprocess.py, I wonder if the code adata_mock = scale(adata_mock) could be modified to adata_mock = sc.pp.scale(adata_mock). As someone relatively new to Python, I'm not sure if this change would be more appropriate or efficient, and would appreciate your insight on this.

Looking forward to your thoughts on this.

Thanks for your advice, we will add more parameter in next version.

Zehua

Liang-Wu-01 commented 10 months ago

Follow the issue, I had one question want to ask. Do the authors had any abvises for the huge dataset to save the RAM memory when used the OV to do the analysis? Not sure what kind statistics in the pp.reprocess, it take lot of RAM memory, can use gc.collect to release the memory?

lisch7 commented 10 months ago

Additionaly, there are some mistakes in the tutorials (https://omicverse.readthedocs.io/en/latest/Tutorials-single/t_single_batch/).

An example:

And similar problem occurs in other calibration batches of tutorials.

Starlitnightly commented 10 months ago

Follow the issue, I had one question want to ask. Do the authors had any abvises for the huge dataset to save the RAM memory when used the OV to do the analysis? Not sure what kind statistics in the pp.reprocess, it take lot of RAM memory, can use gc.collect to release the memory?

If you want to save on RAM expenses, then you might consider setting argument backed='r' when reading h5ad files using sc.read or ov.read

Liang-Wu-01 commented 10 months ago

Follow the issue, I had one question want to ask. Do the authors had any abvises for the huge dataset to save the RAM memory when used the OV to do the analysis? Not sure what kind statistics in the pp.reprocess, it take lot of RAM memory, can use gc.collect to release the memory?

If you want to save on RAM expenses, then you might consider setting argument backed='r' when reading h5ad files using sc.read or ov.read

backed='r' can help save the memory of pp.preprocessing? I remember that only speed up to reading the files. Whatever, will try.

colinwxl commented 1 month ago

QQ20240918-164142 There is bugs in function ov.pp.regress_and_scale

colinwxl commented 1 month ago

I think that the re-calculation of scaled|original|X_pca should be removed, as the funtion batch_correction dose not ask us to input 'log-transformed counts' or 'scaled data', the later input will lead to double scaled.

lisch7 commented 1 month ago

Hi @Starlitnightly, I noticed that recent tutorials haven't covered the usage of ov.utils.store_layers and ov.utils.retrieve_layers. I believe these are useful functions for saving and retrieving layer counts. Are there any more efficient tools available for this process?

Starlitnightly / omicverse

Some suggestions about OV #48