GuangyuWangLab2021 / cellDancer

Predict RNA velocity through deep learning
https://guangyuwanglab2021.github.io/cellDancer_website/
BSD 3-Clause "New" or "Revised" License

function adata_to_df_with_embed #2

Closed: Polligator closed this issue 1 year ago

Polligator commented 1 year ago

My understanding is that this function pulls splicing data from the Ms and Mu layers of the anndata object. I tried applying it to my own data, both with and without a gene list, and both attempts failed. After looking into the function, I think it is very problematic:

  1. It appears that the error I get is due to my splicing data being stored as a sparse matrix. I had to modify the sub-function "adata_to_raw_one_gene" to make it work. Please consider changing this function so that it is more likely to work on real-world data.

  2. It might be better to have a verbose setting to turn off the processing printouts, or to use a progress bar, something like: for i,gene in tqdm(enumerate(gene_list), total=len(gene_list)). I don't think those meaningless prints are helpful.

  3. Look at the following loop:

         for i,gene in enumerate(gene_list):
             data_onegene = adata_to_raw_one_gene(adata, us_para=us_para, gene=gene)
             if i==0:
                 data_onegene.to_csv(save_path,header=True,index=False)
             else:
                 data_onegene.to_csv(save_path,mode='a',header=False,index=False)

     This is insane, you do NOT write to disk every time you loop through a gene.
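
The progress bar and the single write at the end can be sketched roughly as follows. This is only an illustration under assumptions, not cellDancer's code: `extract_one_gene` is a hypothetical stand-in for `adata_to_raw_one_gene`, the column names are made up, and `genes_to_csv` and its `verbose` flag are invented for this example. Whether this is actually faster than appending per gene has not been measured here, and holding all frames in memory may be costly for large gene lists.

```python
import io

import pandas as pd

try:
    from tqdm import tqdm
except ImportError:
    # Fallback so the sketch still runs without tqdm: just return the iterable.
    def tqdm(iterable, **kwargs):
        return iterable


def extract_one_gene(gene, n_cells=3):
    """Hypothetical stand-in for adata_to_raw_one_gene; returns a toy frame."""
    return pd.DataFrame({
        "gene_name": [gene] * n_cells,
        "unsplice": [0.1 * i for i in range(n_cells)],
        "splice": [0.2 * i for i in range(n_cells)],
    })


def genes_to_csv(gene_list, target, verbose=True):
    """Collect one frame per gene in memory, then write the CSV once."""
    frames = []
    # disable=not verbose silences the progress bar when verbose is False.
    for gene in tqdm(gene_list, total=len(gene_list), disable=not verbose):
        frames.append(extract_one_gene(gene))
    combined = pd.concat(frames, ignore_index=True)
    combined.to_csv(target, index=False)  # single write, header included
    return combined


# Usage: write to an in-memory buffer instead of a real file path.
buffer = io.StringIO()
combined = genes_to_csv(["GeneA", "GeneB"], buffer, verbose=False)
```

Passing a file path instead of the buffer would write the same content to disk in one call.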
Abclisy commented 1 year ago

Thank you very much for the suggestions!

  1. For the sparse matrix problem, thank you for pointing this out! We will update our code in a later version to fix it.

     If other users run into this problem, a workaround in the current version of cellDancer (1.1.4) is to convert the data type in adata before running adata_to_df_with_embed(). For example: adata.layers['Mu']=adata.layers['Mu'].toarray()

  2. I agree that a progress bar is much more readable for users. We will add this in a later version.

  3. Regarding writing to disk, we have proposed and tested several approaches for this preprocessing step, and the current version is the most time-efficient. Do you have any suggestions for improving it? We are continuing to improve our preprocessing pipeline.
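
For anyone applying the sparse-matrix workaround from point 1, here is a minimal sketch of the conversion with a guard so that layers that are already dense are left untouched. It makes a few assumptions: a plain dict stands in for `adata.layers`, the helper name `densify_layers` is made up for this example, and converting both 'Ms' and 'Mu' is a guess based on the layers mentioned in this thread. Note that a dense copy of a large matrix can use a lot of memory.

```python
import numpy as np
from scipy import sparse


def densify_layers(layers, keys=("Ms", "Mu")):
    """Convert the named layers to dense arrays if they are scipy sparse."""
    for key in keys:
        if sparse.issparse(layers[key]):
            layers[key] = layers[key].toarray()
    return layers


# Toy stand-in for adata.layers: one sparse layer, one already dense.
layers = {
    "Ms": sparse.csr_matrix(np.eye(3)),
    "Mu": np.ones((3, 3)),
}
densify_layers(layers)
```

With a real AnnData object, the same helper would presumably be called as `densify_layers(adata.layers)` before `adata_to_df_with_embed()`, since `adata.layers` supports dict-style access.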

Abclisy commented 1 year ago

cellDancer now shows a progress bar when converting between adata and a pandas dataframe (https://github.com/GuangyuWangLab2021/cellDancer/commit/9a75b2be9771f86a0e8e70ca9f1d2cf20fca273b). Users can download the source code and reinstall it with pip install 'your_path/Source Code/cellDancer'. This version will be released on PyPI later. We are continuing to improve our preprocessing pipeline. We will close this issue for now, but please don't hesitate to reopen it or create a new issue if you have further questions or concerns. Thank you for your understanding.