bhavaygg / InSTAnT

Intracellular Transcriptomic Analysis Toolkit (InSTAnT)
MIT License

Issue with InSTAnT Package and Data Preparation #6

Open kulansam opened 3 weeks ago

kulansam commented 3 weeks ago

Hi,

Thank you for developing the InSTAnT package for exploring the patterns of gene-pair co-localization in spatial transcriptomics data.

I have a MERFISH dataset and would like to use this software to detect the co-localization patterns of genes. However, when I attempt to import my data using the 'obj.preprocess_and_load_data' and 'obj.load_preprocessed_data' functions, I encounter an error.

For example, when I run the following command: obj.preprocess_and_load_data(expression_data='./detected_transcripts.csv', barcode_data='./codebook.csv') I receive the following error: KeyError: "['bit_barcode'] not in index"

  1. "detected_transcripts.csv" content:
     ,barcode_id,global_x,global_y,global_z,x,y,fov,gene,transcript_id,cell_id
     288,218,11622.136,6416.4404,0.0,1375.147,762.6618,0,genename,ENSMUSTXXXX,-1
     18,242,11647.205,6354.178,0.0,1607.2699,186.16031,0,genename,ENSMUSTXXXX,-1
  2. "codebook.csv" content:
     name,id,barcodeType,V0001T8B1,V0002T8B1,V0003T8B1,V0004T8B1,V0005T8B1,V0006T8B1,V0007T8B1,V0008T8B1,V0009T8B1,V0010T8B1,V0011T8B1,V0012T8B1,V0013T8B1,V0014T8B1,V0015T8B1,V0016T8B1,V0017T8B1,V0018T8B1,V0019T8B1,V0154T8B1,V0021T8B1
     genename,ENSMUSTXXXX,merfish,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,1,0,
     genename,ENSMUSTXXXX,merfish,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,

Could you please let me know how to resolve this error? It would be greatly appreciated if you could provide guidance on how to prepare the input tables for running InSTAnT.

Also, could you tell me what the preprocessed file contains (the example file in your tutorial: obj.load_preprocessed_data(data = f'data/u2os_new/data_processed.csv'))?

Thank you for your assistance!

bhavaygg commented 3 weeks ago

Hi,

I would suggest using a preprocessed .csv file if available. The format of the file is the same as described in the README, and I am adding it below as well:

gene uID absX absY
AKAP11 2 -1401.666 -2956.618
SIPA1L3 3 -1411.692 -2936.609
THBS1 925 -764.6989 -1604.828

The four columns are gene (gene name), uID (cell ID), absX (X coordinate), absY (Y coordinate) and if the data is 3D, you can add absZ as well.

If you have a file in this format, you can use the obj.load_preprocessed_data() function. Please let me know if you need any additional help.

kulansam commented 3 weeks ago

Hi,

Thank you for your quick response.

I have my MERFISH data ready after nucleus-level cell segmentation using CellPose. Could you please guide me on how to prepare the specific input file from the output? I am unsure how to obtain the absX (X coordinate), absY (Y coordinate), and uID (cell ID) values.

However, in my case, the detected_transcripts.csv file contains the following information. I am planning to use this file to prepare the input file for InSTAnT run.

index,barcode_id,global_x,global_y,global_z,x,y,fov,gene,transcript_id,cell_id
288,218,11622.136,6416.4404,0.0,1375.147,762.6618,0,genename,ENSMUSTXXXX,3965824400107100661
18,242,11647.205,6354.178,0.0,1607.2699,186.16031,0,genename,ENSMUSTXXXX,-1

I am planning to extract the following columns: gene, cell_id (UID), global_x, global_y, and global_z (3D -level). Could you please confirm if this is the correct approach?

Additionally, it would be incredibly helpful if you could provide specific code or a pipeline to help me format this file correctly.

Thank you for your assistance!

bhavaygg commented 3 weeks ago

Hi,

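The file you mention looks correct. You just need to rename the columns to match the format above: gene stays gene, cell_id becomes uID, global_x becomes absX, global_y becomes absY, and global_z becomes absZ.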

There is no additional preprocessing needed on top of this if the column names are correct.
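
For reference, the renaming described above could be sketched as follows. This is only an illustration, not part of InSTAnT: convert_transcripts and COLUMN_MAP are hypothetical names, the output is assumed to be a comma-separated file, and dropping rows with cell_id equal to -1 rests on the assumption that -1 marks transcripts not assigned to any cell. The file is streamed row by row so that a very large detected_transcripts.csv never has to fit in memory.

```python
import csv

# Mapping from the detected_transcripts.csv headers quoted in this thread
# to the column names listed for obj.load_preprocessed_data().
COLUMN_MAP = {
    "gene": "gene",
    "cell_id": "uID",
    "global_x": "absX",
    "global_y": "absY",
    "global_z": "absZ",
}


def convert_transcripts(src_path, dst_path, skip_cell_id="-1"):
    """Stream src_path and write only the renamed columns to dst_path.

    Rows whose cell_id equals skip_cell_id are dropped (assumption: -1 marks
    transcripts outside any segmented cell). Returns the number of rows written.
    """
    written = 0
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=list(COLUMN_MAP.values()))
        writer.writeheader()
        for row in reader:
            if row["cell_id"] == skip_cell_id:
                continue
            writer.writerow({new: row[old] for old, new in COLUMN_MAP.items()})
            written += 1
    return written
```

The resulting file can then be passed to obj.load_preprocessed_data(). Pass skip_cell_id=None to keep every row.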

kulansam commented 2 weeks ago

Thank you for your assistance. I have successfully loaded the data; however, I am encountering a memory issue when running the run_ProximalPairs3D() function. Here is the error message I received:

Running PP-3D now on 4 threads for, 115372 cells, 108834798 transcripts
/var/spool/slurmd/job3601435/slurm_script: line 16: 3742652 Killed python instant_colocai.py

slurmstepd-compute-7-6: error: Detected 2 oomkill events in StepId=3601435.batch. Some of the step tasks have been OOM Killed.

I am currently using a system with 100 GB of memory and 4 threads. Could you please help me troubleshoot this problem?

bhavaygg commented 2 weeks ago

Could you also let me know the size of your gene panel? Also, is it possible to request more memory? I would suggest randomly sampling cells down to fewer than 20k (around 10k is a good target) so that the run finishes in a reasonable time, given that you only have 4 threads. The algorithm constructs a Cells × Genes × Genes matrix, which is the main source of memory consumption.

kulansam commented 2 weeks ago

Can you let me know the size of your gene panel as well?
- 315 genes

Also, is it possible to ask for more memory?
- I have tried with 140 GB; still the same problem.

I would suggest randomly sampling cells to <20k (I would suggest around 10k) in order to run in a decent time given you only have 4 threads.
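
A back-of-the-envelope estimate shows why the Cells × Genes × Genes matrix matters here. The sketch below uses the numbers reported in this thread (115,372 cells, 315 genes) and assumes one dense array of 8-byte values; the dtype InSTAnT actually uses and any temporary copies it makes are not stated in this thread, so treat the result as an order-of-magnitude guide only. dense_tensor_gib is a hypothetical helper.

```python
def dense_tensor_gib(n_cells, n_genes, bytes_per_value=8):
    """Size in GiB of one dense n_cells x n_genes x n_genes array."""
    return n_cells * n_genes * n_genes * bytes_per_value / 2**30


# Numbers taken from this thread; 8 bytes per value is an assumption.
full_run = dense_tensor_gib(115_372, 315)
sampled_20k = dense_tensor_gib(20_000, 315)
sampled_10k = dense_tensor_gib(10_000, 315)

print(f"all cells: {full_run:.1f} GiB")
print(f"20k cells: {sampled_20k:.1f} GiB")
print(f"10k cells: {sampled_10k:.1f} GiB")
```

Under these assumptions a single copy for all cells is roughly 85 GiB, which leaves little headroom on a 100 GB node, while 20k cells need about 15 GiB.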

bhavaygg commented 2 weeks ago

I may miss some of the co-localization gene pairs, right?

This would be an issue in a diverse tissue like brain. Can I know which tissue you are running on? If cell types are annotated, you can sample based on cell type to get around 20k cells, which should allow you to run within 100 GB.
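
One way to sample based on cell type is sketched below. It is an illustration only: sample_cells_by_type is a hypothetical helper, the cell_types dictionary (cell ID to cell-type label) stands in for whatever annotation is available, and the min_per_type floor is an assumed way to keep rare types represented, so the result can slightly exceed n_target.

```python
import random
from collections import defaultdict


def sample_cells_by_type(cell_types, n_target, min_per_type=50, seed=0):
    """Return cell IDs sampled proportionally within each cell type.

    cell_types: dict mapping cell ID -> cell-type label (hypothetical annotation).
    Each type contributes at least min_per_type cells, or all of its cells
    if it has fewer than that.
    """
    rng = random.Random(seed)
    by_type = defaultdict(list)
    for cell_id, label in cell_types.items():
        by_type[label].append(cell_id)

    fraction = n_target / len(cell_types)
    chosen = []
    for label in sorted(by_type):
        ids = sorted(by_type[label])  # sorted for reproducibility
        k = min(len(ids), max(min_per_type, round(fraction * len(ids))))
        chosen.extend(rng.sample(ids, k))
    return chosen
```

The returned IDs can be used to filter the preprocessed transcript table on its uID column before loading it.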

kulansam commented 2 weeks ago

Can I know which tissue are you running on?

anurendra commented 2 weeks ago

@kulansam, depending on the question you're asking, sampling may not be an issue. For example, if you're interested in d-colocalization (global co-localization), you should mostly recover the colocalizing gene pairs, because d-colocalization is robust to the number of cells. You can test this yourself: first run on a sample of 10k cells, then run on a sample of 5k cells (a subset of the 10k sampled cells), and count the false positives and false negatives. The signal you may miss is colocalization specific to a rare cell type. If you also want to recover colocalization from rare cell types, you should sample based on cell type.
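
The robustness check described above could be organized roughly like this. Both helpers are hypothetical and independent of InSTAnT: nested_subsample draws the 5k cells from the 10k cells, and compare_calls treats the 10k result as the reference when counting false positives and false negatives. Gene pairs are treated as unordered here, which is an assumption about how the significant pairs are reported.

```python
import random


def nested_subsample(cell_ids, n, seed=0):
    """Draw n cell IDs from cell_ids, so the result is a subset of the input."""
    return random.Random(seed).sample(sorted(cell_ids), n)


def compare_calls(reference_pairs, test_pairs):
    """Compare significant gene pairs from two runs.

    reference_pairs: pairs called significant in the larger (10k) run.
    test_pairs: pairs called significant in the smaller (5k) run.
    """
    ref = {frozenset(pair) for pair in reference_pairs}
    test = {frozenset(pair) for pair in test_pairs}
    return {
        "shared": len(ref & test),
        "false_positives": len(test - ref),
        "false_negatives": len(ref - test),
    }
```

A small number of false positives and false negatives relative to the shared pairs would support running on the sampled cells.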

kulansam commented 2 weeks ago

Thank you! I am currently running the algorithm by downsampling cells based on the expression of specific genes, which significantly reduces the sample size. However, when I run the algorithm, I encounter the following error:

Running PP-3D now on 4 threads for, 12615 cells, 2366637 transcripts
/home/spatial/envs/instant/lib/python3.10/site-packages/anndata/_core/anndata.py:183: ImplicitModificationWarning: Transforming to str index.
  warnings.warn("Transforming to str index.", ImplicitModificationWarning)
Cell-wise Proximal Pairs Time : 90.4 seconds
STEP2: run_ProximalPairs3D DONE
Running Global Colocalization now on 4 threads
Number of cells: 106088, Number of genes: 9
Global Colocalization initialized ..
Low Precision Global Colocalization Time: 19.41 seconds
Traceback (most recent call last):
  File "/home//spatial/envs/instant/lib/python3.10/site-packages/pandas/io/excel/_base.py", line 1153, in __new__
    engine = config.get_option(f"io.excel.{ext}.writer", silent=True)
  File "/home//spatial/envs/instant/lib/python3.10/site-packages/pandas/_config/config.py", line 272, in __call__
    return self.func(*args, **kwds)
  File "/home//spatial/envs/instant/lib/python3.10/site-packages/pandas/_config/config.py", line 146, in _get_option
    key = _get_single_key(pat, silent)
  File "/home//spatial/envs/instant/lib/python3.10/site-packages/pandas/_config/config.py", line 132, in _get_single_key
    raise OptionError(f"No such keys(s): {repr(pat)}")
pandas._config.config.OptionError: No such keys(s): 'io.excel.csv.writer'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/lab-share/Cardio-Chen-e2/Public//colocalization_instant/instant_colocai.py", line 31, in <module>
    obj.run_GlobalColocalization(
  File "/home//spatial/envs/instant/lib/python3.10/site-packages/InSTAnT/InSTAnT.py", line 1448, in run_GlobalColocalization
    self._save_unstacked_pvals(unstacked_pvals_name, alpha_cellwise, min_transcript)
  File "/home//spatial/envs/instant/lib/python3.10/site-packages/InSTAnT/InSTAnT.py", line 1379, in _save_unstacked_pvals
    unstacked_global_pvals.to_excel(filename)
  File "/home//spatial/envs/instant/lib/python3.10/site-packages/pandas/core/generic.py", line 2345, in to_excel
    formatter.write(
  File "/home//spatial/envs/instant/lib/python3.10/site-packages/pandas/io/formats/excel.py", line 946, in write
    writer = ExcelWriter(  # type: ignore[abstract]
  File "/home//spatial/envs/instant/lib/python3.10/site-packages/pandas/io/excel/_base.py", line 1157, in __new__
    raise ValueError(f"No engine for filetype: '{ext}'") from err
ValueError: No engine for filetype: 'csv'

My code is below:

    sample_name = 'cntrl'
    obj.load_preprocessed_data(data = "./"+str(sample_name)+"processed_instant_detected_transcripts.csv")
    print("STEP1: DATA LOADED DONE")
    obj.run_ProximalPairs3D(
        distance_threshold = 4,
        min_genecount = 20,
        pval_matrix_name = str(sample_name)+"_pvals.pkl",
        gene_count_name = str(sample_name)+"_gene_count.pkl")
    print("STEP2: run_ProximalPairs3D DONE")

    obj.run_GlobalColocalization(
        high_precision = False,
        alpha_cellwise = 0.05,
        glob_coloc_name = str(sample_name)+"global_colocalization.csv",
        exp_coloc_name = str(sample_name)+"expected_colocalization.csv",
        unstacked_pvals_name = str(sample_name)+"unstacked_global_pvals.csv")
    print("STEP3: obj.run_GlobalColocalization DONE")

    # a_reordered is the DataFrame that holds the processed data
    a_reordered.columns = ['gene', 'uID', 'absX', 'absY', 'absZ']

    a_reordered1 = a_reordered[['uID', 'absX', 'absY', 'absZ']]
    a_reordered1.to_csv(str(sample_name)+"_cells_locations.csv", index=False, header=True)
    print("STEP4: obj.run_spatial_modulation DONE")
    obj.run_spatial_modulation(
        str(sample_name)+"_cells_locations.csv",
        inter_cell_distance = 100,
        spatial_modulation_name = str(sample_name)+"_spatial_modulation.csv")
    print("ALL STEPS DONE")

After successfully running the obj.run_ProximalPairs3D() function, I encountered an error while executing the obj.run_GlobalColocalization() function. Could you please help me resolve this issue?

bhavaygg commented 2 weeks ago

If you change the following line, it should work

unstacked_pvals_name = str(sample_name)+"unstacked_global_pvals.csv")

to

unstacked_pvals_name = str(sample_name)+"unstacked_global_pvals.xlsx")
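
As the traceback shows, the unstacked p-values are written with pandas' to_excel, which selects its writer from the file extension, so a name ending in .csv fails with "No engine for filetype". A small guard like the following can catch this before a long run starts. with_excel_suffix is a hypothetical helper, not part of InSTAnT.

```python
from pathlib import Path


def with_excel_suffix(filename):
    """Return filename with its extension replaced by .xlsx."""
    return str(Path(filename).with_suffix(".xlsx"))


sample_name = "cntrl"
unstacked_pvals_name = with_excel_suffix(sample_name + "unstacked_global_pvals.csv")
print(unstacked_pvals_name)
```

Writing .xlsx files with pandas also requires an Excel engine such as openpyxl to be installed in the environment.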