bowang-lab / scGPT

https://scgpt.readthedocs.io/en/latest/
MIT License

Not able to reproduce GRN results in Tutorial_GRN #208

Open 3XPLOSIV3 opened 3 months ago

3XPLOSIV3 commented 3 months ago

Hello,

I have been working with scGPT for the last week and I am currently studying the GRN part. Running the Tutorial_GRN notebook, with the data and model downloaded from the links provided, I am unable to reproduce the article figure (i.e. the saved outputs of the notebook).

Going through it step by step, it seems that the Preprocessor object is causing the trouble. After preprocessing, the CD3E gene is no longer in the dataset, and I assume that is why the figure cannot be reproduced.

I ran the following (the same code as provided, with some print statements added):

# Specify data path; here we load the Immune Human dataset
data_dir = Path("./data")
logger.info("Loading data ...")
adata = sc.read(str(data_dir / "Immune_ALL_human.h5ad"), cache=True)  # 33506 × 12303
ori_batch_col = "batch"
adata.obs["celltype"] = adata.obs["final_annotation"].astype(str)
data_is_raw = False
logger.info(f"CD3E gene in data : {'CD3E' in adata.var.index.tolist()}, CD3G : {'CD3G' in adata.var.index.tolist()}, CD3D : {'CD3D' in adata.var.index.tolist()}")
logger.info(f"Initial number of genes : {adata.shape[1]}")

# Preprocess the data following the scGPT data pre-processing pipeline
preprocessor = Preprocessor(
    use_key="X",  # the key in adata.layers to use as raw data
    filter_gene_by_counts=3,  # step 1
    filter_cell_by_counts=False,  # step 2
    normalize_total=1e4,  # 3. whether to normalize the raw data and to what sum
    result_normed_key="X_normed",  # the key in adata.layers to store the normalized data
    log1p=data_is_raw,  # 4. whether to log1p the normalized data
    result_log1p_key="X_log1p",
    subset_hvg=n_hvg,  # 5. whether to subset the raw data to highly variable genes
    hvg_flavor="seurat_v3" if data_is_raw else "cell_ranger",
    binning=n_bins,  # 6. whether to bin the raw data and to what number of bins
    result_binned_key="X_binned",  # the key in adata.layers to store the binned data
)
preprocessor(adata, batch_key="batch")
logger.info(f"CD3E gene in data : {'CD3E' in adata.var.index.tolist()}, CD3G : {'CD3G' in adata.var.index.tolist()}, CD3D : {'CD3D' in adata.var.index.tolist()}")
print(f"Final number of genes : {adata.shape[1]}")

And obtain this:

scGPT - INFO - Loading data ...
scGPT - INFO - CD3E gene in data : True, CD3G : True, CD3D : True
scGPT - INFO - Initial number of genes : 12303
scGPT - INFO - Filtering genes by counts ...
scGPT - INFO - Normalizing total counts ...
scGPT - INFO - Subsetting highly variable genes ...
scGPT - INFO - Binning data ...
scGPT - INFO - CD3E gene in data : False, CD3G : False, CD3D : False
Final number of genes : 1200

I haven't changed either the data or the Preprocessor parameters provided, so I don't understand why the result is different...
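For reference, the HVG selection step can also be run on its own, outside the Preprocessor, to see whether CD3E survives it (a minimal sketch, assuming the same n_hvg = 1200 and cell_ranger flavor as in the tutorial; it may not match the Preprocessor internals exactly):

import scanpy as sc

# Re-load the raw dataset and mimic the Preprocessor steps up to HVG selection.
adata_check = sc.read(str(data_dir / "Immune_ALL_human.h5ad"), cache=True)
sc.pp.normalize_total(adata_check, target_sum=1e4)
# data_is_raw is False, so the data is already log-transformed; no log1p here.
sc.pp.highly_variable_genes(
    adata_check, n_top_genes=1200, flavor="cell_ranger", batch_key="batch"
)
hvg_genes = adata_check.var.index[adata_check.var["highly_variable"]]
print("CD3E selected as HVG:", "CD3E" in hvg_genes)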

Thank you for your answer!

subercui commented 2 months ago

Hi, thank you for the question. @ChloeXWang , could you please help with this?

klshen8386 commented 2 months ago

Hi @subercui @ChloeXWang, I am also experiencing this reproducibility issue. Has any insight been gained towards the solution? Thanks!

3XPLOSIV3 commented 1 month ago

Hi @klshen8386, by luck I found the source of the issue (I was actually dealing with installation errors regarding flash-attn): the scanpy version in the environment file is specified with ">=" rather than "==". As a result, the latest version gets installed, which changes the algorithm used for highly variable gene selection.

If you use scanpy==1.9.1, that should solve the issue (I don't know the exact version used by @subercui @ChloeXWang though).
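A quick way to confirm which scanpy version is actually installed in the environment (just a convenience check, not part of the tutorial):

import scanpy
print(scanpy.__version__)  # expect 1.9.1 for the tutorial results to reproduce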

NB: when installing scanpy, the version installed for the igraph dependency can induce an error in Leiden clustering (sc.tl.leiden). To solve that, I used igraph==0.9.11. Also check that the leidenalg dependency is installed with a version >= 0.8 (it should be by default). Cf. https://github.com/scverse/scanpy/issues/2341
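To check that the pinned igraph/leidenalg combination works, Leiden clustering can be run on a small bundled dataset first (a minimal sketch using scanpy's pbmc68k_reduced sample data, not the tutorial data):

import scanpy as sc

# Sanity check that sc.tl.leiden runs with the installed igraph/leidenalg versions.
adata_test = sc.datasets.pbmc68k_reduced()
sc.pp.neighbors(adata_test, n_neighbors=15)
sc.tl.leiden(adata_test, resolution=1.0)
print(adata_test.obs["leiden"].value_counts())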

I also had an issue with the networkx package; I now use networkx==3.2.1 (I don't know whether it is related, just mentioning it in case you have the same issue).

Hope it works for you!