ShobiStassen / PARC

MIT License

Bugs in Example Usage 3 #7

Closed esimonds closed 4 years ago

esimonds commented 4 years ago

There are a few bugs and challenges when working with Example Usage 3 on the GitHub home page.

Finding the data

For starters, it's hard to find the appropriate raw data. The example currently has the link below: raw datafile. However, the relevant dataset is very hard to find from that portal.

I believe this is a direct link to the dataset used in the example: direct link source. Note that after decompressing the .tar.gz archive, the folder needs to be renamed from filtered_matrices_mex to zheng17_filtered_matrices_mex and moved to a subfolder called "data" to match the code in Example Usage 3.
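The unpack, rename, and move steps described above could be scripted roughly as follows. This is only a sketch and is not part of the PARC example: `unpack_and_rename` is a hypothetical helper, and the archive path and both folder names are passed in by the caller rather than hard-coded.

```python
import shutil
import tarfile
from pathlib import Path


def unpack_and_rename(archive, extracted_name, target_name, data_dir="data"):
    """Extract a .tar.gz archive, then move one extracted folder to data_dir/target_name.

    Hypothetical helper: every name is supplied by the caller.
    """
    archive = Path(archive)
    data_dir = Path(data_dir)
    target = data_dir / target_name
    if target.exists():
        # A previous run already prepared the folder, so do nothing.
        return target

    # Extract next to the archive so the original folder name is preserved.
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(path=archive.parent)

    extracted = archive.parent / extracted_name
    if not extracted.is_dir():
        raise FileNotFoundError(f"expected folder {extracted} after extraction")

    # Create the data folder if needed, then move and rename in one step.
    data_dir.mkdir(parents=True, exist_ok=True)
    shutil.move(str(extracted), str(target))
    return target
```

Called with the downloaded archive and the two folder names mentioned above, it should leave the renamed folder under data/, which is where Example Usage 3 looks for it.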

Finding the annotations

Also, the annotations file needs to be downloaded from the link in Example Usage 2, then renamed and moved to match the code in Example Usage 3: annotations_zhang.txt --> data/zheng17_annotations.txt
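Since the example expects its inputs at specific paths, a quick check before running it can catch a misplaced or misnamed file early. A minimal sketch, where `missing_inputs` is a hypothetical helper and the paths you pass in are whatever your copy of the example reads:

```python
from pathlib import Path


def missing_inputs(paths):
    """Return the paths that do not exist on disk, as strings, in input order."""
    return [str(p) for p in map(Path, paths) if not p.exists()]
```

For instance, `missing_inputs(["data/zheng17_annotations.txt"])` returns an empty list once the annotations file is in place.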

Fixing some typos

The example code refers to "adata2" in two places, but it should be "adata":

# BAD CODE:
# pre-process as per Zheng et al., and take first 50 PCs for analysis
sc.pp.recipe_zheng17(adata)
sc.tl.pca(adata, n_comps=50)
# setting small_pop to 50 cleans up some of the smaller clusters, but can also be left at the default 10
parc1 = parc.PARC(adata2.obsm['X_pca'], true_label = annotations, jac_std_global=0.15, random_seed =1, small_pop = 50)  
parc1.run_PARC() # run the clustering
parc_labels = parc1.labels
adata2.obs["PARC"] = pd.Categorical(parc_labels)

should be:

# GOOD CODE:
# pre-process as per Zheng et al., and take first 50 PCs for analysis
sc.pp.recipe_zheng17(adata)
sc.tl.pca(adata, n_comps=50)
# setting small_pop to 50 cleans up some of the smaller clusters, but can also be left at the default 10
parc1 = parc.PARC(adata.obsm['X_pca'], true_label = annotations, jac_std_global=0.15, random_seed =1, small_pop = 50)  
parc1.run_PARC() # run the clustering
parc_labels = parc1.labels
adata.obs["PARC"] = pd.Categorical(parc_labels)

Adding some missing steps for scanpy UMAP

# OLD CODE:
//visualize
sc.pl.umap(adata, color='annotations')
sc.pl.umap(adata, color='PARC')

# NEW CODE (includes some missing steps to allow scanpy to calculate a UMAP embedding)
# visualize
sc.settings.n_jobs=4
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)
sc.tl.umap(adata)
sc.pl.umap(adata, color='annotations')
sc.pl.umap(adata, color='PARC')

# Ignore the "no transformation for parallel execution was possible" warnings
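If those warnings are distracting, they can probably be filtered by message text with Python's standard `warnings` module. This is a sketch under two assumptions: that the warning is raised through the regular `warnings` machinery, and that its message contains the phrase quoted above. `silence_parallel_warning` is a hypothetical name.

```python
import re
import warnings

# Phrase quoted from the warning text mentioned above.
PHRASE = "no transformation for parallel execution was possible"


def silence_parallel_warning():
    """Ignore any warning whose message contains PHRASE."""
    # filterwarnings matches its regex against the start of the message,
    # so allow any prefix; (?s) lets '.' cross line breaks in case the
    # message spans several lines.
    warnings.filterwarnings("ignore", message=r"(?s).*" + re.escape(PHRASE))
```

Call `silence_parallel_warning()` once before `sc.pp.neighbors(...)`; other warnings are left untouched.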

This code should produce the following output:

Embedding a total of 2 separate connected components using meta-embedding (experimental)
  n_components
# and two pretty plots

My final script is attached: PARCdemo3.txt

ShobiStassen commented 4 years ago

Thanks again Erin, I've cleaned up the links to the annotations and data files. Hope PARC is working for you.

esimonds commented 4 years ago

Cool! Yep, PARC is working well for me. I successfully ran the three Example Usage demo datasets (after fixing the bugs above). I'm now integrating PARC into my existing CyTOF analysis pipeline so I can try it out on some of my own data. Thanks for all of your hard work on this!