IGS / gEAR

The gEAR Portal was created as a data archive and viewer for gene expression data including microarrays, bulk RNA-Seq, single-cell RNA-Seq and more.
https://umgear.org
GNU Affero General Public License v3.0

tSNE/UMAP don't look right in the sc workbench #386

Closed beamilon closed 1 year ago

beamilon commented 1 year ago

@jorvis @adkinsrs @songeric1107 I used E14, mouse, scRNA-seq, cochlear epithelium (Kelley) and started a new analysis. The clustering looks horrible, not like a clustering at all; everything seems to overlap. I noticed this when I prepared the slides for the EARssentials workshop but didn't think much of it at the time. Katie noticed the same thing with different parameters and different datasets, even her own. We may add more examples tomorrow. image

songeric1107 commented 1 year ago

@beamilon, that is what I mentioned before: the problem is caused by the normalized values. All the Kelley datasets contain normalized values, not raw values, so it is not appropriate to use the sc workbench to re-normalize the log-normalized values. You should use the primary analysis for those datasets.

jorvis commented 1 year ago

This is part of an ongoing discussion about updating the datasets to record which are normalized and which aren't, then disabling parts of the interface where they shouldn't be used.
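As a rough illustration of how such a normalized-versus-raw distinction might be detected automatically, here is a minimal sketch. It assumes that raw count matrices contain only non-negative whole numbers, while log-normalized matrices contain fractional values. The function name `looks_like_raw_counts` is hypothetical and not part of gEAR or scanpy, and the heuristic would misclassify normalized data that happens to be rounded.

```python
from typing import Iterable


def looks_like_raw_counts(values: Iterable[float]) -> bool:
    """Heuristic: True if every value is a non-negative whole number.

    Assumption (not from gEAR): raw counts are integers, whereas
    log-normalized expression values are generally fractional.
    An empty input returns False because nothing can be concluded.
    """
    seen_any = False
    for v in values:
        seen_any = True
        # NaN and infinity fail is_integer(), so they count as "not raw".
        if v < 0 or not float(v).is_integer():
            return False
    return seen_any


if __name__ == "__main__":
    print(looks_like_raw_counts([0, 3, 12.0]))       # whole numbers only
    print(looks_like_raw_counts([0.0, 1.386, 2.3]))  # fractional values
```

In practice one would probably run this on a sample of the expression matrix rather than on every value.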

beamilon commented 1 year ago

Does it make sense that the tSNE function (instead of UMAP) works great? I know these are two different methods, but why would one be so messed up and not the other (top image)? I also used the raw datasets from Jan, and when choosing UMAP the result is really not good (middle image). Choosing tSNE instead brings up something expected for clustering (bottom image). I don't think the raw versus normalized matrix is the problem. Something is wrong with the UMAP function, and that was not the case before. image image image

songeric1107 commented 1 year ago

@jorvis, I compared a saved analysis from before with a new one that I ran just now. I agree with @beamilon that there might be a bug in the UMAP display, since the plots differ even though I used the same parameters.

test dataset with raw values https://umgear.org/analyze_dataset.html?dataset_id=e084843c-32b0-4551-7307-0942eaa45756

saved analysis from before: Screen Shot 2022-07-28 at 10 48 50 AM

new analysis today:

Screen Shot 2022-07-28 at 10 51 50 AM

adkinsrs commented 1 year ago

We haven't really messed with how these plots are generated, but I'll take a look today. I'm going to run this analysis in my local copy of gEAR, since it is isolated from anything in the main gEAR.

adkinsrs commented 1 year ago

I believe this is definitely a scanpy issue. See https://github.com/scverse/scanpy/issues/2291 for examples.

The person who reported it was using scanpy v1.8.1; I am using 1.7.2 and get results similar to the plots above.

beamilon commented 1 year ago

Yes, that is definitely it. Was the scanpy version related to gEAR updated recently?

adkinsrs commented 1 year ago

I believe that scanpy was last updated around May this year, around the time of the workshop. I remember @jorvis having to downgrade scanpy from ~1.8.1 to 1.7.2 due to some other reported issues (#318).

adkinsrs commented 1 year ago
Screen Shot 2022-07-29 at 11 16 55 AM

Our codebase (scanpy 1.7.2) has this line, which sets n_epochs to 0 if the "maxiter" argument is not provided. This n_epochs=0 is what is causing the bad umap plots.

https://github.com/scverse/scanpy/blame/8c0764243675fc25cbcbefe66e6555e083993956/scanpy/tools/_umap.py#L192

Looking at the scanpy code, I see they changed something in the umap calculations script 9 months ago that addressed a separate issue, and I believe that upgrading scanpy would resolve this particular issue for us. The scanpy.tl.umap function now sets the default n_epochs to 500 or 200 depending on the size of the nearest-neighbors connectivity table created by scanpy.pp.neighbors. Source -> https://github.com/scverse/scanpy/blob/41a7b830acb0c05ca4cbf0bea97e3fa17545f12c/scanpy/tools/_umap.py#L192-L193
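The difference between the two versions can be sketched in plain Python. This is a simplified paraphrase of the behavior described above, not the actual scanpy source. The function names are hypothetical, and the 10,000-observation cutoff is my reading of the linked lines, so it should be checked against them.

```python
from typing import Optional


def n_epochs_old(maxiter: Optional[int]) -> int:
    """Behavior described for scanpy 1.7.2: no maxiter means 0 epochs."""
    return 0 if maxiter is None else maxiter


def n_epochs_new(n_obs: int, maxiter: Optional[int]) -> int:
    """Behavior described for newer scanpy: a size-dependent default.

    n_obs stands for the number of rows in the connectivity matrix
    built by scanpy.pp.neighbors. The 10_000 cutoff is an assumption.
    """
    default_epochs = 500 if n_obs <= 10_000 else 200
    return default_epochs if maxiter is None else maxiter


if __name__ == "__main__":
    print(n_epochs_old(None))          # the problematic default
    print(n_epochs_new(5_000, None))   # smaller dataset
    print(n_epochs_new(20_000, None))  # larger dataset
```

In both versions an explicit maxiter is passed through unchanged, which is why supplying one sidesteps the problem without upgrading.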

I set maxiter=200 and got the UMAP plot above, and I get a reasonably similar plot if I set it to 500 as well. However, if I set maxiter to 0, 1, or some other very low number, I get plots similar to what is being reported. For now I am adding "maxiter=500" to the scanpy.tl.umap call, and whenever we update scanpy to a newer version I will explore removing this option (which is not a necessity).
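The workaround described above can be sketched as a small wrapper that always passes an explicit maxiter. This is an illustration only and not the actual gEAR change. The helper names `umap_kwargs` and `run_umap` are hypothetical, while `scanpy.tl.umap` and its `maxiter` parameter are the real API discussed in this thread.

```python
from typing import Any, Dict, Optional

DEFAULT_MAXITER = 500  # the value chosen as the workaround in this thread


def umap_kwargs(maxiter: Optional[int] = None, **extra: Any) -> Dict[str, Any]:
    """Build keyword arguments for scanpy.tl.umap with maxiter always set."""
    if maxiter is None:
        maxiter = DEFAULT_MAXITER
    if maxiter <= 0:
        # Per this thread, 0 epochs yields the overlapping plots.
        raise ValueError("maxiter must be a positive integer")
    return {**extra, "maxiter": maxiter}


def run_umap(adata: Any, maxiter: Optional[int] = None, **extra: Any) -> None:
    """Run UMAP on an AnnData object that already has neighbors computed."""
    import scanpy as sc  # imported lazily so the helper above has no dependency

    sc.tl.umap(adata, **umap_kwargs(maxiter, **extra))


if __name__ == "__main__":
    print(umap_kwargs())
    print(umap_kwargs(200, min_dist=0.3))
```

Keeping the default in one constant makes it easy to drop the override once a scanpy version with a sensible built-in default is in use.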

adkinsrs commented 1 year ago

Edited gear-prod directly to fix this issue in this file, since the workshop is approaching.

bsierieb1 commented 1 year ago

can confirm that setting maxiter=500 in scanpy.tl.umap fixed the issue for me