Hoohm / CITE-seq-Count

A tool that allows to get UMI counts from a single cell protein assay
https://hoohm.github.io/CITE-seq-Count/
MIT License

Different results each time CITE-seq-Count is run #165

Open colin986 opened 2 years ago

colin986 commented 2 years ago

Hi,

I'm getting a different output each time CITE-seq-Count is run. My whitelist and parameters do not change between runs.

Is this expected? Is there any way to control this for reproducibility (e.g. by setting a seed)?

Thanks, Colin

Hoohm commented 2 years ago

Hey Colin, this is really strange, as there is no randomness in the code; it should produce pretty much the exact same output each time for the same parameters.

Could you show me some examples?

colin986 commented 2 years ago

Hi Hoohm,

Thanks for coming back to me.

You were right. The CITE-seq count output is the same each time.

The variation in the result seems to come from the HTODemux function in Seurat when using the clara clustering option (with kmeans clustering the output is consistent). The function has an option to set the seed, but I've still found that the demultiplexing result changes each time I re-run CITE-seq-Count. To be clear: HTODemux is reproducible when given the same CITE-seq-Count output, and CITE-seq-Count itself appears to be reproducible. However, when I re-run both CITE-seq-Count and HTODemux I get a different result, and I don't understand why this is happening.

I know HTODemux draws 100 samples from the dataset for clara clustering. I wonder whether CITE-seq-Count writes the same data in a different order on each run, so that the 100 samples drawn are different, and whether that gives rise to the variability in the output?

Thanks, Colin

johnyaku commented 1 week ago

I can verify "different" CITE-seq-count results on different runs.

The difference is in the column order, not in the actual content of the count matrices. Reordering the columns to match each other (or the whitelist) results in identical matrices.
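As a minimal sketch of that check (the barcodes and counts below are made up for illustration, and `reorder_columns` is a hypothetical helper, not part of CITE-seq-Count), two matrices that differ only in column order compare unequal as written but equal once one is realigned to the other's barcode order:

```python
def reorder_columns(barcodes, matrix, target_order):
    """Return the matrix with its columns rearranged to follow target_order."""
    position = {bc: i for i, bc in enumerate(barcodes)}
    idx = [position[bc] for bc in target_order]
    return [[row[i] for i in idx] for row in matrix]

# Two hypothetical runs: same counts per barcode, different column order.
run_a_barcodes = ["AAAC", "CCGT", "GGTA"]
run_a_counts = [[5, 0, 2],
                [1, 7, 0]]
run_b_barcodes = ["GGTA", "AAAC", "CCGT"]
run_b_counts = [[2, 5, 0],
                [0, 1, 7]]

# Compared as written, the matrices look different.
raw_equal = run_a_counts == run_b_counts

# After aligning run B to run A's barcode order, they are identical.
aligned_b = reorder_columns(run_b_barcodes, run_b_counts, run_a_barcodes)
aligned_equal = run_a_counts == aligned_b

print(raw_equal, aligned_equal)
```

The same alignment could be done against the whitelist instead of against another run, as long as both runs are aligned to the same reference order.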

I haven't been able to pin down the source of the variation. I can't see any random functions. Initially I suspected parallelization, with different chunks finishing in different orders depending on the run, but the problem persists even with only one thread.

This difference in ordering produces different assignments from Seurat::HTODemux() when kfunc='clara' (the default). In the good quality dataset where I have been testing this, assignments are different for about 5% of total barcodes. In a low or even medium quality dataset I suspect the variability might be worse.

I haven't looked at why, but @colin986's suggestion that different ordering might produce different sampling (even with the same seed) seems plausible to me.
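The sketch below illustrates that hypothesis only; it is not the sampling code used by HTODemux or clara, and `sample_barcodes` is a hypothetical helper. It assumes the sampler picks column positions from a seeded generator, in which case the same seed selects the same positions but different barcodes when the column order changes:

```python
import random

def sample_barcodes(barcodes, k, seed):
    """Pick k column positions with a seeded generator and return the
    barcodes found at those positions (assumed sampling scheme)."""
    rng = random.Random(seed)
    picked = rng.sample(range(len(barcodes)), k)
    return [barcodes[i] for i in picked]

# Same 20 hypothetical barcodes, written in two different column orders.
barcodes_run_a = [f"BC{i:03d}" for i in range(20)]
barcodes_run_b = list(reversed(barcodes_run_a))

sample_a = sample_barcodes(barcodes_run_a, 5, seed=42)
sample_a_again = sample_barcodes(barcodes_run_a, 5, seed=42)
sample_b = sample_barcodes(barcodes_run_b, 5, seed=42)

# Same seed and same order: the sample is reproducible.
print(sample_a == sample_a_again)
# Same seed but different order: a different set of barcodes is sampled.
print(set(sample_a) == set(sample_b))
```

Under this assumption, fixing the seed alone is not enough; the input order has to be fixed as well for the sampled subset to match between runs.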

Setting kfunc = 'kmeans' results in consistent demux assignments, despite the difference in ordering.

For now I am reordering CITE-seq-count outputs based on the whitelist, and also using kmeans rather than clara.
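A rough sketch of that whitelist-based reordering, assuming the counts have already been loaded as a list of rows plus a list of column barcodes (`order_by_whitelist` and the example data are hypothetical, and any barcode missing from the whitelist is simply appended in sorted order so the result stays deterministic):

```python
def order_by_whitelist(barcodes, matrix, whitelist):
    """Reorder matrix columns to follow the whitelist order.

    Whitelist barcodes absent from the output are skipped; output barcodes
    absent from the whitelist are appended in sorted order.
    """
    position = {bc: i for i, bc in enumerate(barcodes)}
    # dict.fromkeys drops duplicate whitelist entries while keeping order.
    ordered = [bc for bc in dict.fromkeys(whitelist) if bc in position]
    seen = set(ordered)
    ordered += sorted(bc for bc in barcodes if bc not in seen)
    idx = [position[bc] for bc in ordered]
    return ordered, [[row[i] for i in idx] for row in matrix]

# Hypothetical whitelist and one run's output in arbitrary column order.
whitelist = ["AAAC", "CCGT", "GGTA", "TTAG"]
barcodes = ["GGTA", "AAAC", "CCGT"]
counts = [[2, 5, 0],
          [0, 1, 7]]

ordered_barcodes, ordered_counts = order_by_whitelist(barcodes, counts, whitelist)
print(ordered_barcodes)
print(ordered_counts)
```

Applying the same function to every run before demultiplexing gives each run the same column order regardless of how the output was written.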

Hoohm commented 1 week ago

Thank you for looking into this. I was afraid there was a bug I had missed in my code, but the downstream issues seem more plausible. Btw, if you are interested in testing it out, I have a beta branch rewritten in Polars that is available. Some input names have changed, but overall it should decrease memory usage and improve speed.

johnyaku commented 1 week ago

Thanks @Hoohm. I'll check out the beta branch when I get a moment.

I'm not sure if it is worth making a feature request, but I do think it would be helpful if CITE-seq-count produced identical output for identical input (including sort order).