Seed argument for publication reproducibility

t-carroll commented 2 years ago

Hi Tinyi,

I've had some issues generating reproducible results in preparation for publishing code: I can't seem to reproduce the same result, even when setting the seed explicitly. For instance, running the same call with the same seed on some previously published data from Maag et al. gives two different results:

set.seed(42)

ted.maag = run.Ted(ref.dat = group, X = maag2, cell.type.labels = labs, cell.subtype.labels = subs, n.cores = 24, tum.key = "EAC", input.type = "scRNA")

set.seed(42)

ted.maag2 = run.Ted(ref.dat = group, X = maag2, cell.type.labels = labs, cell.subtype.labels = subs, n.cores = 24, tum.key = "EAC", input.type = "scRNA")

all.equal(ted.maag$res$final.gibbs.theta,ted.maag2$res$final.gibbs.theta)

[1] "Mean relative difference: 0.001463005"

all.equal(ted.maag$res$final.gibbs.theta[,"EAC"],ted.maag2$res$final.gibbs.theta[,"EAC"])

[1] "Mean relative difference: 0.0006727291"

So it looks like setting the seed for the global R environment is insufficient, perhaps due to some quirk of the multicore parallelism (I'm doing this in RStudio on a CentOS Linux HPC, if relevant). Is there a way to set the seed internally for whichever internal functions are sampling, in order to enable reproducible results? If so, perhaps an optional seed argument could be passed to run.Ted(), which could then in turn be passed to these internal sampling functions. Do you think that sort of thing would be feasible? Happy to try and help out if so (though I'm not as familiar with the workings of these internal functions).
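For readers who run into the same symptom, here is a minimal sketch of the general pattern the request describes: an optional seed argument that is applied inside the function before forked workers draw random numbers. It uses base R's parallel package and the "L'Ecuyer-CMRG" generator. The function `sample_in_parallel` and its arguments are hypothetical stand-ins; this is not run.Ted, and it makes no claim about how TED handles seeding internally.

```r
library(parallel)

# Hypothetical stand-in for a function whose parallel workers draw random
# numbers. This is NOT run.Ted or any TED internal.
sample_in_parallel <- function(n.tasks, n.cores, seed = NULL) {
  if (!is.null(seed)) {
    # L'Ecuyer-CMRG supports separate RNG streams for forked workers.
    # Note: this changes the RNG kind for the rest of the session.
    RNGkind("L'Ecuyer-CMRG")
    set.seed(seed)
    # Restart the stream handed to child processes from the seed just set.
    mc.reset.stream()
  }
  unlist(mclapply(seq_len(n.tasks), function(i) rnorm(1), mc.cores = n.cores))
}

# Forking is unavailable on Windows, so fall back to a single core there.
cores <- if (.Platform$OS.type == "windows") 1L else 2L

a <- sample_in_parallel(4, cores, seed = 42)
b <- sample_in_parallel(4, cores, seed = 42)

# Two calls with the same seed are expected to give identical draws.
print(identical(a, b))
```

The sketch assumes the default `mc.preschedule = TRUE` and the same number of cores in both calls; changing either can change which worker handles which task, and therefore the draws.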

tinyi commented 2 years ago

Hi Tom,

I have updated the GitHub repository to address the reproducibility issue. Simply reinstall the package and pass seed = <your seed number> as an argument to run.Ted.

Please let me know if you have any questions.

Best,

Tinyi

t-carroll commented 2 years ago

Hi Tinyi,

Great, thanks for the quick update! I tried it out, and all.equal(ted.maag$res$final.gibbs.theta,ted.maag2$res$final.gibbs.theta) now returns TRUE when the same seed is set in both calls. Closing now.

-Tom

Danko-Lab / TED

Seed argument for publication reproducibility #17