biocore / songbird

Vanilla regression methods for microbiome differential abundance analysis
BSD 3-Clause "New" or "Revised" License

ENH: Set random seeds #101

Closed gwarmstrong closed 4 years ago

gwarmstrong commented 4 years ago

Songbird random seeds

Current behavior:

When I run:

$ songbird multinomial \
    --input-biom data/redsea/redsea.biom \
    --metadata-file data/redsea/redsea_metadata.txt \
    --formula "1" \
    --epochs 1000 \
    --differential-prior 0.5 \
    --summary-interval 1 \
    --summary-dir logdir/model1
$ songbird multinomial \
    --input-biom data/redsea/redsea.biom \
    --metadata-file data/redsea/redsea_metadata.txt \
    --formula "1" \
    --epochs 1000 \
    --differential-prior 0.5 \
    --summary-interval 1 \
    --summary-dir logdir/model2

I get something like the following, where the cv-error converges to different values and the training error is updated differently:

Screen Shot 2019-12-16 at 2 20 06 PM

Further exploration:

So I provided an option to fix the seed for making train/test splits:

$ songbird multinomial \
    --input-biom data/redsea/redsea.biom \
    --metadata-file data/redsea/redsea_metadata.txt \
    --formula "1" \
    --epochs 1000 \
    --differential-prior 0.5 \
    --summary-interval 1 \
    --summary-dir logdir/model3 \
    --seed 42
$ songbird multinomial \
    --input-biom data/redsea/redsea.biom \
    --metadata-file data/redsea/redsea_metadata.txt \
    --formula "1" \
    --epochs 1000 \
    --differential-prior 0.5 \
    --summary-interval 1 \
    --summary-dir logdir/model4 \
    --seed 42

Screen Shot 2019-12-13 at 10 47 00 AM

And now the models converge to approximately the same value. However, there are still variations in their training updates, so I provided a way to fix the seed that TensorFlow uses as well.

$ songbird multinomial \
    --input-biom data/redsea/redsea.biom \
    --metadata-file data/redsea/redsea_metadata.txt \
    --formula "1" \
    --epochs 1000 \
    --differential-prior 0.5 \
    --summary-interval 1 \
    --summary-dir logdir/model5 \
    --split-seed 42 \
    --tf-seed 22
$ songbird multinomial \
    --input-biom data/redsea/redsea.biom \
    --metadata-file data/redsea/redsea_metadata.txt \
    --formula "1" \
    --epochs 1000 \
    --differential-prior 0.5 \
    --summary-interval 1 \
    --summary-dir logdir/model6 \
    --split-seed 42 \
    --tf-seed 22

And now it is identical between runs: Screen Shot 2019-12-13 at 11 12 50 AM
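
For illustration, here is a minimal sketch of the two independent sources of randomness explored above: the train/test split (`--split-seed`) and the parameter initialization (`--tf-seed`). It is not songbird's implementation. It uses only Python's standard `random` module as a stand-in for NumPy and TensorFlow, and the function names `split_samples` and `initial_weights` are hypothetical.

```python
import random


def split_samples(sample_ids, n_test, split_seed):
    """Hold out n_test samples; the same split_seed always picks the same ones."""
    rng = random.Random(split_seed)
    test_ids = set(rng.sample(sample_ids, n_test))
    train_ids = [s for s in sample_ids if s not in test_ids]
    return train_ids, sorted(test_ids)


def initial_weights(n_weights, init_seed):
    """Draw starting parameter values; stands in for the framework's initializer."""
    rng = random.Random(init_seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n_weights)]


sample_ids = [f"sample_{i}" for i in range(20)]  # hypothetical sample names

# Same split seed: both runs hold out the same samples.
_, test_a = split_samples(sample_ids, 5, split_seed=42)
_, test_b = split_samples(sample_ids, 5, split_seed=42)
print(test_a == test_b)

# Different initialization seeds: the runs start from different weights,
# so their training curves can differ even though the split is identical.
print(initial_weights(3, init_seed=22) == initial_weights(3, init_seed=23))

# Fixing the initialization seed as well makes the starting point repeatable.
print(initial_weights(3, init_seed=22) == initial_weights(3, init_seed=22))
```

Fixing only the split seed is enough to compare runs on the same held-out samples, while fixing both is what makes an entire run repeatable.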

Proposed behavior:

When the above commands are run, I should at least get the same CV score, but furthermore, it should be possible to fix the entire training process.

In order to fix this I propose the following solution:

Summary of changes:

gwarmstrong commented 4 years ago

Hi @gwarmstrong , this is awesome.

What do you think about only having 1 seed passed in? I'm assuming that two seeds are defined due to numpy / TF differences. But you can imagine the following should work:

def multinomial(..., seed):
    np.random.seed(seed)
    tf.set_random_seed(seed)

It would be better to specify only 1 seed, to minimize the number of options exposed to users.

Sure, I thought it might be nice to be able to adjust them independently. This could be helpful for, e.g., verifying that you get a somewhat consistent error minimum across different random initializations, but with the same train/test split.

Given that the loss here is convex with respect to the parameters, my idea is probably less necessary and we can change to a single seed.
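
As a rough sketch of the single-seed idea discussed here, the toy function below drives both the held-out sample selection and the starting weights from one seed. It is an illustration under stated assumptions, not the code in this PR: `run_once` and its parameters are hypothetical, and the standard `random` module stands in for NumPy and TensorFlow. Note that `tf.set_random_seed`, as written in the snippet above, is the TensorFlow 1.x spelling; TensorFlow 2.x exposes the same functionality as `tf.random.set_seed`.

```python
import random


def run_once(seed, n_samples=10, n_test=3, n_weights=4):
    """Toy stand-in for one training run in which a single seed fixes everything."""
    rng = random.Random(seed)  # one generator feeds every random choice below
    sample_ids = [f"sample_{i}" for i in range(n_samples)]  # hypothetical names
    test_ids = sorted(rng.sample(sample_ids, n_test))  # held-out samples
    init = [rng.gauss(0.0, 1.0) for _ in range(n_weights)]  # starting weights
    return test_ids, init


first = run_once(42)
second = run_once(42)
other = run_once(7)

# Same seed: identical split and identical starting weights.
print(first == second)

# Different seed: the starting weights differ.
print(first[1] == other[1])
```

The trade-off is the one raised in the reply: with a single seed there is one option fewer to document, but the split and the initialization can no longer be varied independently.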

gwarmstrong commented 4 years ago

@mortonjt this should reflect the changes you suggested now

gwarmstrong commented 4 years ago

@mortonjt just a polite bump on this PR! Also noting that this issue has subsequently come up with multiple people within the Knight Lab.

gwarmstrong commented 4 years ago

If you run this command on a non-zero seed and post your tensorboard results on the example dataset (i.e. redsea), I can double check on my machine to make sure those results are reproducible.

Does the third tensorboard output in the original description work? The command and tensorboard screenshot are included. Or do you need additional information?

mortonjt commented 4 years ago

Yes, the output in the original tensorboard would work, but the interface changed, right?

gwarmstrong commented 4 years ago

Ah right, that is when the seeds could be set separately. Sure, I will re-run and upload.

gwarmstrong commented 4 years ago

Try this on for size

$ songbird multinomial \
    --input-biom data/redsea/redsea.biom \
    --metadata-file data/redsea/redsea_metadata.txt \
    --formula "1" \
    --epochs 1000 \
    --differential-prior 0.5 \
    --summary-interval 1 \
    --summary-dir logdir/new_model \
    --random-seed 42

Screen Shot 2020-01-07 at 5 32 18 PM

mortonjt commented 4 years ago

I was able to run and reproduce the results

image

Once the last comment regarding the qiime2 seed is done and verified to work, I can merge this in.

lisa55asil commented 4 years ago

I think these changes are great and strongly agree that we should have a default seed set for the q2 plugin. As you mentioned, if people want to change or remove the seed they can use the standalone version, but it is critical that q2 produces reproducible results.

I'm going to update the readme to make it more clear: 1) how and why to use the null/baseline model 2) when to use the --p-training-column

Thanks for the awesome updates George!!

mortonjt commented 4 years ago

Thanks @lisa55asil . I'm going to merge this in. @gwarmstrong thanks!