biocore / mmvec

Neural networks for microbe-metabolite interaction analysis
BSD 3-Clause "New" or "Revised" License

Reproducing the example data (cystic fibrosis) #114

Closed mstrazar closed 4 years ago

mstrazar commented 4 years ago

Hi,

I am trying to reproduce Fig. 4a in the paper. I have run the example script as stated in the README and plotted the results of the biplot. The procedure was repeated twice.

I then read the output file latent_dim_3_input_prior_1.00_output_prior_1.00_beta1_0.90_beta2_0.95_ordination.txt and constructed a matrix U for the microbes and V for the metabolites. Both matrices were centered and scaled by column (i.e. z-scored), following your notebooks. The code is [here](https://github.com/biocore/mmvec/files/3962886/biplot.r.zip).
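
For reference, a minimal Python sketch of the column-wise z-scoring described above (this assumes the ordination file can be read as a scikit-bio OrdinationResults and that its features/samples attributes hold the microbe/metabolite coordinates; the actual analysis used the linked R script):

```python
from skbio import OrdinationResults

# Load the mmvec ordination output (assumed to be in scikit-bio's
# OrdinationResults text format).
res = OrdinationResults.read(
    "latent_dim_3_input_prior_1.00_output_prior_1.00_beta1_0.90_beta2_0.95_ordination.txt")

U = res.features.values  # one modality's coordinates (assumed: microbes)
V = res.samples.values   # the other modality's coordinates (assumed: metabolites)

def zscore_columns(X):
    """Center and scale each column to zero mean and unit standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

U_z, V_z = zscore_columns(U), zscore_columns(V)
```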

The figures and the first few rows of V are shown below. They differ between the two runs; in the first figure, some of NHQ and NQNO appear together, while in the second they do not.

Have you implemented a mechanism to monitor the progress of the learning algorithm (e.g. learning curves)? How can I tell whether learning succeeded?

Your feedback is kindly appreciated. I am not trying to undermine the software; my interest is in making sure the results are coherent, as they may drive subsequent steps that are expensive in terms of both money and time.

Kind regards, Martin

biplot_1

X290.0883mz60.1277 -0.15  0.08 -0.08
X291.0489mz61.9903 -0.07  0.32 -0.22
X254.1601mz62.7917 -0.25 -0.35  1.92
X265.1154mz65.0188 -0.15 -0.42  1.87
X118.0839mz71.5519  1.29 -0.62 -1.26
X127.0605mz78.9317 -0.08  0.50 -0.54

biplot_2

X290.0883mz60.1277  0.21 -0.10  0.11
X291.0489mz61.9903 -0.07 -0.09 -1.44
X254.1601mz62.7917  0.18 -0.01 -0.38
X265.1154mz65.0188 -0.95 -0.04 -0.85
X118.0839mz71.5519  0.53  0.04  0.47
X127.0605mz78.9317 -0.87 -1.20  0.53
mortonjt commented 4 years ago

Good to see that you are digging into the examples.

What learning rate did you use, and what was the command that you ran? Did you use qiime or the standalone CLI?

If you ran the standalone CLI, you can monitor the learning curves with Tensorboard (see FAQs for more information).

It's a little odd that you got different results with the CF example. Chances are the model didn't reach convergence and needs to be run longer. I suggest running the standalone version. The parameters are essentially the same, but you would be running the mmvec command -- we suggest this because Tensorboard does not play nicely with qiime.

Another note is that we have binder tutorials here with the exact environment and scripts used to reproduce these results. That can also help with debugging.

mstrazar commented 4 years ago

I ran the command as stated in the current README. Tensorboard seems to find the right files, but there are some problems parsing them, or something else is off. I am also getting a lot of deprecation warnings.

Could you try to reproduce this by installing the package in a clean virtual environment?

Screen Shot 2019-12-16 at 10 49 11 AM

(py3mm) wmcce-fc7:mmvec mstrazar$ mmvec paired-omics --microbe-file examples/cf/otus_nt.biom --metabolite-file examples/cf/lcms_nt.biom
/Users/mstrazar/Dev/py3mm/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:516: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/Users/mstrazar/Dev/py3mm/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:517: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/Users/mstrazar/Dev/py3mm/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:518: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/Users/mstrazar/Dev/py3mm/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:519: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/Users/mstrazar/Dev/py3mm/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:520: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/Users/mstrazar/Dev/py3mm/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:525: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
/Users/mstrazar/Dev/py3mm/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:541: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/Users/mstrazar/Dev/py3mm/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:542: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/Users/mstrazar/Dev/py3mm/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:543: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/Users/mstrazar/Dev/py3mm/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:544: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/Users/mstrazar/Dev/py3mm/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:545: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/Users/mstrazar/Dev/py3mm/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:550: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
/Users/mstrazar/Dev/py3mm/lib/python3.6/site-packages/biom/table.py:4049: FutureWarning: SparseSeries is deprecated and will be removed in a future version.
Use a Series with sparse values instead.

    >>> series = pd.Series(pd.SparseArray(...))

See http://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#migrating for more.

  for r in self.matrix_data.tocsr()]
/Users/mstrazar/Dev/py3mm/lib/python3.6/site-packages/biom/table.py:4052: FutureWarning: SparseDataFrame is deprecated and will be removed in a future version.
Use a regular DataFrame whose columns are SparseArrays instead.

See http://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#migrating for more.

  return constructor(mat, index=index, columns=columns)
/Users/mstrazar/Dev/py3mm/lib/python3.6/site-packages/pandas/core/sparse/frame.py:854: FutureWarning: SparseDataFrame is deprecated and will be removed in a future version.
Use a regular DataFrame whose columns are SparseArrays instead.

See http://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#migrating for more.

  default_kind=self._default_kind,
/Users/mstrazar/Dev/py3mm/lib/python3.6/site-packages/pandas/core/frame.py:3471: FutureWarning: SparseSeries is deprecated and will be removed in a future version.
Use a Series with sparse values instead.

    >>> series = pd.Series(pd.SparseArray(...))

See http://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#migrating for more.

  return klass(values, index=self.index, name=items, fastpath=True)
/Users/mstrazar/Dev/py3mm/lib/python3.6/site-packages/pandas/core/sparse/frame.py:785: FutureWarning: SparseDataFrame is deprecated and will be removed in a future version.
Use a regular DataFrame whose columns are SparseArrays instead.

See http://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#migrating for more.

  return self._constructor(new_arrays, index=index, columns=columns).__finalize__(
/Users/mstrazar/Dev/py3mm/lib/python3.6/site-packages/pandas/core/ops/__init__.py:1641: FutureWarning: SparseSeries is deprecated and will be removed in a future version.
Use a Series with sparse values instead.

    >>> series = pd.Series(pd.SparseArray(...))

See http://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#migrating for more.

  return self._constructor(new_values, index=self.index, name=self.name)
/Users/mstrazar/Dev/py3mm/lib/python3.6/site-packages/pandas/core/sparse/frame.py:339: FutureWarning: SparseDataFrame is deprecated and will be removed in a future version.
Use a regular DataFrame whose columns are SparseArrays instead.

See http://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#migrating for more.

  default_fill_value=self.default_fill_value,
/Users/mstrazar/Dev/py3mm/lib/python3.6/site-packages/pandas/core/generic.py:6289: FutureWarning: SparseDataFrame is deprecated and will be removed in a future version.
Use a regular DataFrame whose columns are SparseArrays instead.

See http://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#migrating for more.

  return self._constructor(new_data).__finalize__(self)
/Users/mstrazar/Dev/py3mm/lib/python3.6/site-packages/pandas/core/generic.py:5884: FutureWarning: SparseSeries is deprecated and will be removed in a future version.
Use a Series with sparse values instead.

    >>> series = pd.Series(pd.SparseArray(...))

See http://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#migrating for more.

  return self._constructor(new_data).__finalize__(self)
/Users/mstrazar/Dev/py3mm/lib/python3.6/site-packages/pandas/core/generic.py:3606: FutureWarning: SparseDataFrame is deprecated and will be removed in a future version.
Use a regular DataFrame whose columns are SparseArrays instead.

See http://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#migrating for more.

  result = self._constructor(new_data).__finalize__(self)
WARNING:tensorflow:From /Users/mstrazar/Dev/lib/mmvec/scripts/mmvec:156: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

WARNING:tensorflow:From /Users/mstrazar/Dev/lib/mmvec/scripts/mmvec:157: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

2019-12-16 10:40:16.154352: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
WARNING:tensorflow:From /Users/mstrazar/Dev/lib/mmvec/mmvec/multimodal.py:90: The name tf.log is deprecated. Please use tf.math.log instead.

WARNING:tensorflow:From /Users/mstrazar/Dev/lib/mmvec/mmvec/multimodal.py:91: multinomial (from tensorflow.python.ops.random_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.random.categorical` instead.
WARNING:tensorflow:From /Users/mstrazar/Dev/lib/mmvec/mmvec/multimodal.py:102: The name tf.random_normal is deprecated. Please use tf.random.normal instead.

WARNING:tensorflow:From /Users/mstrazar/Dev/lib/mmvec/mmvec/multimodal.py:118: Normal.__init__ (from tensorflow.python.ops.distributions.normal) is deprecated and will be removed after 2019-01-01.
Instructions for updating:
The TensorFlow Distributions library has moved to TensorFlow Probability (https://github.com/tensorflow/probability). You should update all references to use `tfp.distributions` instead of `tf.distributions`.
WARNING:tensorflow:From /Users/mstrazar/Dev/py3mm/lib/python3.6/site-packages/tensorflow/python/ops/distributions/normal.py:160: Distribution.__init__ (from tensorflow.python.ops.distributions.distribution) is deprecated and will be removed after 2019-01-01.
Instructions for updating:
The TensorFlow Distributions library has moved to TensorFlow Probability (https://github.com/tensorflow/probability). You should update all references to use `tfp.distributions` instead of `tf.distributions`.
WARNING:tensorflow:From /Users/mstrazar/Dev/lib/mmvec/mmvec/multimodal.py:135: Multinomial.__init__ (from tensorflow.python.ops.distributions.multinomial) is deprecated and will be removed after 2019-01-01.
Instructions for updating:
The TensorFlow Distributions library has moved to TensorFlow Probability (https://github.com/tensorflow/probability). You should update all references to use `tfp.distributions` instead of `tf.distributions`.
WARNING:tensorflow:From /Users/mstrazar/Dev/lib/mmvec/mmvec/multimodal.py:183: The name tf.summary.scalar is deprecated. Please use tf.compat.v1.summary.scalar instead.

WARNING:tensorflow:From /Users/mstrazar/Dev/lib/mmvec/mmvec/multimodal.py:185: The name tf.summary.histogram is deprecated. Please use tf.compat.v1.summary.histogram instead.

WARNING:tensorflow:From /Users/mstrazar/Dev/lib/mmvec/mmvec/multimodal.py:189: The name tf.summary.merge_all is deprecated. Please use tf.compat.v1.summary.merge_all instead.

WARNING:tensorflow:From /Users/mstrazar/Dev/lib/mmvec/mmvec/multimodal.py:191: The name tf.summary.FileWriter is deprecated. Please use tf.compat.v1.summary.FileWriter instead.

WARNING:tensorflow:From /Users/mstrazar/Dev/lib/mmvec/mmvec/multimodal.py:196: The name tf.train.AdamOptimizer is deprecated. Please use tf.compat.v1.train.AdamOptimizer instead.

WARNING:tensorflow:From /Users/mstrazar/Dev/py3mm/lib/python3.6/site-packages/tensorflow/python/ops/clip_ops.py:286: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
WARNING:tensorflow:From /Users/mstrazar/Dev/lib/mmvec/mmvec/multimodal.py:206: The name tf.global_variables_initializer is deprecated. Please use tf.compat.v1.global_variables_initializer instead.

WARNING:tensorflow:From /Users/mstrazar/Dev/lib/mmvec/mmvec/multimodal.py:243: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.

100%|███████████████████████████████████████████████████████████████████| 1222/1222 [00:01<00:00, 913.14it/s]
mortonjt commented 4 years ago

Your run finished in 1 second - that is not enough time to write any summaries.

I'd set summary-interval to 1, then bump up your epochs (e.g. to 1000). The deprecation warnings are expected and should be fine as long as the correct versions are installed.

If you have actual errors, feel free to post the exact command that you used.

Also, regarding the previous plots: those coordinates should not be z-score transformed. If you want to dig into exactly how the figures are generated, that code can be reproduced here: https://github.com/knightlab-analyses/multiomic-cooccurrences

Otherwise, the emperor plots are recommended.
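
(For anyone following along, a minimal sketch of plotting the raw ordination coordinates without any z-scoring, assuming the output is a scikit-bio OrdinationResults file; this is not the plotting code from the multiomic-cooccurrences repository.)

```python
import matplotlib.pyplot as plt
from skbio import OrdinationResults

# Hypothetical path to the ordination file produced by mmvec.
res = OrdinationResults.read("ordination.txt")

# Scatter the first two axes of both modalities, using the coordinates as-is.
fig, ax = plt.subplots()
ax.scatter(res.samples.iloc[:, 0], res.samples.iloc[:, 1],
           s=10, label="samples axis of the biplot")
ax.scatter(res.features.iloc[:, 0], res.features.iloc[:, 1],
           s=10, marker="x", label="features axis of the biplot")
ax.set_xlabel("axis 1")
ax.set_ylabel("axis 2")
ax.legend()
plt.show()
```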

mstrazar commented 4 years ago

Hi Jamie, thank you for your time and prompt responses.

I found the run script in the multiomic-cooccurrences repository with modified optimization parameters. It would make sense to point this out in the instructions. I had to use 6000 epochs (instead of the suggested 1000) to reach what looks like convergence.

Still, a log-loss on the order of 1e9 seems huge, doesn't it? What kind of numbers did you originally get?

I managed to reproduce the plot when I pull the file results/cf_output/omics-biplot.results. However, my ordination coordinates appear way off, likely as a consequence of convergence issues - strange local minima.

tensorboard_log

The published results make sense without scaling:

biplot

What I get: biplot_unscaled

mortonjt commented 4 years ago

@mstrazar, ok. Can you provide the exact command that you ran? It's not clear to me what learning rate or batch size you used.

Rather than trying to plot it yourself, have you tried using Emperor as suggested in the tutorial? That was designed to help with some of the scaling issues that you are observing.

The log-loss looks about right (it's summing over quite a few microbe-metabolite pairs, so it is expected to be large). Local minima shouldn't be a major problem (there's only one layer; it's a biconvex optimization).
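
(A back-of-envelope illustration of why the absolute value is large; the numbers below are assumed for illustration, not taken from the CF dataset.)

```python
# A multinomial log-likelihood sums a -log(probability) term for every
# observed count, so its magnitude scales with the total number of counts.
total_counts = 5e8          # assumed total metabolite counts across all samples
avg_nats_per_count = 3.0    # assumed average -log p per observed count
print(f"{total_counts * avg_nats_per_count:.1e}")  # ~1.5e+09, same order as the reported loss
```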

mstrazar commented 4 years ago

Hi,

This is the command. It is exactly as provided in https://github.com/knightlab-analyses/multiomic-cooccurrences, except that the number of epochs was increased to 6000.

# Parameters that you will need to tune, sorted by priority
microbe_file=../examples/cf/otus_nt.biom
metabolite_file=../examples/cf/lcms_nt.biom
latent_dim=3
learning_rate=1e-5
epochs=6000
outprior=1
inprior=1
beta1=0.85
beta2=0.90
batch_size=1000

RESULTS_DIR=summary/
mmvec paired-omics \
      --microbe-file $microbe_file \
      --metabolite-file $metabolite_file \
      --min-feature-count 10 \
      --num-testing-examples 10 \
      --learning-rate $learning_rate \
      --beta1 $beta1 \
      --beta2 $beta2 \
      --summary-dir $RESULTS_DIR \
      --epochs $epochs \
      --summary-interval 10 \
      --checkpoint-interval 3600 \
      --batch-size $batch_size \
      --latent-dim $latent_dim \
      --input-prior $inprior \
      --output-prior $outprior
      # --top-k 10 \
      # --threads 1 \
mortonjt commented 4 years ago

Just to make sure things are running as expected, I reran the script (bumped down the batch size to 100 in order to get more iterations in). The results are reproduced at the end.

I'm guessing that the discrepancies that you are observing are due to the preprocessing of the SVD. You can get different results depending on how you do the SVD (i.e. you could multiply your U matrix by your eigenvalues, or multiply your V matrix by your eigenvalues). We actually have a tutorial on how to do this yourself in python here.
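
(To illustrate the point about SVD conventions, here is a minimal numpy sketch, not mmvec's actual code: all three choices below reconstruct the same low-rank matrix, but the biplot coordinates they produce look different.)

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(50, 40))   # stand-in for the matrix being factorized
U, s, Vt = np.linalg.svd(M, full_matrices=False)
k = 3                           # latent dimension

# Singular values absorbed into the rows, into the columns, or split symmetrically.
rows_a, cols_a = U[:, :k] * s[:k],            Vt[:k].T
rows_b, cols_b = U[:, :k],                    Vt[:k].T * s[:k]
rows_c, cols_c = U[:, :k] * np.sqrt(s[:k]),   Vt[:k].T * np.sqrt(s[:k])

# Identical rank-k reconstructions, different coordinate scalings.
assert np.allclose(rows_a @ cols_a.T, rows_b @ cols_b.T)
assert np.allclose(rows_b @ cols_b.T, rows_c @ cols_c.T)
```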

But before you dive into manually computing biplot coordinates - I'd first try running the out-of-the-box tools to get a sane visualization of the biplot through emperor. Will be happy to elaborate once that is done.

Tensorboard results

Screenshot 2019-12-17 15 08 14

The emperor biplots with taxa / molecules are highlighted below

Screenshot 2019-12-17 15 06 53 Screenshot 2019-12-17 15 06 41
mortonjt commented 4 years ago

Before I forget - if there are any major gaping holes in the README / documentation, we warmly welcome contributions. Almost all of the user-facing documentation has been created by users learning how to use mmvec.

mstrazar commented 4 years ago

Hi, we did not make the 3D plots; we would just like to be able to reproduce, step by step, what was published before proceeding with analyses of our own data.

Our plot is fine, since it exactly reproduces the published one given the correct inputs (U and V). The problem should not be in the SVD, because it is a deterministic method. We are not changing anything anywhere, just plotting the "ordination results" that come from the pipeline.

Given that the work was published in a very high-profile venue and will in all likelihood be considered 'state-of-the-art', things should work out of the box. The fact that the instructions were written by 'users learning' suggests that reproducibility might not have been a top priority.

Kind regards, Martin Stražar

mortonjt commented 4 years ago

Ok, cool. Then consider this issue closed. Feel free to reopen if this has not addressed your concerns.

mstrazar commented 4 years ago

Hi Jamie,

first of all, I would like to apologize for the harsh comments. They were inappropriate and to nobody's benefit.

The ordination results in the paper have undergone scaling, as I noticed by computing the norms. This step is skipped in the python code that comes with mmvec, and U is indeed multiplied by the singular values, as you pointed out. I now get sensible results by scaling the SVD output; I used the square root of the singular values. Here is a pull request that implements this scaling and will hopefully make usage of mmvec more straightforward.

https://github.com/biocore/mmvec/pull/115
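
(A rough sketch of the kind of square-root scaling described above, assuming the ordination file is a scikit-bio OrdinationResults and its eigvals field holds the singular values; this is not the actual code in the pull request.)

```python
import numpy as np
from skbio import OrdinationResults

res = OrdinationResults.read("ordination.txt")  # hypothetical path

# Distribute the singular values symmetrically over both coordinate sets.
scale = np.sqrt(res.eigvals.values)
U_scaled = res.features.values * scale  # assumed: microbe coordinates
V_scaled = res.samples.values * scale   # assumed: metabolite coordinates
```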

I genuinely believe the modelling assumptions behind conditional probabilities are reasonable and make much more sense than what we had so far (correlation approaches, etc.). If only we were able to make deep learning models more consistent between runs, it would be awesome.

Kind regards, Martin

biplot_replicate_1

biplot_replicate_2