Closed vdet closed 3 years ago
Hi Vincent,
your pipeline is almost correct, but the process step is absolutely necessary. It ensures that the same genes are used, but more importantly it applies normalization and scaling to the data, which is crucial for successful machine learning.
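A minimal sketch of where `process` sits in the pipeline (file names here are placeholders, and the exact flags should be checked against `scaden --help`):

```shell
# Simulate training data from your scRNA-seq reference,
# then let `process` align genes with the bulk data and normalize/scale.
scaden simulate --data scRNA_dir/            # writes a training-data file (e.g. data.h5ad)
scaden process data.h5ad bulk_data.txt       # gene matching + normalization/scaling
```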
I would also advise opting for 5000 training steps, which is the value I found works best for most scenarios.
You should not log-transform the count data, as Scaden applies some processing to it internally as well, making sure it is on the same scale as the training data. That you get the same results is interesting, but might be caused by what I just said (the data will have different values but follow the same distribution, leading to the same predictions in the end).
The reason why it didn't work for your training run is that you (accidentally) loaded the trained models before. If you run Scaden in the same directory as before, it will look for pre-trained models and load them (maybe I should turn that behavior off, I'm now thinking?).
It tells you here:

```
Model parameters restored successfully
```
So best specify your model directory with `--model_dir my_model` during training and prediction. Then it should work!
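Concretely, that could look like the following sketch (`processed.h5ad` and `bulk_data.txt` are placeholder file names):

```shell
# Train into a fresh model directory so no stale model gets restored,
# then point prediction at that same directory.
scaden train processed.h5ad --steps 5000 --model_dir my_model
scaden predict --model_dir my_model bulk_data.txt
```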
(Your old model was trained on data of a different shape, by the way, causing the problem).
I have been working on a helper function for Scaden to make the training process clearer, but it's currently only available in the development version. So if you want an end-to-end pipeline, you can clone the repo, check out the development branch, cd into the directory and install with `pip install scaden`.
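As a sketch, the steps above would be (the repository URL is an assumption; `pip install .` installs the local checkout):

```shell
git clone https://github.com/KevinMenden/scaden.git  # assumed repo URL
cd scaden
git checkout development   # switch to the development branch
pip install .              # install the checked-out development code
```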
Then you can run `scaden example`, which will generate three small example files in your directory.
You can use them for a complete simulate → process → train → predict workflow! But I'm quite optimistic that it will work with your data as well :-)
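Putting the whole workflow together with the generated example files might look like this (the example file names and flags are assumptions; check what `scaden example` actually writes and `scaden --help` for the exact options):

```shell
scaden example                                       # generate small example files here
scaden simulate --data .                             # simulate training data from them
scaden process data.h5ad example_bulk_data.txt       # align genes, normalize and scale
scaden train processed.h5ad --steps 5000 --model_dir example_model
scaden predict --model_dir example_model example_bulk_data.txt
```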
Don't hesitate to come back with questions in case you still encounter issues!
Cheers, Kevin
Thanks so much Kevin. I now get results that start to look better. Vincent
Hi Kevin,
I am confused about 'simulate' and 'process'. Actually, a simple end-to-end example, starting from count matrices up to prediction, with examples of all input/intermediate/output files, would help a lot.
Practically, my inputs are
Is this the correct pipeline?
This runs without error messages, but the predictions are grossly wrong, so I wondered if I missed something. The single-cell and bulk data come from the same piece of tissue, so I am 100% sure the single cells match the cells within the bulks.
I was surprised that replacing 'bulk.txt' with its log2-transformed and [0,1]-scaled version yields the exact same prediction. Is that expected?
I tried to run 'process':
It completed, but then 'train' generated an error (see below). Is 'process' actually needed after 'simulate' if my single-cell and bulk matrices have the same genes in the same order?
Another question: my bulks are actually mini-bulks with 10-100 cells and very low coverage. What parameters would you advise for 'simulate'?
Again, thank you so much for your help.
Vincent