apache / submarine

Submarine is Cloud Native Machine Learning Platform.
https://submarine.apache.org/
Apache License 2.0

SUBMARINE-1312. Fix submarine-sdk not connecting to the database #991

Closed cdmikechen closed 2 years ago

cdmikechen commented 2 years ago

What is this PR for?

The Istio proxy intercepts some of the traffic to the database, so submarine-sdk running inside a pod cannot connect to it. This PR removes the Istio sidecar from the database.
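As an illustration only: one common way to opt a workload out of Istio sidecar injection is to set the `sidecar.istio.io/inject: "false"` annotation on its pod template. Whether this PR uses exactly that mechanism is an assumption here, and the manifest below (including the name `submarine-database`) is a hypothetical, heavily trimmed stand-in for the real database Deployment. The sketch manipulates the manifest as a plain Python dict.

```python
import copy


def disable_istio_sidecar(deployment: dict) -> dict:
    """Return a copy of a Deployment manifest whose pods opt out of Istio injection."""
    patched = copy.deepcopy(deployment)
    template_meta = (
        patched.setdefault("spec", {})
        .setdefault("template", {})
        .setdefault("metadata", {})
    )
    annotations = template_meta.setdefault("annotations", {})
    # The annotation value must be the string "false", not a boolean.
    annotations["sidecar.istio.io/inject"] = "false"
    return patched


# Hypothetical, trimmed manifest; names and labels are assumptions for illustration.
database_deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "submarine-database"},
    "spec": {"template": {"metadata": {"labels": {"app": "submarine-database"}}}},
}

patched = disable_istio_sidecar(database_deployment)
print(patched["spec"]["template"]["metadata"]["annotations"])
```

The annotation goes on the pod template rather than on the Deployment's own metadata, because injection is decided per pod at admission time.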

What type of PR is it?

Bug Fix

Todos

What is the Jira issue?

https://issues.apache.org/jira/browse/SUBMARINE-1312

How should this be tested?

This PR can be tested with the quickstart image.

root@test:/opt# python train.py 
OpenBLAS WARNING - could not determine the L2 cache size on this system, assuming 256k
2022-08-27 07:22:54.266967: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-08-27 07:22:54.267033: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
WARNING:tensorflow:From train.py:61: _CollectiveAllReduceStrategyExperimental.__init__ (from tensorflow.python.distribute.collective_all_reduce_strategy) is deprecated and will be removed in a future version.
Instructions for updating:
use distribute.MultiWorkerMirroredStrategy instead
From train.py:61: _CollectiveAllReduceStrategyExperimental.__init__ (from tensorflow.python.distribute.collective_all_reduce_strategy) is deprecated and will be removed in a future version.
Instructions for updating:
use distribute.MultiWorkerMirroredStrategy instead
2022-08-27 07:22:56.427482: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2022-08-27 07:22:56.427512: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
2022-08-27 07:22:56.427560: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (test): /proc/driver/nvidia/version does not exist
2022-08-27 07:22:56.428097: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
INFO:tensorflow:Single-worker MultiWorkerMirroredStrategy with local_devices = ('/device:CPU:0',), communication = CommunicationImplementation.AUTO
Single-worker MultiWorkerMirroredStrategy with local_devices = ('/device:CPU:0',), communication = CommunicationImplementation.AUTO
Generating dataset mnist (~/tensorflow_datasets/mnist/3.0.1)
Downloading and preparing dataset Unknown size (download: Unknown size, generated: Unknown size, total: Unknown size) to ~/tensorflow_datasets/mnist/3.0.1...
Dl Completed...: 0 url [00:00, ? url/s]          Downloading https://storage.googleapis.com/cvdf-datasets/mnist/t10k-images-idx3-ubyte.gz into /root/tensorflow_datasets/downloads/cvdf-datasets_mnist_t10k-images-idx3-ubytedDnaEPiC58ZczHNOp6ks9L4_JLids_rpvUj38kJNGMc.gz.tmp.b8ba5e1c295746a7947775f54e76fe5b...
Dl Completed...:   0%|                           Downloading https://storage.googleapis.com/cvdf-datasets/mnist/t10k-labels-idx1-ubyte.gz into /root/tensorflow_datasets/downloads/cvdf-datasets_mnist_t10k-labels-idx1-ubyte4Mqf5UL1fRrpd5pIeeAh8c8ZzsY2gbIPBuKwiyfSD_I.gz.tmp.0d6323e9d8684664b864a0d8e0d53cbc...
Dl Completed...:   0%|                           Downloading https://storage.googleapis.com/cvdf-datasets/mnist/train-images-idx3-ubyte.gz into /root/tensorflow_datasets/downloads/cvdf-datasets_mnist_train-images-idx3-ubyteJAsxAi0QnOBEygBw_XW2X7zp-LBZAIqqYSHN8ru4ZO4.gz.tmp.63b70936c91b4e1da998240c6c546a8e...
Dl Completed...:   0%|                           Downloading https://storage.googleapis.com/cvdf-datasets/mnist/train-labels-idx1-ubyte.gz into /root/tensorflow_datasets/downloads/cvdf-datasets_mnist_train-labels-idx1-ubytedcDWkl3FO9T-WMEH1f1Xt51eIRmePRIMAk6X147Qw8w.gz.tmp.98a1eec4eaa249f9a15a57f584149cdc...
Extraction completed...: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:05<00:00,  1.38s/ file]
Dl Size...: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:05<00:00,  1.82 MiB/s]
Dl Completed...: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:05<00:00,  1.38s/ url]
Generating splits...:   0%|                                                                                                                                                                                                                           | 0/2 [00:00<?, ? splits/sDone writing ~/tensorflow_datasets/mnist/3.0.1.incompleteMGWHS0/mnist-train.tfrecord*. Number of examples: 60000 (shards: [60000])                                                                                                                                                
Generating splits...:  50%|█████████████████████████████████████████████████████████████████████████████████████████████████████████▌                                                                                                         | 1/2 [00:15<00:15, 15.13s/ splitsDone writing ~/tensorflow_datasets/mnist/3.0.1.incompleteMGWHS0/mnist-test.tfrecord*. Number of examples: 10000 (shards: [10000])                                                                                                                                                 
Dataset mnist downloaded and prepared to ~/tensorflow_datasets/mnist/3.0.1. Subsequent calls will reuse this data.                                                                                                                                                               
Constructing tf.data.Dataset mnist for split None, from ~/tensorflow_datasets/mnist/3.0.1
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv2d (Conv2D)              (None, 26, 26, 32)        320       
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 13, 13, 32)        0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 11, 11, 64)        18496     
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 5, 5, 64)          0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 3, 3, 64)          36928     
_________________________________________________________________
flatten (Flatten)            (None, 576)               0         
_________________________________________________________________
dense (Dense)                (None, 64)                36928     
_________________________________________________________________
dense_1 (Dense)              (None, 10)                650       
=================================================================
Total params: 93,322
Trainable params: 93,322
Non-trainable params: 0
_________________________________________________________________
2022-08-27 07:23:20.046835: W tensorflow/core/framework/dataset.cc:679] Input of GeneratorDatasetOp::Dataset will not be optimized because the dataset does not implement the AsGraphDefInternal() method needed to apply optimizations.
2022-08-27 07:23:20.048953: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
Epoch 1/10
2022-08-27 07:23:30.542246: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:175] Filling up shuffle buffer (this may take a while): 7111 of 10000
2022-08-27 07:23:34.450357: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:228] Shuffle buffer filled.
70/70 [==============================] - 16s 20ms/step - loss: 1.8979 - accuracy: 0.3429 
{'loss': 1.8978824615478516, 'accuracy': 0.34285715222358704}
^CTraceback (most recent call last):
  File "train.py", line 88, in <module>
    main()
  File "train.py", line 84, in main
    multi_worker_model.fit(ds_train, epochs=10, steps_per_epoch=70, callbacks=[MyCallback()])
  File "/usr/local/lib/python3.7/site-packages/keras/engine/training.py", line 1230, in fit
    callbacks.on_epoch_end(epoch, epoch_logs)
  File "/usr/local/lib/python3.7/site-packages/keras/callbacks.py", line 413, in on_epoch_end
    callback.on_epoch_end(epoch, logs)
  File "train.py", line 81, in on_epoch_end
    submarine.log_metric("loss", logs["loss"], epoch)
  File "/usr/local/lib/python3.7/site-packages/submarine/tracking/fluent.py", line 54, in log_metric
    SubmarineClient().log_metric(job_id, key, value, worker_index, datetime.now(), step or 0)
  File "/usr/local/lib/python3.7/site-packages/submarine/tracking/client.py", line 58, in __init__
    self.store = utils.get_tracking_sqlalchemy_store(self.db_uri)
  File "/usr/local/lib/python3.7/site-packages/submarine/tracking/utils.py", line 93, in get_tracking_sqlalchemy_store
    return SqlAlchemyStore(store_uri)
  File "/usr/local/lib/python3.7/site-packages/submarine/store/tracking/sqlalchemy_store.py", line 58, in __init__
    insp = sqlalchemy.inspect(self.engine)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/inspection.py", line 64, in inspect
    ret = reg(subject)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/reflection.py", line 182, in _engine_insp
    return Inspector._construct(Inspector._init_engine, bind)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/reflection.py", line 117, in _construct
    init(self, bind)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/reflection.py", line 128, in _init_engine
    engine.connect().close()
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 3315, in connect
    return self._connection_cls(self, close_with_result=close_with_result)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 96, in __init__
    else engine.raw_connection()
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 3394, in raw_connection
    return self._wrap_pool_connect(self.pool.connect, _connection)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 3361, in _wrap_pool_connect
    return fn()
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/pool/base.py", line 310, in connect
    return _ConnectionFairy._checkout(self)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/pool/base.py", line 868, in _checkout
    fairy = _ConnectionRecord.checkout(pool)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/pool/base.py", line 476, in checkout
    rec = pool._do_get()
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/pool/impl.py", line 146, in _do_get
    self._dec_overflow()
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/util/langhelpers.py", line 72, in __exit__
    with_traceback=exc_tb,
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 208, in raise_
    raise exception
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/pool/impl.py", line 143, in _do_get
    return self._create_connection()
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/pool/base.py", line 256, in _create_connection
    return _ConnectionRecord(self)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/pool/base.py", line 371, in __init__
    self.__connect()
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/pool/base.py", line 661, in __connect
    self.dbapi_connection = connection = pool._invoke_creator(self)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/create.py", line 578, in connect
    return dialect.connect(*cargs, **cparams)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 597, in connect
    return self.dbapi.connect(*cargs, **cparams)
  File "/usr/local/lib/python3.7/site-packages/pymysql/connections.py", line 353, in __init__
    self.connect()
  File "/usr/local/lib/python3.7/site-packages/pymysql/connections.py", line 632, in connect
    self._get_server_information()
  File "/usr/local/lib/python3.7/site-packages/pymysql/connections.py", line 1055, in _get_server_information
    packet = self._read_packet()
  File "/usr/local/lib/python3.7/site-packages/pymysql/connections.py", line 692, in _read_packet
    packet_header = self._read_bytes(4)
  File "/usr/local/lib/python3.7/site-packages/pymysql/connections.py", line 732, in _read_bytes
    data = self._rfile.read(num_bytes)
  File "/usr/local/lib/python3.7/socket.py", line 589, in readinto
    return self._sock.recv_into(b)
KeyboardInterrupt

I started a quickstart pod on its own and ran train.py. When it reached the point of writing a metric, it hung while connecting to the database (the traceback above is from interrupting it), which points to the database's istio-proxy sidecar as the main problem.
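The traceback is blocked in pymysql's `_read_bytes(4)`, i.e. the TCP connection was established but the first packet header from the server never arrived. MySQL is a server-speaks-first protocol, so a hang of this kind can be checked without the SDK. The following is a minimal diagnostic sketch using only the standard library; the function name, the status strings, and the timeout are assumptions made for illustration and are not part of submarine-sdk.

```python
import socket


def probe_server_greeting(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify how a server-speaks-first endpoint (such as MySQL) responds.

    Returns one of: "unreachable", "silent", "closed", "greeting".
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # The TCP connect itself failed (refused, DNS failure, connect timeout).
        return "unreachable"
    with sock:
        sock.settimeout(timeout)
        try:
            # A MySQL packet begins with a 4-byte header; wait for its first bytes.
            header = sock.recv(4)
        except socket.timeout:
            # Connected, but the server never sent anything: the symptom in the log.
            return "silent"
        except OSError:
            return "closed"
    # An empty read means the peer closed the connection without sending data.
    return "greeting" if header else "closed"
```

Run from inside the quickstart pod against the database service (for example `probe_server_greeting("submarine-database", 3306)`, where the host name is an assumption), a result of `"silent"` would match the hang shown above, while `"greeting"` would indicate that the server's handshake is getting through.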

Screenshots (if appropriate)

No

Questions:

codecov[bot] commented 2 years ago

Codecov Report

Merging #991 (82ec839) into master (3fc12ad) will decrease coverage by 0.05%. The diff coverage is n/a.

@@             Coverage Diff              @@
##             master     #991      +/-   ##
============================================
- Coverage     14.26%   14.21%   -0.06%     
+ Complexity      991      987       -4     
============================================
  Files           241      241              
  Lines         23897    23897              
  Branches       3473     3473              
============================================
- Hits           3410     3396      -14     
- Misses        20280    20296      +16     
+ Partials        207      205       -2     
| Impacted Files | Coverage Δ |
|---|---|
| ...ine/server/workbench/websocket/NotebookSocket.java | 61.90% <0.00%> (-19.05%) :arrow_down: |
| .../server/workbench/websocket/ConnectionManager.java | 39.02% <0.00%> (-14.64%) :arrow_down: |
| ...ine/server/workbench/websocket/NotebookServer.java | 44.00% <0.00%> (-8.00%) :arrow_down: |
