clj-python / libpython-clj

Python bindings for Clojure
Eclipse Public License 2.0

Weird behavior if using Pytorch Tensors with map #125

Closed Goldritter closed 3 years ago

Goldritter commented 3 years ago

Hi,

I am trying to use PyTorch and Hugging Face Transformers to create BERT word embeddings. Here is the code of the namespace that creates the embeddings:

(ns python.bert.test
  (:require
    [clojure.core.memoize :as MEMO]
    [libpython-clj.require :refer [require-python]]
    [libpython-clj.python :as py :refer [py. py.. py.- run-simple-string
                                         as-python as-jvm
                                         ->python ->jvm
                                         get-attr call-attr call-attr-kw
                                         get-item att-type-map
                                         call call-kw initialize!
                                         as-numpy as-tensor ->numpy
                                         run-simple-string
                                         add-module module-dict
                                         import-module
                                         python-type]
     ]))

(require-python '[torch :as t])
(require-python '[builtins])
(py/from-import transformers BertTokenizer BertModel)

(def german-bert-path "./resources/nlp/german_bert/")

(defn get-tokenizer-from [^String path]
  (py. BertTokenizer from_pretrained path))

(defn get-bert-model [^String path]
  (py.
    (py. BertModel from_pretrained path :output_hidden_states true)
    eval))

(def time-to-live (* 1000 60 10))
(def memonized-get-tokenizer-from (MEMO/ttl get-tokenizer-from :ttl/threshold time-to-live))
(def memonized-get-bert-model (MEMO/ttl get-bert-model :ttl/threshold time-to-live))

(defn with-no-grad [executed-fn]
  (let [no-grad (t/no_grad)]
    (try
      (py. no-grad __enter__)
      (executed-fn)
      (finally
        (py. no-grad __exit__)))))

(defn generate-input-string-token-information [sentences path]
  (let [tokenz (py. (memonized-get-tokenizer-from path) tokenize (str "[CLS]" (clojure.string/join "[SEP]"
                                                                                                   (if (coll? sentences) sentences [sentences]))))
        indexed-tokenz (py. (memonized-get-tokenizer-from path) convert_tokens_to_ids tokenz)
        tokenz-tensor (t/tensor [indexed-tokenz])]

    {:tokenz                tokenz
     :indexed-tokenz-tensor tokenz-tensor
     :segments-tensors      (:segments-tensors (reduce #(if (= "[SEP]" %2)
                                                          (update-in
                                                            (update-in %1 [:segments-tensors] conj (:sentence-id %1))
                                                            [:sentence-id] inc)
                                                          (update-in %1 [:segments-tensors] conj (:sentence-id %1))
                                                          ) {:sentence-id      0
                                                             :segments-tensors []} tokenz))}
    ))

(defn get-word-embeddings-for-token-information [token-information ^String path]
  (let [index-tensor (:indexed-tokenz-tensor token-information)]
    (py. (t/squeeze (t/stack
                      (with-no-grad
                        #(nth ((memonized-get-bert-model path)
                               index-tensor
                               (t/tensor [(:segments-tensors token-information)])) 2))
                      :dim 0) :dim 1) permute 1 0 2)))

(defn token-embeddings-of-sentences [sentences ^String path]
  (get-word-embeddings-for-token-information (generate-input-string-token-information sentences path) path))

(defn get-scalar-value [tensor-scalar]
  (py. tensor-scalar item))

(defn concat-last-n-layers [n token-embedding]
  (pmap #(into [] (map get-scalar-value
                       (apply concat (take-last n %1)))) token-embedding))

(defn get-second-last-layer [token-embedding]
  (doall (map #(py/get-item %1 -2) token-embedding)))
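As an aside, the same no-grad scope can also be written with libpython-clj's py/with, which drives __enter__/__exit__ on the context manager for you; a minimal sketch equivalent to the hand-rolled with-no-grad above:

;; Sketch: py/with manages the context manager's __enter__/__exit__,
;; so the body runs inside torch.no_grad() just like with-no-grad.
(defn with-no-grad* [executed-fn]
  (py/with [_ (t/no_grad)]
    (executed-fn)))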

If I now use the call (BERT/get-second-last-layer (BERT/token-embeddings-of-sentences "Die 87-jährige Franziska R. wird wegen zunehmender Dyspnoe und Hustenanfällen aus der Augenklinik in die internistische Abteilung eines Klinikums verlegt." BERT/german-bert-path)) in a namespace where I used (:require [python.bert.test :as BERT]), it works fine without any problems.

But if I want to work with a sequence of sentences via (map #(BERT/get-second-last-layer (BERT/token-embeddings-of-sentences %1 BERT/german-bert-path)) [<Many sentences>]) or (map #(BERT/token-embeddings-of-sentences %1 BERT/german-bert-path) [<Many sentences>]), the program gets stuck at random points and I need to restart the entire REPL. On average the program gets stuck after 3 calls.

I get no error message, and if I stop the execution and make the call again, the program gets stuck again. The same happens if I use a loop or a for.

I tried many approaches: transforming the tensor directly into a list and working with that, working only with the tensor, and saving the loaded models directly as vars in the namespace instead of using memoize, but I always run into the same weird behavior.
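For reference, the tensor-to-list conversion I mean looks roughly like this (a sketch, assuming a CPU tensor; it does not avoid the hang):

;; tolist pulls the values into plain Python lists; ->jvm then copies
;; them into Clojure vectors.
(defn tensor->clj [tensor]
  (py/->jvm (py. tensor tolist)))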

Does anybody have an idea what I'm doing wrong?

Best Goldritter

Goldritter commented 3 years ago

Short update: I tried to make the function "get-word-embeddings-for-token-information" more Python-idiomatic and removed the nth call, but I get the same behavior described above.

(defn get-word-embeddings-for-token-information [token-information ^String path]
  (py. (t/squeeze (t/stack
                    (py/with [r (t/no_grad)]
                             (py/get-item
                               ((memonized-get-bert-model path)
                                (:indexed-tokenz-tensor token-information)
                                (t/tensor [(:segments-tensors token-information)]))
                               2))
                    :dim 0) :dim 1) permute 1 0 2))
Goldritter commented 3 years ago

After I changed "get-word-embeddings-for-token-information" and loaded the model directly at the beginning, I could make at least 50 calls of (map #(BERT/token-embeddings-of-sentences %1 BERT/german-bert-path) [<Many Sentences>]), but when I wanted to make another 50 calls the program blocked again.

I tested this on a Windows 10 system with Python 3.8.5 and on Ubuntu Linux with Python 3.8. On both systems I encounter the same problem. Both use the same PyTorch and Hugging Face Transformers versions.

(def my-bert-model (py.
                     (py. BertModel from_pretrained german-bert-path :output_hidden_states true)
                     eval))

(defn get-word-embeddings-for-token-information [token-information &{:keys [getter-fn]
                                                                     :or {getter-fn identity}}]
  (py/with [r (t/no_grad)]
           (getter-fn (py. (t/squeeze (t/stack
                             (py/get-item
                               (my-bert-model
                                 (:indexed-tokenz-tensor token-information)
                                 (t/tensor [(:segments-tensors token-information)]))
                               2) :dim 0)
                           :dim 1) permute 1 0 2))))

And before I forget: I tried to implement the steps from this blog post (https://medium.com/@dhartidhami/understanding-bert-word-embeddings-7dc4d2ea54ca).

I have no clue now what I have to change so that the program works without the random blocks when I use tensors.

jjtolton commented 3 years ago

I notice some of the syntax you're using is from an older version of libpython-clj. Could you paste in your deps.edn or project.clj and the output of pip freeze?

Goldritter commented 3 years ago

No problem. Here is the output of pip freeze:

absl-py==0.10.0 astunparse==1.6.3 attrs==20.2.0 cachetools==4.1.1 certifi==2020.6.20 cffi==1.14.3 chardet==3.0.4 click==7.1.2 cycler==0.10.0 Cython==0.29.21 dataclasses==0.6 dill==0.3.2 dm-tree==0.1.5 filelock==3.0.12 future==0.18.2 gast==0.3.3 gin-config==0.3.0 google-api-core==1.22.2 google-api-python-client==1.12.1 google-auth==1.21.1 google-auth-httplib2==0.0.4 google-auth-oauthlib==0.4.1 google-cloud-bigquery==1.27.2 google-cloud-core==1.4.1 google-crc32c==1.0.0 google-pasta==0.2.0 google-resumable-media==1.0.0 googleapis-common-protos==1.52.0 grpcio==1.32.0 h5py==2.10.0 httplib2==0.18.1 idna==2.10 joblib==0.16.0 kaggle==1.5.8 Keras==2.4.3 Keras-Preprocessing==1.1.2 kiwisolver==1.2.0 Markdown==3.2.2 matplotlib==3.3.2 numpy==1.18.5 oauthlib==3.1.0 opencv-python-headless==4.4.0.42 opt-einsum==3.3.0 packaging==20.4 pandas==1.1.2 Pillow==7.2.0 promise==2.3 protobuf==3.13.0 psutil==5.7.2 py-cpuinfo==7.0.0 pyasn1==0.4.8 pyasn1-modules==0.2.8 pycparser==2.20 pyparsing==2.4.7 python-dateutil==2.8.1 python-slugify==4.0.1 pytz==2020.1 PyYAML==5.3.1 regex==2020.7.14 requests==2.24.0 requests-oauthlib==1.3.0 rsa==4.6 sacremoses==0.0.43 scipy==1.4.1 sentencepiece==0.1.91 six==1.15.0 slugify==0.0.1 tensorboard==2.3.0 tensorboard-plugin-wit==1.7.0 tensorflow==2.3.0 tensorflow-addons==0.11.2 tensorflow-datasets==3.2.1 tensorflow-estimator==2.3.0 tensorflow-hub==0.9.0 tensorflow-metadata==0.24.0 tensorflow-model-optimization==0.5.0 termcolor==1.1.0 text-unidecode==1.3 tf-models-official==2.3.0 tf-slim==1.1.0 tokenizers==0.8.1rc2 torch==1.6.0+cpu torchvision==0.7.0+cpu tqdm==4.49.0 transformers==3.1.0 typeguard==2.9.1 uritemplate==3.0.1 urllib3==1.24.3 Werkzeug==1.0.1 wrapt==1.12.1

And here the project.clj:

(defproject QCLASH "0.1.0-SNAPSHOT"
  :description ""
  :license {:name "Eclipse Public License"
            :url  "http://www.eclipse.org/legal/epl-v10.html"}
  :dependencies [
                 ;; Clojure core
                 [org.clojure/clojure "1.10.0"]
                 [clj-time "0.15.0"]
                 [org.clojure/core.memoize "0.7.2"]
                 [lein-jdk-tools "0.1.1"]
                 [org.clojure/math.combinatorics "0.1.6"]
                 [clojurewerkz/propertied "1.2.0"]
                 [clojure.java-time "0.3.2"]
                 [com.climate/claypoole "1.1.4"]
                 [nrepl/nrepl "0.8.2"]
                 [org.wikiclean/wikiclean "1.2"]

                 [environ "1.2.0"]           

                 ;; Web
                 [compojure "1.6.1"]
                 [http-kit "2.3.0"]
                 [clj-http "3.10.1"]
                 [ring/ring-defaults "0.3.2"]
                 [buddy/buddy-auth "2.2.0"]

                 ;; Google/Guava
                 [com.google.code.gson/gson "2.8.5"]
                 [com.google.guava/guava "28.1-jre"]

                 ;; Im- & Exports
                 [net.sourceforge.htmlunit/htmlunit "2.19"]
                 [cheshire "5.10.0"]
                 [org.clojure/data.csv "0.1.3"]
                 [ultra-csv "0.2.3"]
                 [com.taoensso/nippy "2.14.0"]
                 [com.taoensso/timbre "4.10.0"]
                 [uk.org.russet/tawny-owl "2.0.2"]

                 ;Matrix
                 [net.mikera/core.matrix "0.62.0"]
                 [net.mikera/vectorz-clj "0.48.0"]
                 [clatrix "0.5.0"]
                 [uncomplicate/neanderthal "0.27.0"]

                 ;; Graph
                 [aysylu/loom "1.0.2"]
                 [ubergraph "0.8.1"]

                 ;; Data Bases
                 [com.novemberain/monger "3.5.0"]
                 [korma "0.4.3"]
                 [org.clojure/java.jdbc "0.7.10"]
                 [org.postgresql/postgresql "42.2.6"]
                 [org.mongodb/mongo-java-driver "3.10.2"]

                 ;; Knowledge Engineering / Statistical analysis
                 [incanter "1.9.3" :exclusions [incanter/incanter-pdf]]
                 [nz.ac.waikato.cms.weka/weka-stable "3.8.4"]
                 [nz.ac.waikato.cms.weka/chiSquaredAttributeEval "1.0.4"]
                 [nz.ac.waikato.cms.weka/LibSVM "1.0.10"]
                 [com.datumbox/libsvm "3.23"]
                 [clj-fuzzy "0.4.1"]

                 [de.lmu.ifi.dbs.elki/elki "0.7.5" :exclusions [net.jafama/jafama it.unimi.dsi/fastutil org.apache.logging.log4j/log4j
                                                                nz.ac.waikato.cms.weka/weka-dev tw.edu.ntu.csie/libsvm
                                                                org.apache.xmlgraphics/batik-js]]

                 [org.deeplearning4j/deeplearning4j-core "1.0.0-beta7"]
                 [org.deeplearning4j/deeplearning4j-nlp "1.0.0-beta7"]
                 [org.deeplearning4j/deeplearning4j-zoo "1.0.0-beta7"]

                 [org.nd4j/nd4j-cuda-10.1-platform "1.0.0-beta7"]
                 [org.nd4j/nd4j-api "1.0.0-beta7"]
                 [org.nd4j/nd4j-native "1.0.0-beta7"]
                 [org.nd4j/jackson "1.0.0-beta7"]

                 [org.springframework/spring-core "5.1.8.RELEASE"]

                 ;; Python and Tensorflow
                 [org.tensorflow/tensorflow "1.15.0"]
                 [org.tensorflow/libtensorflow_jni_gpu "1.15.0"]
                 [clj-python/libpython-clj "1.46"]

                 ;; Logging
                 [org.clojure/tools.logging "1.1.0"]
                 [org.apache.logging.log4j/log4j-core "2.13.0"]
                 [org.apache.logging.log4j/log4j-slf4j-impl "2.13.0"]

                 ;; NLP
                 ;; [org.apache.commons/commons-lang3 "3.9"]
                 ;; [org.languagetool/languagetool-core "4.7"]
                 ;; [org.languagetool/language-all "4.7"]
                 [org.bridgei2i/word2vec "0.2.2"]
                 [com.medallia.word2vec/Word2VecJava "0.10.3"]
                 [de.danielnaber/jwordsplitter "4.4"]       ;; can split German words into separate nouns
                 [org.jsoup/jsoup "1.9.1"]
                 [edu.stanford.nlp/stanford-corenlp "3.9.2"]
                 [edu.stanford.nlp/stanford-corenlp "3.9.2" :classifier "models"]
                 [edu.stanford.nlp/stanford-corenlp "3.9.2" :classifier "models-german"]
                 [edu.stanford.nlp/stanford-corenlp "3.9.2" :classifier "models-french"]
                 [marcliberatore.mallet-lda/marcliberatore.mallet-lda "0.1.1"]
                 [org.apache.opennlp/opennlp-tools "1.9.1"] ;;Used only for the stemmer
                 ]

  :exclusions [org.apache.xmlgraphics/batik-js
               org.apache.logging.log4j/log4j
               nz.ac.waikato.cms.weka/weka-dev org.apache.xmlgraphics/batik-js
               tw.edu.ntu.csie/libsvm org.apache.xmlgraphics/batik-js]

  :repl-options {:init-ns python.config}
  :plugins [[lein-environ "1.2.0"]]

  :profiles {:dev             [:project/dev :profiles/dev]
             :linux         [:project/linux :profiles/linux]
             ;; only edit :profiles/* in profiles.clj
             :profiles/dev    {:env {:python-executable "C:/opt/Python/Python3.8"
                                     :library-path      "C:/opt/Python/Python3.8/python38.dll"}}
             :profiles/linux {:env {:python-executable "/usr/bin/python3"
                                     :library-path      "/usr/lib/python3.8/"}}
             :project/dev     {}
             :project/linux  {}}

  ;; Java relevant information
  :java-source-paths ["src/java"]
  :resource-paths ["resources/"]
  :jvm-opts ["-Xmx40g" "-Xms24g" "-XX:-OmitStackTraceInFastThrow" "-XX:-UseGCOverheadLimit" "--add-opens=java.base/jdk.internal.ref=ALL-UNNAMED"
             "-Dclojure.tools.loggin.factory=clojure.tools.logging.impl/log4j2-factory"]
  :repositories [["java.net" "http://download.java.net/maven/2"]])

I hope this helps.

jjtolton commented 3 years ago

Cool. I'll give some tips b/c I won't have time to dive in for a little while. Version info looks solid from what I can see.

The known sources of REPL hang behaviour with 1.46 fall into two categories:

  1. Concurrency, intentional or otherwise; and
  2. User error or misunderstanding REPL output

On the first point, there's a known issue with pytorch and a few other libraries that use concurrency under the hood. If you can narrow it down to a concurrency issue, I can help you monkeypatch the Python code.

On the second point, the way this usually occurs is that the REPL isn't actually "hanging", it's just taking longer than you would expect to compute the result. This typically happens when porting a routine/algo from Python -> libpython-clj and introducing accidental time complexity or some other minor bug. One place I often see this:

for ax, bx in zip(a, b):
    something(ax, bx)

is not the same as

(for [ax a bx b] (something ax bx))

The Clojure example is O(a*b) time complexity, while the Python example is O(min(a, b)): the nested for binding is a Cartesian product, not a zip. Be on the lookout for little things like that, which can often be subtle.
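The faithful translation walks both sequences in lockstep; a sketch using the hypothetical names from the snippet above:

;; map vector pairs up elements like Python's zip, stopping at the
;; shorter sequence, so this makes O(min(a, b)) calls to something.
(doseq [[ax bx] (map vector a b)]
  (something ax bx))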

One other tip:

(-> huge-vec (python-fn1) (python-fn2))

If, in the above example, python-fn1 and python-fn2 return huge vectors (like 100,000+ items), you are going to pay a heavy copy price, so don't do things like that in a tight loop. If you want to go the zero-copy route, Chris has put in excellent support for numpy arrays.
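A sketch of what the zero-copy path can look like (some-cpu-tensor is a placeholder, and this assumes the tensor lives on the CPU):

;; torch's .numpy() shares memory with a CPU tensor; as-tensor then
;; exposes that numpy array to the JVM without copying the elements.
(let [np-arr   (py. some-cpu-tensor numpy)
      jvm-view (py/as-tensor np-arr)]
  jvm-view)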

If you're still having issues, keep posting until we get it worked out, or stop by the Zulip chat.

Goldritter commented 3 years ago

Thanks for the tips.

If this helps, I performed some tests with debugging output where the number of sentences was 13 (always the same sentence). The output was the sentence number that was currently being processed, plus the steps ("BERT" for the function "get-word-embeddings-for-token-information" and "Second layer" for the function "get-second-last-layer"). Processing a single sentence never took more than 1 second.

For the first iterations I get the output:

Sentence: 0, BERT, Second layer
Sentence: 1, BERT, Second layer
...
Sentence: 12, BERT, Second layer

Then, after I have called it many times, it looks like:

Sentence: 0, BERT, Second layer
Sentence: 1, BERT, Second layer
...
Sentence: 7, BERT, Second layer

and here it blocks (sometimes after sentence number 7, 9 or 10).

If I do not use the second-layer function, the first batch of 50 calls looks like:

Sentence: 0, BERT
Sentence: 1, BERT
...
Sentence: 12, BERT

repeated 50 times. Then the second batch of 50 looks the same for a few repetitions, and then:

Sentence: 0, BERT
Sentence: 1, BERT
...
Sentence: 7, BERT

I will add some more debug output to my code to determine whether the loaded BERT model is the issue due to "under the hood" concurrency, or whether the problem is the huge tensors. If I learn more I will write again.

Goldritter commented 3 years ago

One thing I noticed now: I added debug output where I load the BERT model and define the time-to-live values.

(def my-bert-model
  (do (println "Load Bert model")
      (py.
        (py. BertModel from_pretrained german-bert-path :output_hidden_states true)
        eval)))

(def time-to-live
  (do (println "Define time to live")
          (* 1000 60 10)))

When I first load the namespace into the REPL, everything works fine. But when I reload it later, the program blocks.

Load Bert model
Define time to live
Loaded
(in-ns 'python.bert.test)
=> #object[clojure.lang.Namespace 0x28d1f3a "python.bert.test"]
Loading src/python/bert/test.clj...
Load Bert model

I use IntelliJ with the Cursive plugin as my IDE.

It seems the model might be problematic under some circumstances. I will dig deeper into it and add new information here.
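One way to at least avoid re-running the model load on every reload of the file is defonce; a sketch based on the definition above (it does not explain the hang itself):

;; defonce keeps the value across namespace reloads, so reloading the
;; file does not call from_pretrained a second time.
(defonce my-bert-model
  (do (println "Load Bert model")
      (py.
        (py. BertModel from_pretrained german-bert-path :output_hidden_states true)
        eval)))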

Goldritter commented 3 years ago

It seems I cannot get a grip on where the problem is.

Here is the current code. Now I cannot even parse the first sentence, and the tokenizer is not loaded. But yesterday the tokenizer loaded without problems in the memoized functions.

To test, I used (count (get-second-last-layer-embedding sentences)). One more piece of information: in the Task Manager the CPU goes high for a few seconds (70%-80% on an i5) and then drops down to 6%.

(ns python.bert.test
  (:require
    [clojure.core.memoize :as MEMO]
    [libpython-clj.require :refer [require-python]]
    [libpython-clj.python :as py :refer [py. py.. py.- run-simple-string
                                         as-python as-jvm
                                         ->python ->jvm
                                         get-attr call-attr call-attr-kw
                                         get-item att-type-map
                                         call call-kw initialize!
                                         as-numpy as-tensor ->numpy
                                         run-simple-string
                                         add-module module-dict
                                         import-module
                                         python-type]
     ]))

(require-python '[torch :as t])
(require-python '[builtins])
(py/from-import transformers BertTokenizer BertModel)

(def german-bert-path "./resources/nlp/german_bert/")

(defn get-tokenizer-from [^String path]
  (py. BertTokenizer from_pretrained path))

(defn get-bert-model [^String path]
  (py.
    (py. BertModel from_pretrained path :output_hidden_states true)
    eval))

(def my-bert-model
  (do (println "Load Bert model")
      (py.
        (py. BertModel from_pretrained german-bert-path :output_hidden_states true)
        eval)))

(def my-bert-tokenizer
  (do (println "Load Bert tokenizer")
      (py. BertTokenizer from_pretrained german-bert-path)))

(def time-to-live
  (do (println "Define time to live")
      (* 1000 60 10)))

(def memonized-get-tokenizer-from (MEMO/ttl get-tokenizer-from :ttl/threshold time-to-live))
(def memonized-get-bert-model (MEMO/ttl get-bert-model :ttl/threshold time-to-live))

(defn generate-input-string-token-information [sentences path]
  (do (print "Tokenize")
      (let [tokenz (py. my-bert-tokenizer tokenize (str "[CLS]" (clojure.string/join "[SEP]"
                                                                                      (if (coll? sentences) sentences [sentences]))))
            indexed-tokenz (py. my-bert-tokenizer convert_tokens_to_ids tokenz)
            tokenz-tensor (t/tensor [indexed-tokenz])]

        {:tokenz                tokenz
         :indexed-tokenz-tensor tokenz-tensor
         :segments-tensors      (:segments-tensors (reduce #(if (= "[SEP]" %2)
                                                              (update-in
                                                                (update-in %1 [:segments-tensors] conj (:sentence-id %1))
                                                                [:sentence-id] inc)
                                                              (update-in %1 [:segments-tensors] conj (:sentence-id %1))
                                                              ) {:sentence-id      0
                                                                 :segments-tensors []} tokenz))}
        )))

(defn get-word-embeddings-for-token-information [token-information & {:keys [getter-fn]
                                                                      :or   {getter-fn identity}}]
  (do (print "Start Bert, ")
      (py/with [r (t/no_grad)]
               (getter-fn
                 (py. (t/squeeze (t/stack
                                   (py/get-item
                                     (my-bert-model
                                       (:indexed-tokenz-tensor token-information)
                                       (t/tensor [(:segments-tensors token-information)]))
                                     2) :dim 0)
                                 :dim 1) permute 1 0 2)))))

(defn token-embeddings-of-sentences [sentences ^String path & {:keys [getter-fn]
                                                               :or   {getter-fn identity}}]
  (get-word-embeddings-for-token-information (generate-input-string-token-information sentences path)
                                             :getter-fn getter-fn
                                             ))

(defn get-scalar-value [tensor-scalar]
  (py. tensor-scalar item))

(defn concat-last-n-layers [n token-embedding]
  (pmap #(into [] (map get-scalar-value
                       (apply concat (take-last n %1)))) token-embedding))

(defn get-second-last-layer [token-embedding]
  (do (println "Second-Layer")
      (doall (map #(py/get-item %1 -2) token-embedding))))

(defn get-second-last-layer-embedding [sentences]
  (let [used-sentences (if (coll? sentences) sentences [sentences])]
    (if (> (count (apply concat used-sentences)) 1000)
      (doall
        (map #(do
                (println "")
                (println (str "Sentence:" %2))
                (token-embeddings-of-sentences %1 german-bert-path))
             used-sentences (range 0 (count used-sentences))))
      (token-embeddings-of-sentences used-sentences german-bert-path))))

(def sentences
  ["Die 87-jährige Franziska R. wird wegen zunehmender Dyspnoe und Hustenanfällen aus der Augenklinik in die internistische Abteilung eines Klinikums verlegt."
   "Die 87-jährige Franziska R. wird wegen zunehmender Dyspnoe und Hustenanfällen aus der Augenklinik in die internistische Abteilung eines Klinikums verlegt."
   "Die 87-jährige Franziska R. wird wegen zunehmender Dyspnoe und Hustenanfällen aus der Augenklinik in die internistische Abteilung eines Klinikums verlegt."
   "Die 87-jährige Franziska R. wird wegen zunehmender Dyspnoe und Hustenanfällen aus der Augenklinik in die internistische Abteilung eines Klinikums verlegt."
   "Die 87-jährige Franziska R. wird wegen zunehmender Dyspnoe und Hustenanfällen aus der Augenklinik in die internistische Abteilung eines Klinikums verlegt."
   "Die 87-jährige Franziska R. wird wegen zunehmender Dyspnoe und Hustenanfällen aus der Augenklinik in die internistische Abteilung eines Klinikums verlegt."
   "Die 87-jährige Franziska R. wird wegen zunehmender Dyspnoe und Hustenanfällen aus der Augenklinik in die internistische Abteilung eines Klinikums verlegt."
   "Die 87-jährige Franziska R. wird wegen zunehmender Dyspnoe und Hustenanfällen aus der Augenklinik in die internistische Abteilung eines Klinikums verlegt."
   "Die 87-jährige Franziska R. wird wegen zunehmender Dyspnoe und Hustenanfällen aus der Augenklinik in die internistische Abteilung eines Klinikums verlegt."
   "Die 87-jährige Franziska R. wird wegen zunehmender Dyspnoe und Hustenanfällen aus der Augenklinik in die internistische Abteilung eines Klinikums verlegt."
   "Die 87-jährige Franziska R. wird wegen zunehmender Dyspnoe und Hustenanfällen aus der Augenklinik in die internistische Abteilung eines Klinikums verlegt."
   "Die 87-jährige Franziska R. wird wegen zunehmender Dyspnoe und Hustenanfällen aus der Augenklinik in die internistische Abteilung eines Klinikums verlegt."
   "Die 87-jährige Franziska R. wird wegen zunehmender Dyspnoe und Hustenanfällen aus der Augenklinik in die internistische Abteilung eines Klinikums verlegt."])
cnuernber commented 3 years ago

We have had several major problems specifically with pytorch. We need to liaise with a pytorch developer.

Searching issues for pytorch I think will be helpful/disappointing.

Goldritter commented 3 years ago

If it helps to get closer to the problem: I tested some more ideas, changed some loading orders in the code, and added debug output (see the end of this post), then executed the following call:

(def result (map #(get-second-last-layer-embedding %1 %2) (repeat 13 sentences) (range 1 14)))
(count result)

The output was:

Iteration (1), sentence (0): Start embedding, Tokenize, Start Bert,
Iteration (1), sentence (1): Start embedding, Tokenize, Start Bert,
Iteration (1), sentence (2): Start embedding, Tokenize, Start Bert,
Iteration (1), sentence (3): Start embedding, Tokenize, Start Bert,
...
Iteration (9), sentence (9): Start embedding, Tokenize, Start Bert,
Iteration (9), sentence (10): Start embedding, Tokenize, Start Bert,
Iteration (9), sentence (11): Start embedding, Tokenize, Start Bert,
Iteration (9), sentence (12): Start embedding, Tokenize, Start Bert,
Iteration (10), sentence (0):

What is odd is that the function "get-second-last-layer-embedding" was called and executed the prints in the do of the map part, but "token-embeddings-of-sentences" was not executed.

I fear that this is not only a PyTorch "problem" but also a Hugging Face Transformers problem, because when I changed the loading order from model-then-tokenizer to tokenizer-then-model, I could start the process and perform some more iterations.

By the way, for another problem I used the function "run-simple-string" and ended up with a lot of global variables in the Python environment. Is there a way to check the memory usage of the Python environment? I do not think the JVM memory usage shown in the Task Manager also includes the memory usage of the Python environment. I ask because I think the model might cause a memory leak in the Python environment, and that might be the reason for the problem. Or does Python throw an exception in this case?
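For what it's worth, libpython-clj embeds CPython in the JVM process, so the process memory shown in the Task Manager should already cover the Python side. One way to peek at it from the REPL is psutil, which appears in the pip freeze above; a sketch:

(require-python '[psutil])

;; psutil.Process() defaults to the current pid; memory_info().rss is the
;; resident set size of the whole JVM + embedded CPython process, in bytes.
(defn python-process-memory-mb []
  (-> (psutil/Process)
      (py. memory_info)
      (py.- rss)
      (/ (* 1024.0 1024.0))))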

(ns python.bert.test
  (:require
    [clojure.core.memoize :as MEMO]
    [libpython-clj.require :refer [require-python]]
    [libpython-clj.python :as py :refer [py. py.. py.- run-simple-string
                                         as-python as-jvm
                                         ->python ->jvm
                                         get-attr call-attr call-attr-kw
                                         get-item att-type-map
                                         call call-kw initialize!
                                         as-numpy as-tensor ->numpy
                                         run-simple-string
                                         add-module module-dict
                                         import-module
                                         python-type]
     ]))

(require-python '[torch :as t])
(require-python '[builtins])
(py/from-import transformers BertTokenizer BertModel)

(def german-bert-path "./resources/nlp/german_bert/")

(defn get-tokenizer-from [^String path]
  (py. BertTokenizer from_pretrained path))

(defn get-bert-model [^String path]
  (py.
    (py. BertModel from_pretrained path :output_hidden_states true)
    eval))

(def my-bert-tokenizer
  (do (println "Load Bert tokenizer")
      (py. BertTokenizer from_pretrained german-bert-path)))

(def my-bert-model
  (do (println "Load Bert model")
      (py.
        (py. BertModel from_pretrained german-bert-path :output_hidden_states true)
        eval)))

(def time-to-live
  (do (println "Define time to live")
      (* 1000 60 10)))

(def memonized-get-tokenizer-from (MEMO/ttl get-tokenizer-from :ttl/threshold time-to-live))
(def memonized-get-bert-model (MEMO/ttl get-bert-model :ttl/threshold time-to-live))

(defn generate-input-string-token-information [sentences]
  (do (print "Tokenize, ")
      (let [tokenz (py. my-bert-tokenizer tokenize (str "[CLS]" (clojure.string/join "[SEP]"
                                                                                     (if (coll? sentences) sentences [sentences]))))
            indexed-tokenz (py. my-bert-tokenizer convert_tokens_to_ids tokenz)
            tokenz-tensor (t/tensor [indexed-tokenz])]

        {:tokenz                tokenz
         :indexed-tokenz-tensor tokenz-tensor
         :segments-tensors      (:segments-tensors (reduce #(if (= "[SEP]" %2)
                                                              (update-in
                                                                (update-in %1 [:segments-tensors] conj (:sentence-id %1))
                                                                [:sentence-id] inc)
                                                              (update-in %1 [:segments-tensors] conj (:sentence-id %1))
                                                              ) {:sentence-id      0
                                                                 :segments-tensors []} tokenz))}
        )))

(defn get-word-embeddings-for-token-information [token-information & {:keys [getter-fn]
                                                                      :or   {getter-fn identity}}]
  (do (print "Start Bert, ")
      (py/with [r (t/no_grad)]
               (py. (t/squeeze (t/stack

                                 (py/get-item
                                   (my-bert-model
                                     (:indexed-tokenz-tensor token-information)
                                     (t/tensor [(:segments-tensors token-information)]))
                                   2)

                                 :dim 0)
                               :dim 1) permute 1 0 2))))

(defn token-embeddings-of-sentences [sentences & {:keys [getter-fn]
                                                  :or   {getter-fn identity}}]
  (do (print "Start embedding, ")
      (get-word-embeddings-for-token-information
        (generate-input-string-token-information sentences)
        :getter-fn getter-fn
        )))

(defn get-scalar-value [tensor-scalar]
  (py. tensor-scalar item))

(defn concat-last-n-layers [n token-embedding]
  (pmap #(into [] (map get-scalar-value
                       (apply concat (take-last n %1)))) token-embedding))

(defn get-second-last-layer [token-embedding]
  (do (println "Second-Layer")
      (doall (map #(py/get-item %1 -2) token-embedding))))

(defn get-second-last-layer-embedding [sentences iteration]
  (let [used-sentences (if (coll? sentences) sentences [sentences])]
    (if (> (count (apply concat used-sentences)) 1000)
      (doall
        (map #(do
                (println "")
                (println (str "Iteration (" iteration"), sentence (" %2 "): "))
                (token-embeddings-of-sentences %1))
             used-sentences (range 0 (count used-sentences))))
      (token-embeddings-of-sentences used-sentences))))

(def sentences
  ["Die 87-jährige Franziska R. wird wegen zunehmender Dyspnoe und Hustenanfällen aus der Augenklinik in die internistische Abteilung eines Klinikums verlegt."
   "Die 87-jährige Franziska R. wird wegen zunehmender Dyspnoe und Hustenanfällen aus der Augenklinik in die internistische Abteilung eines Klinikums verlegt."
   "Die 87-jährige Franziska R. wird wegen zunehmender Dyspnoe und Hustenanfällen aus der Augenklinik in die internistische Abteilung eines Klinikums verlegt."
   "Die 87-jährige Franziska R. wird wegen zunehmender Dyspnoe und Hustenanfällen aus der Augenklinik in die internistische Abteilung eines Klinikums verlegt."
   "Die 87-jährige Franziska R. wird wegen zunehmender Dyspnoe und Hustenanfällen aus der Augenklinik in die internistische Abteilung eines Klinikums verlegt."
   "Die 87-jährige Franziska R. wird wegen zunehmender Dyspnoe und Hustenanfällen aus der Augenklinik in die internistische Abteilung eines Klinikums verlegt."
   "Die 87-jährige Franziska R. wird wegen zunehmender Dyspnoe und Hustenanfällen aus der Augenklinik in die internistische Abteilung eines Klinikums verlegt."
   "Die 87-jährige Franziska R. wird wegen zunehmender Dyspnoe und Hustenanfällen aus der Augenklinik in die internistische Abteilung eines Klinikums verlegt."
   "Die 87-jährige Franziska R. wird wegen zunehmender Dyspnoe und Hustenanfällen aus der Augenklinik in die internistische Abteilung eines Klinikums verlegt."
   "Die 87-jährige Franziska R. wird wegen zunehmender Dyspnoe und Hustenanfällen aus der Augenklinik in die internistische Abteilung eines Klinikums verlegt."
   "Die 87-jährige Franziska R. wird wegen zunehmender Dyspnoe und Hustenanfällen aus der Augenklinik in die internistische Abteilung eines Klinikums verlegt."
   "Die 87-jährige Franziska R. wird wegen zunehmender Dyspnoe und Hustenanfällen aus der Augenklinik in die internistische Abteilung eines Klinikums verlegt."
   "Die 87-jährige Franziska R. wird wegen zunehmender Dyspnoe und Hustenanfällen aus der Augenklinik in die internistische Abteilung eines Klinikums verlegt."])
jjtolton commented 3 years ago

This is almost certainly at least one of the problems:

(defn concat-last-n-layers [n token-embedding]
  (pmap #(into [] (map get-scalar-value
                       (apply concat (take-last n %1)))) token-embedding))

as get-scalar-value corresponds to

(defn get-scalar-value [tensor-scalar]
  (py. tensor-scalar item))

Python concurrency and Clojure concurrency do not mix! You WILL hit thread deadlock and/or GIL lock fairly quickly.
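A single-threaded sketch of the same function, keeping every .item() call on the calling thread:

;; mapv is eager and single-threaded, unlike pmap, so all of the Python
;; calls happen on one thread instead of a Clojure thread pool.
(defn concat-last-n-layers-serial [n token-embedding]
  (mapv #(into [] (map get-scalar-value
                       (apply concat (take-last n %1))))
        token-embedding))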

Goldritter commented 3 years ago

OK, thanks for the info. At least I did not use these two functions in my tests; I test with "get-second-last-layer-embedding", which calls "token-embeddings-of-sentences".

But good to know that.

jjtolton commented 3 years ago

Is it possible for you to make a GitHub project with steps on how to reproduce this? I'll see if I can pinpoint the issue.

Goldritter commented 3 years ago

I created a repository with the code and the project.clj I use and invited you as a collaborator.

You also need the pytorch_model.bin, tf_model.h5, config.json and vocab.txt.

Because the large file storage does not work correctly for me, these files can be downloaded here: https://drive.google.com/drive/folders/1lfeiVRk37WfwQvgziKeKrGe_HbeVuABC?usp=sharing

The settings for the Python environment are done in the project.clj under :profiles. I use lein-environ to set the values and read them in python.config.

You can find the code in "python.bert.test". Before it works, you must copy the above-mentioned files into "./resources/nlp/german_bert/" or change the variable "german-bert-path" to the folder where these files are.

Besides PyTorch you also need Hugging Face Transformers (https://github.com/huggingface/transformers): pip install transformers

To reproduce the behavior, load and go into the namespace "python.bert.test" and run:

(def result (map #(get-second-last-layer-embedding %1 %2) (repeat 13 sentences) (range 1 14)))
(count result)

Normally it should be enough to make the call once.

Thanks for your help and if you have any more questions, don't hesitate to ask.

jjtolton commented 3 years ago

@Goldritter got it. I'll take a look this weekend or sooner if I can!

jjtolton commented 3 years ago

Okay, I finally got around to this. I don't really know how to explain it, but get-second-last-layer-embedding is acting as if it were a web API or something. Meaning, somehow a second call to get-second-last-layer-embedding is able to start before the first one finishes if they are in a tight loop. This is causing the interpreter to lock up. So I treated it as an async problem, and it works now. There's a lot going on here, so it's hard for me to pin down a root cause. The working version of the code is as follows:

(require '[clojure.core.async :as a :refer [go go-loop]])
(def c (a/chan 3))
(a/offer! c true)
(go-loop [sentences (repeat 13 sentences)
          nums      (range 1 14)]
  (println "waiting...")
  (let [nums (or (not-empty nums) (range 1 14))]      
    (when (and (not-empty sentences)
               (a/<! c))
      (println "new layer!")
      (a/>! c (get-second-last-layer-embedding (first sentences) (first nums)))
      (recur (rest sentences) (rest nums)))))

This forces the next call to get-second-last-layer-embedding to occur only after the previous has completed. Do you have any insight into why get-second-last-layer-embedding has this behaviour?

I suspect it comes from this function:

(defn get-word-embeddings-for-token-information [token-information & {:keys [getter-fn]
                                                                      :or   {getter-fn identity}}]
  (do (print "Start Bert, ")
      (py/with [r (t/no_grad)]
               (py. (t/squeeze (t/stack

                                (py/get-item
                                 (my-bert-model
                                  (:indexed-tokenz-tensor token-information)
                                  (t/tensor [(:segments-tensors token-information)]))
                                 2)

                                :dim 0)
                               :dim 1) permute 1 0 2))))

but I'm not sure what this is doing, exactly. This is interesting behaviour though. Are you under the impression that this code is supposed to be synchronous?

Goldritter commented 3 years ago

Thanks for your work.

About get-second-last-layer-embedding: I thought of it as an optimized call for token-embeddings-of-sentences. If there are too many characters/tokens in the sentences, the model raises an error similar to an ArrayOutOfBoundsException, and if I have too many sentences that are processed separately, it takes too much time. So I use an if-clause based on the number of characters to decide whether I can process all sentences together or have to separate them.

Here is the shortened code, without the if-clause and the separate processing of the sentences.

(defn get-second-last-layer-embedding [sentences]
        (map #(token-embeddings-of-sentences %1) sentences))

So yes, if the map function has no unexpected behavior, the cause might be in get-word-embeddings-for-token-information, which should run sequentially.

  1. The BERT model processes the indexed tokens and the segment tensors and returns the results of the layers of the neural net, including the hidden layers: (my-bert-model indexed-tokenz segments-tensor).
  2. The value at index 2 of that result is extracted: (py/get-item x 2).
  3. The stack function for dimension 0 is called on the extracted value: (t/stack x :dim 0).
  4. The squeeze function for dimension 1 is called: (t/squeeze x :dim 1).
  5. The positions are permuted: (py. x permute 1 0 2).

If I understand the process correctly, then all these functions are performed sequentially on the tensor object returned from the BERT model, respectively on the results returned from py., t/squeeze, t/stack and py/get-item. (Perhaps I should use the -> and ->> macros more often.) I tried to ensure that there is only a CPU tensor object with (py. cpu) (see https://pytorch.org/docs/stable/tensors.html), but this had no effect either.
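Written with ->, the same pipeline would look roughly like this (a sketch of steps 1-5 above):

(defn get-word-embeddings-threaded [token-information]
  (py/with [_ (t/no_grad)]
    (-> (my-bert-model (:indexed-tokenz-tensor token-information)
                       (t/tensor [(:segments-tensors token-information)]))
        (py/get-item 2)        ;; the hidden states of the model output
        (t/stack :dim 0)
        (t/squeeze :dim 1)
        (py. permute 1 0 2))))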

I tried to reproduce the steps from https://mccormickml.com/2019/05/14/BERT-word-embeddings-tutorial/ in token-embeddings-of-sentences with my Clojure code, and there all steps are sequential.

If I understand your explanation correctly, then either the BERT model object or the tensor object does some hidden asynchronous work, triggered by my function calls on these objects.

Thanks again for your help.

jjtolton commented 3 years ago

Yes, but my current hypothesis defies all logic and is deeply concerning. It would mean the Clojure interpreter is moving on to the next frame before the current frame returns. I've never seen that inversion-of-control behavior before except in async code, and this is not async code as far as I can tell.

The only other thing I can think of at the moment is that map works in chunks of 32, so when you ask for the first result it really tries to rip through 31 more results. But again, I would expect this to be a linear, synchronous process.
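For completeness, a chunking-free sketch (reduce is strict and realizes exactly one element per step), using the names from the snippets above:

;; reduce forces each embedding to finish before the next sentence is
;; touched, so lazy-seq chunking cannot realize elements ahead of time.
(defn embed-unchunked [sentences]
  (reduce (fn [acc sentence]
            (conj acc (token-embeddings-of-sentences sentence)))
          []
          sentences))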

I can imagine some kind of very advanced trickery dropping down into C and escaping the GIL, or some very exotic inline async code that allows for parallel execution relative to your available threads, which might explain this behavior. But you can see how far I'm grasping to try and wrap my head around this.

jjtolton commented 3 years ago

Is it possible that the return type of get-second-last-layer-embedding is something equivalent to a promise, task, or task-id which is used to collect results later? Or possibly an object wrapping an async promise which somehow self-promotes to a value on completion of the work?

Edit: No, that still wouldn't explain the behavior directly. The async code I posted still wouldn't work either. Hmmm, deeply concerning and very interesting.

Goldritter commented 3 years ago

I get the same behavior with for, doseq and loop, and I don't believe, for example, that loop processes in chunks. I never thought the problem was so profound.

And I don't think the result of get-second-last-layer-embedding is a promise or similar, unless the map function suddenly transforms the passed function #(token-embeddings-of-sentences %1) into a promise.

And token-embeddings-of-sentences should return a tensor object as the result of get-word-embeddings-for-token-information, where I only use the functions py., t/squeeze, t/stack and py/get-item. None of these functions should return a promise or a future, only concrete results.

jjtolton commented 3 years ago

I won't have time for a while, but the next step would be to try to reproduce this in CPython directly, run the code in a tight loop, and see if there is similar behavior. Might have to, or this will haunt my sleep, haha.

Goldritter commented 3 years ago

No problem with the time; luckily it is not time-critical. Thanks for your time and your help.

And I apologize if this keeps you from sleeping.

jjtolton commented 3 years ago

I get the same behavior with for, doseq and loop, and I don't believe, for example, that loop processes in chunks. I never thought the problem was so profound.

And I don't think the result of get-second-last-layer-embedding is a promise or similar, unless the map function suddenly transforms the passed function #(token-embeddings-of-sentences %1) into a promise.

And token-embeddings-of-sentences should return a tensor object as the result of get-word-embeddings-for-token-information, where I only use the functions py., t/squeeze, t/stack and py/get-item. None of these functions should return a promise or a future, only concrete results.

Okay, thanks for reporting -- that saves me about an hour of tinkering. I think maybe I'll open an issue with pytorch and bug them to see if I can get any more insight.

Edit: PyTorch has over 5000 open issues right now, so I doubt we'll hear from them anytime this decade, but issues like https://github.com/pytorch/pytorch/issues/46386 make me think there is a lot of black magic going on with pytorch, and in the short term we'll just have to be very clever in how we work around it. The codebase is also quite large, and besides not having the bandwidth, my expertise stops where the C++ starts. So I think for now, unless we can make friends with the pytorch team, our best bet is to solve libpython-clj pytorch issues on an ad hoc basis, which I'm more than happy to help with!

jjtolton commented 3 years ago

@Goldritter do you feel we can close this out, or do you still require additional support?

Goldritter commented 3 years ago

I think we can close it now.

jjtolton commented 3 years ago

Thanks for the interesting puzzle. We've had ongoing issues with pytorch and I think this gave some valuable insight. Might be able to look at some other tickets now. Appreciate the work you did putting together the repro!

cnuernber commented 3 years ago

Perhaps research similar functionality via mxnet if possible. I have found that project to be much more solid in general. I know, however, that it isn't as far along in terms of some forms of NLP.

Goldritter commented 3 years ago

Thanks for the interesting puzzle. We've had ongoing issues with pytorch and I think this gave some valuable insight. Might be able to look at some other tickets now. Appreciate the work you did putting together the repro!

Thanks.
Glad to hear that this issue might help solve other issues, and that it was not boring to work on it.