Bridgei2i / clojure-word2vec

A Clojure wrapper around a Java implementation of Word2Vec.
Eclipse Public License 1.0
Porting word2phrase #1

Open Prog19 opened 8 years ago

Prog19 commented 8 years ago

A quick workaround for this gap in the Java implementation is to download this code file (from the original C tool), compile it, and run the resulting executable from Clojure. The tool marks multi-word phrases in the training text corpus by joining their words with an underscore. (Refer to 'From words to phrases and beyond' from here.)
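As a rough illustration of that output format, assuming tokens are separated by whitespace and a phrase is joined by an underscore, the marked phrases can be pulled out of a processed line as sketched here. The function name `phrase-tokens` and the sample sentence are made up for illustration and are not part of the C tool or of this project.

```clojure
(require '[clojure.string :as str])

(defn phrase-tokens
  "Hypothetical helper: returns the tokens of `line` that contain an
  underscore, i.e. the tokens that look like joined phrases."
  [line]
  (->> (str/split (str/trim line) #"\s+")   ; whitespace-separated tokens
       (filter #(str/includes? % "_"))      ; keep only underscore-joined ones
       vec))

(phrase-tokens "i flew to new_york last week")
;; => ["new_york"]
```

Note that a token which already contained an underscore in the raw corpus would be picked up as well, so this is only a quick sanity check on the output file.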

Below is the code to run the executable, placed in /resources in the project directory, either through a Java Runtime instance or, alternatively, by shelling out from Clojure. The input is read from /resources/train.txt, the output is written to /resources/output/out.txt, and the other word2phrase training parameters take their default values.

(require '[clojure.java.shell :refer [sh]]
         '[clojure.string :as str])
(import '(java.io BufferedReader InputStreamReader))

(let [;; Project root with forward slashes (Unix style).
      ;; Windows accepts both styles of file path.
      root (.replace (System/getProperty "user.dir") "\\" "/")
      res  (str root "/resources/")]

  ;; Alternative 1 (disabled): java.lang.Runtime. Arguments are passed as
  ;; separate array elements so that paths containing spaces are not split.
  (comment
    (let [cmd  (into-array String [(str res "word2phrase.exe")
                                   "-train"  (str res "train.txt")
                                   "-output" (str res "output/out.txt")])
          proc (.exec (Runtime/getRuntime) cmd)]
      (with-open [br (BufferedReader. (InputStreamReader. (.getInputStream proc)))]
        (println (str/join "\n" (line-seq br))))))

  ;; Alternative 2: shell out with clojure.java.shell/sh.
  (println (:out (sh (str res "word2phrase.exe")
                     "-train"  (str res "train.txt")
                     "-output" (str res "output/out.txt"))))

  ;; sh uses futures internally; exit explicitly so the JVM does not linger.
  (System/exit 0))

shark8me commented 8 years ago

Thanks Pragati,

I feel the "right thing to do" is to implement it in Medallia's Java implementation, and then add a wrapper for it in this project. Compiling the C code is a good solution for "just making it work". However, it is likely to be fragile: there are many C compilers, which may give different errors, and a C compiler may be missing in the first place.
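To make that fragility concrete, here is a minimal sketch of how the shell-out could at least fail loudly. It assumes the compiled binary lives at a path supplied by the caller, and it uses only the -train and -output flags that appear in the snippet above. The function name `run-word2phrase` and the keys of the returned map are hypothetical, not something this project provides.

```clojure
(require '[clojure.java.io :as io]
         '[clojure.java.shell :refer [sh]])

(defn run-word2phrase
  "Hypothetical helper: runs the word2phrase binary at `exe` and returns
  {:ok? true :out ...} on success or {:ok? false :error ...} otherwise."
  [exe train-file output-file]
  (let [f (io/file exe)]
    (cond
      ;; The binary was never compiled, or was put somewhere else.
      (not (.isFile f))
      {:ok? false :error (str "executable not found: " exe)}

      ;; The file exists but cannot be run (e.g. missing execute bit).
      (not (.canExecute f))
      {:ok? false :error (str "file is not executable: " exe)}

      :else
      (let [{:keys [exit out err]}
            (sh exe "-train" train-file "-output" output-file)]
        (if (zero? exit)
          {:ok? true :out out}
          {:ok? false
           :error (str "word2phrase exited with status " exit ": " err)})))))
```

A call such as `(run-word2phrase "resources/word2phrase.exe" "resources/train.txt" "resources/output/out.txt")` would then report a missing or broken binary instead of printing empty output. This does not remove the dependency on a C compiler; it only makes the failure visible.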

Prog19 commented 8 years ago

Agreed, Kiran! This sure is a dirty fix.