bnosac / ruimtehol

R package to Embed All the Things! using StarSpace
Mozilla Public License 2.0

Text Similarity #3

Closed deepanshu88 closed 5 years ago

deepanshu88 commented 5 years ago

I am trying to calculate text similarity between sentences. I have a standardized list of medical services, each described by a short text (e.g. "consultation of neurologist"). Every hospital or clinic comes with its own service list, so I need to map each hospital's list onto the standardized one. Currently I compute the TF-IDF cosine similarity between each hospital service and the standardized services, using skip-gram tokens.

I have been doing this for a long time, so I also have the correct mapping of services for some 15 hospitals. By 'correct mapping' I mean that medical experts from my organization corrected the services that the TF-IDF cosine similarity algorithm had labelled or mapped wrongly. I want to use this 'correct mapping' as training data for a text classification problem, but the number of labels in this case is more than 10K. Is there a way to perform 'supervised text similarity'?

I tried the ruimtehol package with trainMode = 3 in the starspace function to calculate similarity, but without success. I get the error "Please check: is the file empty? Do the examples contain proper feature and label according to the trainMode".

See the example of my datasets below (consider A as the 'standardized service list', B as the 'hospital's service list', and C as the 'correct mapping').

A <- data.frame(name= c("Patient had X-ray right leg arteries.",
                         "Subject was administered Rgraphy left shoulder",
                         "Exam consisted of x-ray leg arteries",
                         "Patient administered x-ray leg with 20km distance."),
                row.names = paste0("A", 1:4), stringsAsFactors = FALSE)
B <- data.frame(name= c("Patient had X-ray left leg arteries",
                         "Rgraphy right shoulder given to patient",
                         "X-ray left shoulder revealed nothing sinister",
                         "Rgraphy right leg arteries tested"),
                row.names = paste0("B", 1:4), stringsAsFactors = FALSE)

C <- data.frame(name= c("Patient had X-ray right leg arteries.",
                         "Subject was administered Rgraphy left shoulder",
                         "Exam consisted of x-ray leg arteries",
                         "Patient administered x-ray leg with 20km distance."),
                mapping = c("Radiography right leg artery.",
                            "Radiography left shoulder",
                            "Radiography leg arteries",
                            "Radiography leg with more than 10km distance."),
                row.names = paste0("A", 1:4), stringsAsFactors = FALSE)
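For reference, the TF-IDF cosine baseline described above can be sketched in base R roughly as follows. This is only a minimal illustration, not the code I actually run: it uses plain lowercase word tokens instead of skip-gram tokens, a smoothed IDF of my own choosing, and two small made-up lists (`standard`, `hospital`) rather than the data frames above.

```r
# Hypothetical toy lists; names act as service identifiers
standard <- c(A1 = "Radiography right leg artery",
              A2 = "Radiography left shoulder",
              A3 = "Consultation of neurologist")
hospital <- c(B1 = "radiography of the left shoulder",
              B2 = "neurologist consultation")

# Lowercase word unigrams (a simplification of skip-gram tokens)
tokenize <- function(x) {
  tokens <- strsplit(tolower(x), "[^a-z0-9]+")
  lapply(tokens, function(tok) tok[nzchar(tok)])
}

# Term-count matrix: one row per text, one column per vocabulary term
count_matrix <- function(tokens, vocab) {
  m <- t(vapply(tokens,
                function(tok) as.numeric(table(factor(tok, levels = vocab))),
                numeric(length(vocab))))
  colnames(m) <- vocab
  m
}

tok_std  <- tokenize(standard)
tok_hosp <- tokenize(hospital)
vocab    <- sort(unique(unlist(c(tok_std, tok_hosp))))

counts_std  <- count_matrix(tok_std, vocab)
counts_hosp <- count_matrix(tok_hosp, vocab)

# Inverse document frequency over both lists; the "+ 1" keeps every weight positive
all_counts <- rbind(counts_std, counts_hosp)
idf <- log(nrow(all_counts) / colSums(all_counts > 0)) + 1

tfidf_std  <- sweep(counts_std,  2, idf, `*`)
tfidf_hosp <- sweep(counts_hosp, 2, idf, `*`)

# Cosine similarity = dot product of L2-normalised rows
l2_normalize <- function(m) m / sqrt(rowSums(m^2))
similarity <- l2_normalize(tfidf_hosp) %*% t(l2_normalize(tfidf_std))

# For each hospital service, pick the most similar standardized service
best_match <- colnames(similarity)[apply(similarity, 1, which.max)]
names(best_match) <- rownames(similarity)
print(round(similarity, 3))
print(best_match)
```

The approach is unsupervised: the expert corrections in C play no role in it, which is why I am looking for a supervised alternative.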

See the sample code I am using for calculating similarity. It works when trainMode = 0 but not when it is set to 3.

library(ruimtehol)
library(fastrtext)
data(train_sentences, package = "fastrtext")

filename <- tempfile()
writeLines(text = paste(paste0("__label__", train_sentences$class.text),  tolower(train_sentences$text)),
           con = filename)

model <- starspace(file = filename, 
                   trainMode = 0, label = "__label__", 
                   similarity = "dot", verbose = TRUE, initRandSd = 0.01, adagrad = FALSE, 
                   ngrams = 1, lr = 0.01, epoch = 5, thread = 20, dim = 10, negSearchLimit = 5, maxNegSamples = 3)
k <- predict(model, paste("We developed a two-level machine learning approach that in the first level considers two different",
                          "properties important for protein-protein binding derived from structural models of V3 and V3 sequences."))

k$prediction[1,]
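One possible cause of the error, as far as I understand the StarSpace documentation: trainMode = 3 learns from collections of similar sentences and expects each training line to hold several sentences separated by tabs (the "labelDoc" file format), whereas the file above has one `__label__` per line in fastText format. The sketch below only prepares such a tab-separated file from expert-validated pairs. The names `correct_mapping`, `clean` and `pair_file` are made up for illustration, and the commented-out `starspace()` call at the end is an untested assumption about how the file would be passed on.

```r
# Expert-validated pairs: a service text and the standardized service it maps to
correct_mapping <- data.frame(
  name    = c("Patient had X-ray right leg arteries.",
              "Subject was administered Rgraphy left shoulder"),
  mapping = c("Radiography right leg artery.",
              "Radiography left shoulder"),
  stringsAsFactors = FALSE)

# Tabs separate sentences in this format, so collapse any whitespace
# (including stray tabs) inside each text to single spaces first
clean <- function(x) trimws(gsub("[[:space:]]+", " ", tolower(x)))

# One training example per line: <service text> TAB <standardized text>
pair_lines <- paste(clean(correct_mapping$name),
                    clean(correct_mapping$mapping),
                    sep = "\t")

pair_file <- tempfile(fileext = ".txt")
writeLines(pair_lines, con = pair_file)

# Assumption, not run here: train on the sentence pairs
# model <- starspace(file = pair_file, trainMode = 3, fileFormat = "labelDoc",
#                    similarity = "cosine", dim = 10, epoch = 5)
```

Whether this scales to more than 10K standardized services, or beats the TF-IDF baseline, is something I have not verified.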

I am open for suggestions in performing supervised text similarity. Any help would be highly appreciated!

jwijffels commented 5 years ago

Can you provide a reproducible example, please? I don't see one.

jwijffels commented 5 years ago

Closing as no reproducible example was given.