add more "advanced" search options

milanwiedemann commented 3 years ago

Create categories: for example, if a user looks up the word "meal", the app could also look for the terms "food" and "drink".

ChrisBeeley commented 3 years ago

Are we thinking of doing this by hand or programmatically? There are tf-idf and word vector based approaches that spring to mind. Perhaps @andreassot10 could advise

andreassot10 commented 3 years ago

I thought that we could look at cosine similarities between word embeddings. As it turns out, it may not be that great a solution after all. That's because "meals" may be highly similar to many other words that are irrelevant to eating. Thus, words like "food" and "drink" wouldn't necessarily appear at the top of the similarity list (sorted in descending order).

There's a workaround though: we can manually specify the list of words that we believe are associated with the search word ("food" in this example) and return results in the following way. The results appearing first are the ones for the word with which "food" has the highest cosine similarity. Then follow the results for the word with the second highest similarity with "food", and so on. So what's returned would be based on a table that looks like this:

word      food
meals     0.9942232
meal      0.9700867
drink     0.9208235
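
As a rough illustration of the ranking idea above, here is a minimal base R sketch. The 3-dimensional vectors are invented for the example (real ones would come from a trained embedding model), and `cosine_similarity()` and `rank_related_words()` are hypothetical helper names, not part of any package.

```r
# Toy 3-dimensional embeddings, invented purely for illustration
# (real vectors would come from a trained model).
embeddings <- rbind(
  food  = c(0.9, 0.1, 0.2),
  meal  = c(0.8, 0.2, 0.1),
  drink = c(0.7, 0.1, 0.6),
  nurse = c(0.1, 0.9, 0.3)
)

# Cosine similarity between two numeric vectors
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Rank hand-picked related words by their similarity to the search word,
# most similar first
rank_related_words <- function(embeddings, search_word, related_words) {
  sims <- vapply(
    related_words,
    function(w) cosine_similarity(embeddings[search_word, ], embeddings[w, ]),
    numeric(1)
  )
  sort(sims, decreasing = TRUE)
}

ranked <- rank_related_words(embeddings, "food", c("drink", "meal"))
print(ranked)
```

With these toy vectors "meal" ranks above "drink", so results for "meal" would be shown first, then those for "drink".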

I've done some experimentation with the implementation of Facebook's StarSpace in R (ruimtehol). Download this data and run this:

library(magrittr)

text_data_starspace <- text_data

# Text data in StarSpace-friendly format
text_data_starspace <- text_data_starspace %>% 
  dplyr::mutate(
    feedback = feedback %>% 
      strsplit(., "\\W") %>% 
      purrr::map_chr(
        ~ paste(setdiff(.x, ""), collapse = " ")
      ) %>% 
      tolower()
  )

# Calculate word embeddings. Can be slow so set maxTrainTime to a few minutes (in seconds).
# Model will probably be rubbish, but should give you an idea.
model_wordspace <- ruimtehol::embed_wordspace(x = text_data_starspace$feedback, 
                                              model = "wordspace.bin",
                                              early_stopping = 0.8,
                                              validationPatience = 10,
                                              dim = 50,
                                              lr = 0.01, 
                                              epoch = 60, 
                                              loss = "softmax", 
                                              adagrad = TRUE, 
                                              similarity = "cosine", 
                                              negSearchLimit = 50,
                                              ngrams = 5, 
                                              minCount = 5,
                                              maxTrainTime = 3 * 60)

plot(model_wordspace)

# Matrix of word vectors
wordvectors <- as.matrix(model_wordspace)

# Data frame of cosine similarities between all word vectors
word_similarities <- wordvectors %>% 
  ruimtehol::embedding_similarity(wordvectors) %>% 
  as.data.frame() %>% 
  tibble::rownames_to_column() %>% 
  dplyr::rename(word = rowname)

# Table of cosine similarities between search word and possibly related words
word1 <- 'food'
word2 <- c('meal', 'meals', 'drink')
corr_threshold <- 0.7
word_similarities %>% 
  # all_of() makes it explicit that word1 holds a column name
  dplyr::select(word, dplyr::all_of(word1)) %>% 
  dplyr::filter(
    .data[[word1]] >= corr_threshold,
    !word %in% tidytext::stop_words$word,
    !word %in% word1,
    word %in% word2
  ) %>% 
  # Most similar words first
  dplyr::arrange(dplyr::desc(.data[[word1]]))

Note that this is early days and there's probably a much better way of tackling this issue. It smells like Python to me!

@ChrisBeeley, you said some approaches spring to mind. It would be good to share any links with us.

andreassot10 commented 3 years ago

Package tm could also be useful? https://fredgibbs.net/tutorials/document-similarity-with-r.html

ChrisBeeley commented 3 years ago

Nothing clever particularly. Just the word vector thing you mentioned, and also maybe looking at tf-idf values within particular categories (e.g. the words with the top 5 tf-idf from the "food" theme). In general I guess it would be fairly easy to pick up certain words with high tf-idf in each theme and bring back the whole theme for them (they could obviously turn this off because it would be quite over inclusive).

Could tweak it a bit I suppose depending on how much comes back- if there isn't a lot you could scrape the barrel a bit, like Google does with searches.
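
To make the tf-idf idea concrete, here is a small base R sketch that treats each theme as one "document" and pulls out its top-scoring words. The comments and theme tags are made up for illustration, `top_words()` is a hypothetical helper, and the plain `tf * log(N / df)` weighting is just one common variant.

```r
# Tiny made-up corpus: one row per comment, tagged with a theme
comments <- data.frame(
  theme = c("Food", "Food", "Staff", "Staff"),
  feedback = c(
    "the meal was cold",
    "the food and the drink were lovely",
    "the nurse was kind",
    "the staff were kind and the nurse listened"
  ),
  stringsAsFactors = FALSE
)

# Treat each theme as one "document": count words per theme
tokens <- strsplit(tolower(comments$feedback), "\\W+")
counts <- unclass(table(
  theme = rep(comments$theme, lengths(tokens)),
  word = unlist(tokens)
))

# tf = share of the theme's words; idf = log(n themes / n themes containing the word)
tf <- counts / rowSums(counts)
idf <- log(nrow(counts) / colSums(counts > 0))
tf_idf <- sweep(tf, 2, idf, `*`)

# Top n words for a theme by tf-idf
top_words <- function(tf_idf, theme, n = 3) {
  names(sort(tf_idf[theme, ], decreasing = TRUE))[seq_len(n)]
}

top_words(tf_idf, "Staff", n = 2)
```

Words that occur in every theme (such as "the") get a tf-idf of zero, so only theme-specific words like "nurse" surface as candidates for triggering a theme.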

andreassot10 commented 3 years ago

> Nothing clever particularly. Just the word vector thing you mentioned, and also maybe looking at tf-idf values within particular categories (e.g. the words with the top 5 tf-idf from the "food" theme). In general I guess it would be fairly easy to pick up certain words with high tf-idf in each theme and bring back the whole theme for them (they could obviously turn this off because it would be quite over inclusive).
>
> Could tweak it a bit I suppose depending on how much comes back- if there isn't a lot you could scrape the barrel a bit, like Google does with searches.

I'm confused.

First of all, by "theme" do you mean the code (Access, Miscellaneous etc.)?

Second, what do you mean by "[...] bring back the whole theme [...]?"

I need a clear explanation of what you two are after.

ChrisBeeley commented 3 years ago

> First of all, by "theme" you mean the code (Access, Miscellaneous etc.)?

Yes

> Second, what do you mean by "[...] bring back the whole theme [...]?"

Bring back everything tagged to that theme. If they search the word "nurse" they get back the whole "staff" theme.
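
A minimal sketch of what "bringing back the whole theme" could look like, in base R. The comments, the `word_themes` lookup and the `search_comments()` function are all hypothetical; the `expand` switch and `min_hits` cut-off are assumptions reflecting the "turn this off" and "scrape the barrel" remarks earlier in the thread.

```r
# Made-up tagged comments, for illustration only
comments <- data.frame(
  theme = c("Staff", "Staff", "Access", "Staff"),
  feedback = c(
    "The nurse was very kind",
    "Doctors explained everything clearly",
    "Parking was difficult",
    "Reception staff were helpful"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical word-to-theme lookup; in practice this could come from
# high tf-idf words or word/label similarities above a threshold
word_themes <- list(nurse = "Staff", parking = "Access")

# Return comments containing the word; if fewer than `min_hits` are found
# and `expand` is TRUE, also bring back everything tagged with the word's theme
search_comments <- function(comments, word, word_themes,
                            expand = TRUE, min_hits = 2) {
  word <- tolower(word)
  direct <- grepl(word, tolower(comments$feedback), fixed = TRUE)
  themes <- word_themes[[word]]
  if (expand && sum(direct) < min_hits && !is.null(themes)) {
    return(comments[direct | comments$theme %in% themes, ])
  }
  comments[direct, ]
}

search_comments(comments, "nurse", word_themes)                  # whole Staff theme
search_comments(comments, "nurse", word_themes, expand = FALSE)  # direct hit only
```

Here a search for "nurse" has only one direct hit, so with expansion on it returns all three comments tagged "Staff"; with expansion off it returns just the one comment that mentions a nurse.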

Incidentally, we have talked about fitting models for some of the subthemes: "food" (from "Environment/facilities") might be a good candidate for this. I imagine the tf-idf would be reasonably different for food subthemes than for the rest of the Environment/facilities category.

andreassot10 commented 3 years ago

Thanks @ChrisBeeley.

Before delving into subthemes like "Food" from "Environment/facilities", I thought it'd be a good idea to show you a process for relating words to themes that you may find useful. It uses ruimtehol again, only it builds a supervised model this time:

library(magrittr)

text_data_starspace <- text_data # Download from https://github.com/CDU-data-science-team/pxtextminingdashboard/blob/master/data/text_data.rda

# Text data in StarSpace-friendly format
text_data_starspace <- text_data_starspace %>% 
  dplyr::mutate(
    feedback = feedback %>% 
      strsplit(., "\\W") %>% 
      purrr::map_chr(
        ~ paste(setdiff(.x, ""), collapse = " ")
      ) %>% 
      tolower(),
    label = label %>% 
      as.character() %>%  
      strsplit(split = ",") %>% 
      purrr::map(~ gsub(" ", "-", .x))
  )

# Build supervised model
model_supervised <- ruimtehol::embed_tagspace(x = text_data_starspace$feedback, y = text_data_starspace$label,
                                   early_stopping = 0.8,
                                   validationPatience = 10,
                                   dim = 50,
                                   lr = 0.01, 
                                   epoch = 60, 
                                   loss = "softmax", 
                                   adagrad = TRUE, 
                                   similarity = "cosine", 
                                   negSearchLimit = 50,
                                   ngrams = 5, 
                                   minCount = 5)

plot(model_supervised)

# Dictionary (we won't be needing it- I'm just demonstrating it can be done)
dict <- ruimtehol::starspace_dictionary(model_supervised)
str(dict)

# Get embeddings of the dictionary of words as well as the categories
embedding_words <- as.matrix(model_supervised, type = "words")
embedding_labels <- as.matrix(model_supervised, type = "labels")

# Find correlations between words and themes
corr_threshold <- 0.7
words <- c('nurse', 'ward')
embedding_labels %>% 
  ruimtehol::embedding_similarity(embedding_words) %>% 
  as.data.frame() %>%
  tibble::rownames_to_column() %>% 
  dplyr::rename(label = rowname) %>% 
  dplyr::mutate(label = sub("__label__", "", label)) %>% 
  tidyr::pivot_longer(cols = -1, names_to = "word") %>%
  dplyr::filter(
    !word %in% tidytext::stop_words$word,
    word %in% words,
    value >= corr_threshold
  ) %>% 
  dplyr::group_by(label) %>% 
  dplyr::arrange(word, dplyr::desc(value))

# A tibble: 10 x 3
# Groups:   label [7]
#   label                   word  value
#   <chr>                   <chr> <dbl>
# 1 Staff                   nurse 0.961
# 2 Care-received           nurse 0.841
# 3 Dignity                 nurse 0.756
# 4 Staff                   ward  0.948
# 5 Care-received           ward  0.902
# 6 Dignity                 ward  0.841
# 7 Environment/facilities  ward  0.789
# 8 Access                  ward  0.762
# 9 Transition/coordination ward  0.758
#10 Communication           ward  0.758

ChrisBeeley commented 3 years ago

This looks really helpful. I can't process anything else before Wednesday, can we discuss on a call some time? Maybe when we're testing the Python pipeline.

It looks at first glance as though the > 0.7 threshold is not discriminating very well, but > 0.9 would.
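
A quick way to check this against the numbers posted above, as a base R sketch. The similarity values are copied from the output table in the previous comment; `themes_per_word()` is a hypothetical helper.

```r
# Word/theme similarities copied from the output posted above
similarities <- data.frame(
  label = c("Staff", "Care-received", "Dignity",
            "Staff", "Care-received", "Dignity", "Environment/facilities",
            "Access", "Transition/coordination", "Communication"),
  word = c(rep("nurse", 3), rep("ward", 7)),
  value = c(0.961, 0.841, 0.756,
            0.948, 0.902, 0.841, 0.789, 0.762, 0.758, 0.758),
  stringsAsFactors = FALSE
)

# Number of themes each word maps to when keeping values above a threshold
themes_per_word <- function(similarities, threshold) {
  kept <- similarities[similarities$value > threshold, ]
  table(factor(kept$word, levels = unique(similarities$word)))
}

themes_per_word(similarities, 0.7)  # nurse: 3 themes, ward: 7 themes
themes_per_word(similarities, 0.9)  # nurse: 1 theme,  ward: 2 themes
```

At > 0.9, "nurse" maps only to Staff and "ward" only to Staff and Care-received, whereas at > 0.7 "ward" maps to seven of the themes.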

Let's bring this and related matters up for discussion on Wednesday.

The-Strategy-Unit / experiencesdashboard

add more "advanced" search options #12