bnosac / textrank

Summarise text by finding relevant sentences and keywords using the Textrank algorithm

textrank_candidates_lsh() function returns error "Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, :" #12

Closed · MatthewHaas closed 10 months ago

MatthewHaas commented 10 months ago

I'm working with a collection of about 1200 short texts. Some are a single sentence while others are a paragraph. They are mostly descriptions of scholarships and of who the recipients should be. Most of the texts contain language stating that preference will be given to certain students. The data below are fabricated for privacy reasons but are realistic. One of our main goals is to look for analyses beyond keyword search, so at this point it's just exploratory analysis.

I've gotten the function to work on other data I have access to and am impressed with how it's working. I don't fully understand the technical reasons why it might not be working here, but I do wonder if it's because I'm doing something it wasn't designed to do. The joboffer example was a single document of reasonable length. None of my texts are that long, so summarizing them individually makes little sense. Perhaps concatenating them the way I did isn't an appropriate way to apply the function to my texts.
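
For reference, udpipe() also accepts a data.frame with doc_id and text columns, which would keep each description as its own document instead of one concatenated string; a minimal sketch (not something I tried):

# Hypothetical alternative to concatenating: annotate each description
# as a separate document by passing a doc_id/text data.frame to udpipe()
anno <- udpipe(x = data.frame(doc_id = as.character(funds_abbr$NBR_TEXT),
                              text   = funds_abbr$TEXT),
               object = udmodel$language)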

Everything runs smoothly until I get to this chunk of code:

candidates <- textrank_candidates_lsh(x = terminology$lemma, sentence_id = terminology$textrank_id,
                                      minhashFUN = minhash,
                                      bands = 500)

The error message that I'm getting is:

Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, : Join results in 6728000 rows; more than 116000 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice.

I'm a little confused about how to overcome this, because the merge that seems to be causing the problem is being called internally by the function. The suggested by=.EACHI and allow.cartesian=TRUE arguments are not accepted by the function, which isn't too surprising. But I also haven't been able to pre-merge the data outside of the textrank_candidates_lsh() function using these suggestions. I did try asking ChatGPT about it, but its suggestion was to remove duplicates, and the solution was rather messy, so I wasn't able to integrate it.
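
As a quick sanity check (my own sketch, using the terminology object built in the full code below), counting duplicated textrank_id/lemma rows gives a sense of how much that internal join can expand:

# Count duplicated (textrank_id, lemma) rows; repeated key values are what
# make the internal data.table join blow up
sum(duplicated(terminology[, c("textrank_id", "lemma")]))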

Sample fabricated data (for privacy reasons): Although short, these data do faithfully reproduce the error.

funds_abbr <- data.table(
  NBR_TEXT = c(1234, 2345, 3456, 4567, 5678),
  TEXT = c(
    "The purpose of this scholarship is to help undergraduate or graduate students in the Computer Science department at the State University. Preference shall be given to students demonstrating financial need. The scholarship is renewable for 4 years, provided students remain in good academic standing.",
    "The purpose of this scholarship is to help undergraduate students at the State University earn their degree without taking on excessive student loan debt. Preference for this scholarship will be given to women.",
    "To provide scholarships to underrepresented students studying agriculture.",
    "The purpose of this scholarship is to help students studying Business Analytics at the Business School of the State University. Preference will be given to students who come from minority backgrounds that are traditionally underrepresented at the State University or in the corporate world.",
    "The purpose of this fund is to help first-generation undergraduate students studying American Studies at the State University. Preference will be given to students who come from minority backgrounds and who are traditionally underrepresented at the State University."
  )
)

The full code for reproducibility is below:

library(textrank)
library(tm)
library(tokenizers)
library(igraph)
library(proxy)
library(udpipe)
library(textreuse)

setwd("path/to/my/directory")

# The udmodel is stored in an Rdata file and accessed with the 'object = udmodel$language' argument in the udpipe() function
# I had to get special permission to access the internet to download the file, which I can't do each time I work on this type of analysis
load("english_udmodel.Rdata")

# My data are stored in an Rdata file as a data.table as well so I can easily knit the Rmarkdown file (but through a database when done interactively)
load("unique_text_purpose.Rdata")

# Retaining the NBR_TEXT and TEXT columns is an artefact of previous work to meet the requirements of the DocumentTermMatrix() function, but all I'm trying to do here is keep unique IDs and the associated text
funds_abbr <- data[, c("NBR_TEXT", "TEXT")]

# The sample data consist of only 5 texts; when I ran this on my machine, I used a 500-text subset
# (out of ~1200 texts), i.e. funds_abbr[1:500]$TEXT, which took ~2 hours to run
x <- paste(funds_abbr$TEXT, collapse = "\n")
x <- udpipe(x, object = udmodel$language)

# Process data
x$textrank_id <- unique_identifier(x, c("doc_id", "paragraph_id", "sentence_id"))
sentences <- unique(x[, c("textrank_id", "sentence")])
terminology <- subset(x, upos %in% c("NOUN", "ADJ"))
terminology <- terminology[, c("textrank_id", "lemma")]

tr <- textrank_sentences(data = sentences, terminology = terminology)

lsh_probability(h = 500, b = 500, s = 0.1) # A 10 percent Jaccard overlap will be detected well
minhash <- minhash_generator(n = 500, seed = 999)
terminology <- subset(x, upos %in% c("NOUN", "ADJ"), select = c("textrank_id", "lemma"))

candidates <- textrank_candidates_lsh(x = terminology$lemma, sentence_id = terminology$textrank_id,
                                      minhashFUN = minhash,
                                      bands = 500)
MatthewHaas commented 10 months ago

After posting this, I went through the closed issues and found that someone else had the same problem and that the likely cause was duplicate sentences. I'm not surprised by this, but the duplicated sentences are important for my application, so I wonder if removing them is my only option.

jwijffels commented 10 months ago

Isn't this because you took h = 500 and bands = 500 where h is not bigger than bands?
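
For illustration, lsh_probability() shows how much difference it makes whether each band holds a single hash (h equal to bands) or several hashes per band:

# With h = bands each band holds one hash, so any shared minhash value puts two
# sentences in the same bucket and almost every pair becomes a candidate
lsh_probability(h = 500, b = 500, s = 0.1)  # ~1: nearly every pair is a candidate
lsh_probability(h = 500, b = 100, s = 0.1)  # 5 hashes per band: ~0.001, far fewer candidates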

MatthewHaas commented 10 months ago

Thanks for the suggestion. Making h larger than bands worked for the fabricated data that I provided, but unfortunately it still fails when I apply it to my real data (though it does take longer to do so). I appreciate your help on this, but I don't expect you to dedicate too much time to troubleshooting it. I'll keep working on it and post an update if I figure it out.

lsh_probability(h = 100, b = 50, s = 0.1)
[1] 0.3949939
minhash <- minhash_generator(n = 100, seed = 999)
terminology <- subset(x, upos %in% c("NOUN", "ADJ"), select = c("textrank_id", "lemma"))
candidates <- textrank_candidates_lsh(x = terminology$lemma, sentence_id = terminology$textrank_id,
                                       minhashFUN = minhash,
                                       bands = 50)
jwijffels commented 10 months ago

Maybe if you run the code inside textrank_candidates_lsh line by line, you will see which sentences fall in the same bucket and be able to figure out which sentences are blocking your analysis here.

https://github.com/bnosac/textrank/blob/master/R/textrank.R#L53-L84

MatthewHaas commented 10 months ago

Thanks for the suggestion. While I wasn't successful at running the textrank_candidates_lsh() function line by line, I think I am closer to a resolution. Over the weekend, I found ~20 descriptions that were identical (e.g. "This description is not available."). Removing those did not overcome the error.

After reading your comment, I removed additional sentences that belong to unique descriptions but are themselves repetitive once all of the descriptions are combined. I wrote the code below to remove the duplicate sentences.

# Split a text into sentences using a simple regex-based tokenizer
split_into_sentences <- function(text) {
  unlist(strsplit(text, "(?<!\\w\\.\\w.)(?<![A-Z][a-z]\\.)(?<=\\.|\\?)\\s", perl = TRUE))
}

# This was a test
original_string <- "Your input string here. It contains multiple sentences. Some sentences might be repetitive. Some sentences might be repetitive. Repetition should be removed."

sentences <- split_into_sentences(y)
dt <- data.table(sentence = sentences)

unique_dt <- unique(dt)

result_string <- paste(unique_dt$sentence, collapse = ' ')

This successfully removes the duplicate sentences, but it didn't resolve the error.
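
Since the tokenizers package is already loaded, an equivalent way to do the same sentence-level de-duplication (a sketch, not what I actually ran) would be:

# Split into sentences with tokenizers, keep the first occurrence of each,
# and stitch the unique sentences back together
sentences_tok <- unlist(tokenizers::tokenize_sentences(funds_abbr$TEXT))
result_string <- paste(unique(sentences_tok), collapse = " ")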

I then tried to pinpoint where the error is coming from. Through an iterative process, it seems like the 750th (textrank_id, lemma) pair, i.e. the 750th row of the terminology object, is where the error begins. Using the code below, I can get the textrank_candidates_lsh() function to work. This is all of the code that follows the chunk above removing duplicate sentences.

y <- udpipe(result_string, object = udmodel$language)
y$textrank_id <- unique_identifier(y, c("doc_id", "paragraph_id", "sentence_id"))

sentences <- unique(y[, c("textrank_id", "sentence")])

terminology <- subset(y, upos %in% c("NOUN", "ADJ"))
terminology <- terminology[, c("textrank_id", "lemma")]
terminology <- unique(terminology)
terminology <- terminology[c(1:749),]

candidates <- textrank_candidates_lsh(x = terminology$lemma, sentence_id = terminology$textrank_id,
                                      minhashFUN = minhash,
                                      bands = 250)
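
The "iterative process" mentioned above could look roughly like this (a sketch only, not the exact code I ran):

# Sketch: grow the subset row by row until textrank_candidates_lsh() errors,
# to find the first offending row of the terminology object
# (run this on the full terminology object, before cutting it to 749 rows)
first_bad <- NA
for (n in 2:nrow(terminology)) {
  ok <- tryCatch({
    textrank_candidates_lsh(x = terminology$lemma[1:n],
                            sentence_id = terminology$textrank_id[1:n],
                            minhashFUN = minhash,
                            bands = 250)
    TRUE
  }, error = function(e) FALSE)
  if (!ok) {
    first_bad <- n
    break
  }
}
first_bad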

Looking at the surrounding textrank_id and lemma columns (using terminology[c(720:755),]), I found a few lemmas that I thought might be causing conflicts, but none of them are at the 750th row (e.g. textrank_id 94 has two similar lemmas, "Architecture" and "architecture", differing only by capitalization).
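
If those case-only variants matter, one simple normalization (which I have not applied here) would be to lowercase the lemmas before building the candidates:

# Collapse case-only lemma variants such as "Architecture" / "architecture"
terminology$lemma <- tolower(terminology$lemma)
terminology <- unique(terminology)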

I also looked at the results using tail(candidates). The code box below is what I get using the first 749 rows of the terminology object. I'm not sure if this is useful, but it makes me wonder whether, at the 750th row, a combination of textrank_id_1 and textrank_id_2 values appears that is not unique and is the reason for the error.

> tail(candidates)
     textrank_id_1 textrank_id_2
1462            39            48
1463             4            53
1464             4            68
1465            25            41
1466            13            38
1467            43             8

Do you think it's worthwhile exploring the terminology object to remove the offenders or is it better to keep my focus on the raw text?

jwijffels commented 10 months ago

I think you need to check the sentences where there are duplicates by taking the sentence_to_bucket object and looking at which sentence_id's (and the corresponding text) are mapped to the same bucket/band.

library(data.table)
x <- setDF(sentence_to_bucket)
dups <- x[duplicated(x[, c("bucket", "band")]), c("bucket", "band")]
dups <- paste(dups$bucket, dups$band, sep = "-")
subset(x, paste(bucket, band, sep = "-") %in% dups)
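
As a follow-up sketch, the colliding rows can be joined back to the sentences object built earlier to see the actual text behind them:

# Illustrative follow-up (assumes the 'sentences' data.frame from earlier):
# attach the sentence text to the sentence_id's that share a bucket/band
collisions <- subset(x, paste(bucket, band, sep = "-") %in% dups)
collisions$textrank_id <- as.integer(collisions$sentence_id)
merge(collisions, sentences, by = "textrank_id")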
MatthewHaas commented 10 months ago

Thank you for your continued assistance and patience.

I think I am having some difficulty figuring out what these lines are doing and how I can adapt them to my text. Somehow I arrived at a value of 6 for the number of rows, but for the life of me I can't remember how. examplehash seems to be a simple initialization here, since the string that is being passed to the minhash function (defined earlier with minhash_generator()) will ultimately be our text.

examplehash <- minhashFUN("detect the n in minhashFUN")
rows <- length(examplehash) / bands

In any case, once the sentence_to_bucket list object is initialized, my understanding is that my objective is to compare the elements of each list entry in order to identify duplicates.

This is the code from the package source:

 sentence_to_bucket <- split(x, sentence_id)
  sentence_to_bucket <- mapply(sentence_id = names(sentence_to_bucket), sentence_to_bucket, FUN=function(sentence_id, words){
    buckets <- data.table(sentence_id = sentence_id,
                          hash = minhashFUN(words),
                          band = hash_bands)
    buckets <- buckets[, list(bucket = digest::digest(object = list(hashes = hash, b = band[1]))), by = list(sentence_id, band)]
    buckets <- buckets[, c("sentence_id", "bucket"), with = FALSE]
    buckets
  }, SIMPLIFY = FALSE)
  sentence_to_bucket <- data.table::rbindlist(sentence_to_bucket)

I made the following changes so that I could run it as stand-alone code:

sentence_to_bucket <- data.table::rbindlist(sentence_to_bucket)
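
Running that chunk stand-alone also needs the objects that the function normally derives from its arguments; roughly (a sketch only, the exact values used may differ):

# Rough stand-alone prerequisites for the pasted chunk; names follow the
# function's arguments, and hash_bands only approximates the internals
x           <- terminology$lemma
sentence_id <- terminology$textrank_id
minhashFUN  <- minhash
bands       <- 50
examplehash <- minhashFUN("detect the n in minhashFUN")
rows        <- length(examplehash) / bands
hash_bands  <- rep(seq_len(bands), each = rows)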


This is the output, which doesn't look right to me:

sentence_to_bucket
      sentence_id band                           bucket
   1: sentence_id    1 9f46839ab9547d7db247335f9af87870
   2: sentence_id    2 b2e532b8e924fba4829530d6170381c5
   3: sentence_id    3 50e1d7457a6c77a409d3c5a79583a8f2
   4: sentence_id    4 62d61537deb8460ddb2cd5307a78d55b
   5: sentence_id    5 88b1940be19510f7ace2b64b55650390
  ---
 996:      bucket  496 444b20f8dc29d11b04a31012da078e9f
 997:      bucket  497 262693f8c79e1174025752f026ddf4f8
 998:      bucket  498 646ca60d8037b1a1386fcea0fb544131
 999:      bucket  499 6d91b87bd4f52805d252c255faf2b9c2
1000:      bucket  500 d3b9db650dbe1bca8e29503d90dfe6c5

I was able to find numerous duplicates using this code:

sentence_to_bucket <- split(terminology$lemma, terminology$textrank_id)
duplicates <- sentence_to_bucket[duplicated(sentence_to_bucket) | duplicated(sentence_to_bucket, fromLast = TRUE)]

So I think this confirms that there are duplicates that need to be filtered out, even if it doesn't lead me to the specific sentences.
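
A small follow-up sketch (not something I ran) to pull up the text behind those duplicated lemma sets:

# The names of the split are the textrank_id's, so they can be matched back
# to the sentence text
dup_ids <- as.integer(names(duplicates))
subset(sentences, textrank_id %in% dup_ids)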

jwijffels commented 10 months ago

The column sentence_id should contain the id's of the sentences, not the literal values 'sentence_id' and 'bucket'.

MatthewHaas commented 10 months ago

Thanks. I knew that was a problem but didn't know what I did to make that happen. I think it is resolved now. I'm not quite to the finish line, but I should get there tomorrow. I'll close it out then. I appreciate you helping me troubleshoot this.