MatthewHaas closed this issue 10 months ago.
After posting this, I went into the closed issues and found someone else had the same issue and the likely cause was duplicate sentences. I'm not surprised by this, but the duplicated sentences are important for my application, so I wonder if removing them is my only option.
Isn't this because you took h = 500 and bands = 500, where h is not bigger than bands?
Thanks for the suggestion. Making h larger than bands worked for the fabricated data that I provided, but unfortunately when I apply it to my real data it still fails (though it does take longer to do so). I appreciate your help on this, but I don't expect you to dedicate too much time to troubleshooting it. I'll keep working on it and post an update if I figure it out.
lsh_probability(h = 100, b = 50, s = 0.1)
[1] 0.3949939
minhash <- minhash_generator(n = 100, seed = 999)
terminology <- subset(x, upos %in% c("NOUN", "ADJ"), select = c("textrank_id", "lemma"))
candidates <- textrank_candidates_lsh(x = terminology$lemma, sentence_id = terminology$textrank_id,
minhashFUN = minhash,
bands = 50)
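As a side note, that 0.3949939 can be reproduced from the standard LSH band formula, where the number of rows per band is h / b (a quick check outside the package):

# P(pair becomes a candidate) = 1 - (1 - s^r)^b, with r = h / b rows per band
h <- 100; b <- 50; s <- 0.1
1 - (1 - s^(h / b))^b
# [1] 0.3949939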
Maybe if you run the code which is in textrank_candidates_lsh line by line, you will see which sentences fall in the same bucket in order to figure out and understand which sentences are blocking your analysis here.
https://github.com/bnosac/textrank/blob/master/R/textrank.R#L53-L84
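For example, one way to follow the function line by line without copying the source is base R's debugger, reusing the argument values shown earlier (a sketch):

library(textrank)
# Pause inside textrank_candidates_lsh() on its next call, so that intermediate
# objects such as sentence_to_bucket can be inspected step by step
debugonce(textrank_candidates_lsh)
candidates <- textrank_candidates_lsh(x = terminology$lemma, sentence_id = terminology$textrank_id,
                                      minhashFUN = minhash,
                                      bands = 50)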
Thanks for the suggestion. While I wasn't successful at running the textrank_candidates_lsh() function line by line, I think I am closer to a resolution. Over the weekend, I found ~20 descriptions that were identical (e.g. "This description is not available."). Removing those did not overcome the error.
After reading your comment, I removed additional sentences that are part of unique descriptions, but are themselves repetitive (once combined with all of the other descriptions). I wrote the code below to remove the duplicate sentences.
library(data.table)

split_into_sentences <- function(text) {
  # Split on sentence-ending punctuation followed by whitespace,
  # while skipping abbreviation-like patterns
  sent_tokenizer <- function(text) {
    unlist(strsplit(text, "(?<!\\w\\.\\w.)(?<![A-Z][a-z]\\.)(?<=\\.|\\?)\\s", perl = TRUE))
  }
  sent_tokenizer(text)
}

# This was a test string for the splitter
original_string <- "Your input string here. It contains multiple sentences. Some sentences might be repetitive. Some sentences might be repetitive. Repetition should be removed."

# Applied to the actual text (y), keeping only the unique sentences
sentences <- split_into_sentences(y)
dt <- data.table(sentence = sentences)
unique_dt <- unique(dt)
result_string <- paste(unique_dt$sentence, collapse = ' ')
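On the test string, for example, the repeated sentence drops out as expected:

test_sentences <- split_into_sentences(original_string)
paste(unique(test_sentences), collapse = " ")
# [1] "Your input string here. It contains multiple sentences. Some sentences might be repetitive. Repetition should be removed."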
This was successful at removing duplicate sentences, but it did not resolve the error.
I then tried to pinpoint where the error is coming from. Through an iterative process, it seems like the 750th pair is where the error begins. Using the code below, I can get the textrank_candidates_lsh() function to work. This is all of the code following the chunk above that removes duplicate sentences.
y <- udpipe(result_string, object = udmodel$language)
y$textrank_id <- unique_identifier(y, c("doc_id", "paragraph_id", "sentence_id"))
sentences <- unique(y[, c("textrank_id", "sentence")])
terminology <- subset(y, upos %in% c("NOUN", "ADJ"))
terminology <- terminology[, c("textrank_id", "lemma")]
terminology <- unique(terminology)
terminology <- terminology[c(1:749),]
candidates <- textrank_candidates_lsh(x = terminology$lemma, sentence_id = terminology$textrank_id,
minhashFUN = minhash,
bands = 250)
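For reference, the iterative narrowing described above can also be scripted. A rough sketch, where terminology_full is a hypothetical name for the unique terminology table before the [c(1:749), ] subset, and everything else reuses the objects from this chunk:

# Try increasing numbers of rows until textrank_candidates_lsh() first fails
for (n_rows in seq(700, nrow(terminology_full), by = 10)) {
  res <- tryCatch(
    textrank_candidates_lsh(x = terminology_full$lemma[1:n_rows],
                            sentence_id = terminology_full$textrank_id[1:n_rows],
                            minhashFUN = minhash,
                            bands = 250),
    error = function(e) e)
  if (inherits(res, "error")) {
    message("First failure within rows 1:", n_rows, " - ", conditionMessage(res))
    break
  }
}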
Looking at the surrounding textrank_id and lemma columns (using terminology[c(720:755), ]), I found a few lemmas that I thought might be causing conflicts, but none of them are at the 750th row (e.g. textrank_id 94 has two similar lemmas, "Architecture" and "architecture", differing only by capitalization).
I also looked at the results using tail(candidates). The code box below is what I get using the first 749 rows of the terminology object. I'm not sure if this is useful, but it makes me wonder whether, for the 750th row, there would be a combination of values for textrank_id_1 and textrank_id_2 that is not unique, and whether that is the reason for the error.
> tail(candidates)
textrank_id_1 textrank_id_2
1462 39 48
1463 4 53
1464 4 68
1465 25 41
1466 13 38
1467 43 8
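One way to test that hunch directly is to check the candidate pairs for repeats (a small sketch on the candidates object shown above):

# Treat each (textrank_id_1, textrank_id_2) pair as unordered and look for repeats
pair_key <- with(candidates, paste(pmin(textrank_id_1, textrank_id_2),
                                   pmax(textrank_id_1, textrank_id_2), sep = "-"))
any(duplicated(pair_key))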
Do you think it's worthwhile exploring the terminology object to remove the offenders, or is it better to keep my focus on the raw text?
I think you need to check the sentences where there are duplicates by taking the sentence_to_bucket object and looking at which sentence_id's (and the corresponding text) are mapped to the same bucket/band.
library(data.table)
x <- setDF(sentence_to_bucket)
dups <- x[duplicated(x[, c("bucket", "band")]), c("bucket", "band")]
dups <- paste(dups$bucket, dups$band, sep = "-")
subset(x, paste(bucket, band, sep = "-") %in% dups)
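From there the colliding rows can be joined back to the sentence text, for example with the sentences object built earlier from unique(y[, c("textrank_id", "sentence")]) (a sketch):

offenders <- subset(x, paste(bucket, band, sep = "-") %in% dups)
# sentence_id was taken from names(), so it is character; convert it back
# if textrank_id is numeric before merging
offenders$sentence_id <- as.integer(offenders$sentence_id)
merge(offenders, sentences, by.x = "sentence_id", by.y = "textrank_id")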
Thank you for your continued assistance and patience.
I think I am having some difficulty figuring out what these lines are doing and how I can adapt them to my text. Somehow I arrived at a value of 6 for the number of rows, but for the life of me I can't remember how. examplehash seems like a simple initialization here, since the string that is being passed to the minhash function (defined earlier/elsewhere using minhash_generator()) will ultimately be our text.
examplehash <- minhashFUN("detect the n in minhashFUN")
rows <- length(examplehash) / bands
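For what it's worth, rows here is just the hash length divided by bands, and the hash length equals the n passed to the generator. A quick check, assuming minhash_generator() from the textreuse package as in the textrank examples:

library(textreuse)
minhashFUN <- minhash_generator(n = 500, seed = 999)
examplehash <- minhashFUN("detect the n in minhashFUN")
length(examplehash)          # 500 hash values, one per hash function
length(examplehash) / 500    # rows per band when bands = 500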
In any case, at the initialization of the sentence_to_bucket list object, my understanding is that my objective is to compare the elements of each element in the list in order to identify duplicates.
This is the code from the package source linked above:
sentence_to_bucket <- split(x, sentence_id)
sentence_to_bucket <- mapply(sentence_id = names(sentence_to_bucket), sentence_to_bucket, FUN=function(sentence_id, words){
buckets <- data.table(sentence_id = sentence_id,
hash = minhashFUN(words),
band = hash_bands)
buckets <- buckets[, list(bucket = digest::digest(object = list(hashes = hash, b = band[1]))), by = list(sentence_id, band)]
buckets <- buckets[, c("sentence_id", "bucket"), with = FALSE]
buckets
}, SIMPLIFY = FALSE)
sentence_to_bucket <- data.table::rbindlist(sentence_to_bucket)
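For completeness, hash_bands is not defined in the snippet; in the linked source it is built just above these lines from bands and rows, roughly as:

# One band index repeated rows times for each band
hash_bands <- unlist(lapply(seq_len(bands), FUN = function(i) rep(i, times = rows)))

The hard-coded bands = 500 / rows = 6 version below mimics this.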
I made the following changes so that I could run it as stand-alone code:
- defined hash_bands using bands = 500 and rows = 6
- replaced x and sentence_id in the split() call with terminology$lemma and terminology$textrank_id
- named the second argument of mapply() as words so that it could be fed into the custom function that is defined inside of mapply()
- defined the minhash function with minhash <- minhash_generator(n = 500, seed = 999)
- retained the "band" column so that it could be used in dups <- which(duplicated(x[, c("bucket", "band")]))
hash_bands <- unlist(lapply(seq_len(500), FUN=function(i) rep(i, times = 6)))
sentence_to_bucket <- split(terminology$lemma, terminology$textrank_id)
sentence_to_bucket <- mapply(sentence_id = names(sentence_to_bucket), words = sentence_to_bucket, FUN=function(sentence_id, words){
buckets <- data.table(sentence_id = sentence_id,
hash = minhash(words),
band = hash_bands)
buckets <- buckets[, list(bucket = digest::digest(object = list(hashes = hash, b = band[1]))), by = list(sentence_id, band)]
buckets <- buckets[, c("sentence_id", "band", "bucket"), with = FALSE]
buckets
}, SIMPLIFY = FALSE)
sentence_to_bucket <- data.table::rbindlist(sentence_to_bucket)
This is the output, which doesn't look right to me:
sentence_to_bucket
      sentence_id band                           bucket
   1: sentence_id    1 9f46839ab9547d7db247335f9af87870
   2: sentence_id    2 b2e532b8e924fba4829530d6170381c5
   3: sentence_id    3 50e1d7457a6c77a409d3c5a79583a8f2
   4: sentence_id    4 62d61537deb8460ddb2cd5307a78d55b
   5: sentence_id    5 88b1940be19510f7ace2b64b55650390
  ---
 996:      bucket  496 444b20f8dc29d11b04a31012da078e9f
 997:      bucket  497 262693f8c79e1174025752f026ddf4f8
 998:      bucket  498 646ca60d8037b1a1386fcea0fb544131
 999:      bucket  499 6d91b87bd4f52805d252c255faf2b9c2
1000:      bucket  500 d3b9db650dbe1bca8e29503d90dfe6c5
I was able to find numerous duplicates using this code:
sentence_to_bucket <- split(terminology$lemma, terminology$textrank_id)
duplicates <- sentence_to_bucket[duplicated(sentence_to_bucket) | duplicated(sentence_to_bucket, fromLast = TRUE)]
So I think this confirms that there are duplicates that need to be filtered out, even if it doesn't lead me to the specific sentences.
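To get from those duplicated lemma vectors back to the offending sentence ids, one option is to key each sentence by its lemma sequence (a small sketch on the split object above):

# Collapse each sentence's lemma vector to a single key and group the ids
lemma_key <- sapply(sentence_to_bucket, paste, collapse = " ")
groups <- split(names(lemma_key), lemma_key)
# Sentence ids sharing an identical lemma sequence
groups[lengths(groups) > 1]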
The sentence_id column should contain the ids of the sentences, not the values 'sentence_id' and 'bucket'.
Thanks. I knew that was a problem but didn't know what I did to make that happen. I think it is resolved now. I'm not quite to the finish line, but I should get there tomorrow. I'll close it out then. I appreciate you helping me troubleshoot this.
I'm working with a collection of about 1200 short texts. Some are a single sentence while others are a paragraph. They are mostly descriptions of scholarships and who the recipients should be. Most of the texts contain language that preference will be given to certain students. The data below are fabricated for privacy reasons but are realistic. One of our main goals is to look for analyses beyond keyword search, so at this point it's just exploratory analysis.
I've gotten the function to work on other data that I have access to and am impressed with how it's working. I don't fully understand the technical reasons why it might not be working here, but I do wonder if it's because I'm doing something it wasn't designed to do. In the joboffer example, that was a single document of reasonable length. None of my texts are that long, so summarizing them individually makes little sense. Perhaps concatenating them in the way that I did isn't an appropriate method for applying the function to my texts.
Everything runs smoothly until I get to this chunk of code:
The error message that I'm getting is:
I'm a little confused about how to overcome this because the merge that seems to be causing the problem is called internally by the function. The suggestions to use by=.EACHI or allow.cartesian=TRUE are not accepted by the function, which isn't too surprising. But I also haven't been able to pre-merge the data outside of the textrank_candidates_lsh() function using these suggestions. I did try asking ChatGPT about it, but its suggestion was to remove duplicates and the solution was rather messy, so I wasn't able to integrate it.
Sample fabricated data (for privacy reasons): although short, these data do faithfully reproduce the error.
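For context on where those suggestions come from: data.table refuses a join whose result grows beyond the two inputs combined, which is what repeated key values (such as duplicated sentences hashed to the same bucket) can cause. A minimal, package-independent illustration of the same error:

library(data.table)
# Both tables repeat the join key, so the keyed join would return 3 x 3 = 9 rows,
# more than nrow(x) + nrow(i) = 6, and data.table stops with the
# 'Join results in ... allow.cartesian' error
a <- data.table(bucket = rep("x", 3), id_1 = 1:3, key = "bucket")
b <- data.table(bucket = rep("x", 3), id_2 = 4:6, key = "bucket")
a[b]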
The full code for reproducibility is below: