bnosac / textrank

Summarise text by finding relevant sentences and keywords using the Textrank algorithm

A question about textrank_candidates_lsh #11

Closed: frank0434 closed this issue 3 years ago

frank0434 commented 3 years ago

Thanks for this great package.

I have a question about the minhash function used in textrank_candidates_lsh.

I want to rank 56K+ sentences. The time cost seems unbearable when using textrank_sentences directly, so I followed the instructions in the vignette and tried to reduce the number of candidate sentence pairs. However, the minhash seems to generate duplicated bucket hashes, which causes the internal merge in textrank_candidates_lsh to fail.
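This is roughly the setup, following the vignette (a minimal sketch; the `sentences`/`terminology` data frames and the `n`/`bands` values are placeholders for my own data and settings):

```r
library(textrank)
library(textreuse)

## Build a minhash function and use locality-sensitive hashing to
## propose candidate sentence pairs instead of comparing all pairs
minhash <- minhash_generator(n = 1000, seed = 123456789)
candidates <- textrank_candidates_lsh(x = terminology$lemma,
                                      sentence_id = terminology$textrank_id,
                                      minhashFUN = minhash,
                                      bands = 500)

## Rank sentences using only the LSH candidate pairs
tr <- textrank_sentences(data = sentences, terminology = terminology,
                         textrank_candidates = candidates)
```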

The duplicated buckets and the failing merge inside textrank_candidates_lsh:

```
> sentence_to_bucket[,.N, by = .(bucket)][N>1]
                                  bucket  N
     1: e85dc607460b46bd724b11b3cfb1acae  2
     2: dcb3fff4eaf711ec67e1edd1172a01c3  2
     3: 72b7e79c1172d5ece5e9c89d84ffd1b2  2
     4: ba44247c05f7c3c97ba4579e66b5dca9 11
     5: 3c1335b0bf9855848759145d74ca772f  2
    ---                                    
409592: 5bb1d93f2559d471878d4456170801f9  2
409593: 4d8991c632b5158dade7734d5aac6577  2
409594: c948c8f202acc7928fa17b5df61078c1  2
409595: b3b0344fc419746d5b66b90ad2b68a23  2
409596: 9a48fa57176af1e82ea0cfbc98cee834  2

candidates <- merge(sentence_to_bucket, sentence_to_bucket, 
                    by = "bucket", suffixes = c(".left", ".right"), all.x = TRUE, 
                    all.y = FALSE)
Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__,  : 
  Join results in 23882240 rows; more than 8198000 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE.
```

I'm wondering if there is something I can try?

Thank you in advance for any feedback.

jwijffels commented 3 years ago

Maybe you have duplicate sentences? Another approach might be to use a clustering algorithm (e.g. BTM / topicmodels) and apply textrank within each cluster.
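A minimal sketch of that duplicate check, assuming the `sentences` and `terminology` data frames from the vignette-style setup above:

```r
## Exact duplicate sentences hash into identical LSH buckets and blow up
## the candidate merge, so count and drop them before hashing
sum(duplicated(sentences$sentence))
sentences   <- sentences[!duplicated(sentences$sentence), ]
terminology <- terminology[terminology$textrank_id %in% sentences$textrank_id, ]
```

And a rough skeleton of the cluster-then-rank idea; the `cluster` column is hypothetical and could come from BTM or topicmodels:

```r
## Run textrank separately within each cluster of sentences
summaries <- lapply(split(sentences, sentences$cluster), function(chunk) {
  textrank_sentences(data = chunk,
                     terminology = terminology[terminology$textrank_id %in% chunk$textrank_id, ])
})
```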

frank0434 commented 3 years ago

Thanks for your reply. I did find some duplicated sentences. After subsetting to the first 500 sentences, the minhash ran smoothly and textrank was super quick. I guess I need to spend a bit more time on cleaning the data. Closing for now.
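For reference, the quick test that ran smoothly (a sketch; it reuses the `minhash` function and the placeholder data frame names from the sketches above):

```r
## Drop exact duplicates, keep the first 500 sentences, and rerun the pipeline
sentences   <- sentences[!duplicated(sentences$sentence), ]
ids         <- head(unique(sentences$textrank_id), 500)
small_sents <- sentences[sentences$textrank_id %in% ids, ]
small_terms <- terminology[terminology$textrank_id %in% ids, ]

candidates <- textrank_candidates_lsh(x = small_terms$lemma,
                                      sentence_id = small_terms$textrank_id,
                                      minhashFUN = minhash, bands = 500)
tr <- textrank_sentences(data = small_sents, terminology = small_terms,
                         textrank_candidates = candidates)
```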