ChenghaoMou / text-dedup

All-in-one text de-duplication
Apache License 2.0

ModuleNotFoundError: No module named 'text_dedup.embedders' #4

Closed: done520 closed this issue 1 year ago

done520 commented 2 years ago

ModuleNotFoundError: No module named 'text_dedup.embedders'

when "from text_dedup.embedders.minhash import MinHashEmbedder"

ChenghaoMou commented 2 years ago

Can I ask what version of text_dedup you are using?

If it is installed from PyPI, this shouldn't be an issue. But if you are using the main branch, there are some breaking changes that the documentation hasn't caught up with yet; in that case, you can write:

from text_dedup.near_dedup import MinHashEmbedder

The documentation should be updated in the next couple of days.
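For illustration, here is a minimal, self-contained sketch of the MinHash idea that MinHashEmbedder is based on. This is not text_dedup's own API; the shingling scheme, the salted-hash simulation of permutations, and the parameter values below are all illustrative assumptions:

import hashlib

def shingles(text: str, k: int = 3) -> set[str]:
    # Overlapping word k-grams; the unit of comparison for Jaccard similarity.
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash_signature(text: str, num_perm: int = 128) -> list[int]:
    # One minimum hash value per "permutation", simulated here by salting MD5
    # with the permutation index instead of using true random permutations.
    sig = []
    for seed in range(num_perm):
        sig.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles(text)
        ))
    return sig

def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    # The fraction of matching signature slots estimates the Jaccard
    # similarity of the two documents' shingle sets.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc_a = "near duplicate detection with minhash signatures for large corpora"
doc_b = "near duplicate detection using minhash signatures for big corpora"
print(estimated_jaccard(minhash_signature(doc_a), minhash_signature(doc_b)))

Two near-duplicate documents share most shingles, so their signatures agree in most positions; in a real pipeline the signatures would then be bucketed with LSH rather than compared pairwise.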

done520 commented 2 years ago

Thanks. I would like to ask whether text_dedup can be used for academic paper duplicate testing. Have you ever tried that?

ChenghaoMou commented 2 years ago

By 'academic paper duplicate testing', can you clarify what you mean exactly?

  1. Deduplicating data when the data are academic papers
  2. Using it in a research paper to test other datasets

done520 commented 2 years ago

@ChenghaoMou By 'academic paper duplicate testing' I mean the same as duplicate checking of a graduation thesis, i.e., determining whether the article is plagiarized. I would also like to know the performance if you have used it for academic paper duplicate testing.

ChenghaoMou commented 2 years ago

> By 'academic paper duplicate testing' I mean the same as duplicate checking of a graduation thesis, i.e., determining whether the article is plagiarized. I would also like to know the performance if you have used it for academic paper duplicate testing.

Here are my two cents:

  1. Plagiarism detection can be different from deduplication (what is duplicated might not be plagiarized), especially since a paper can legitimately contain direct quotes, paraphrasing or summarization in the literature review, lists of citations, and so on.
  2. Assuming you figure out a way to strip that noise from a paper, you would still need a large number of papers indexed to perform plagiarism detection effectively. And papers are usually much longer than internet articles or typical NLP training inputs, so I would expect longer processing times as a result.

With that said, if I had a cleaned, large set of papers, I would start with exact substring dedup, then near dedup, and finally semantic dedup, each requiring an increasing amount of compute, and see where that leads me; a rough sketch of the staged approach follows.
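As a hypothetical sketch of that staged approach on a toy corpus (the helper names, normalization, and the 0.8 threshold are my own assumptions, not text_dedup functionality), the first two stages could look like this:

import hashlib

def normalize(text: str) -> str:
    return " ".join(text.lower().split())

def exact_dedup(docs: list[str]) -> list[str]:
    # Stage 1: drop byte-identical documents after light normalization.
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.md5(normalize(doc).encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

def jaccard(a: str, b: str) -> float:
    wa, wb = set(normalize(a).split()), set(normalize(b).split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def near_dedup(docs: list[str], threshold: float = 0.8) -> list[str]:
    # Stage 2: drop documents that heavily overlap an already kept one.
    # A real near-dedup pass would use MinHash + LSH instead of this O(n^2)
    # pairwise comparison; this only shows where the stage fits.
    kept = []
    for doc in docs:
        if all(jaccard(doc, other) < threshold for other in kept):
            kept.append(doc)
    return kept

corpus = [
    "deep learning for text deduplication at scale",
    "Deep learning for text deduplication at scale",      # exact duplicate after lowercasing
    "deep learning for text deduplication at web scale",  # near duplicate
    "a survey of citation practices in graduation theses",
]
remaining = near_dedup(exact_dedup(corpus))
# Stage 3 (semantic dedup) would embed `remaining` with a sentence encoder
# and cluster by cosine similarity -- the most expensive step, run last.
print(remaining)

The point of the ordering is that each stage is cheaper than the next and shrinks the set the next stage has to process, so the expensive semantic stage only sees what survives the first two.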

paperClub-hub commented 2 years ago

Sorry for the late reply, and many thanks for your helpful suggestions and proposals.