dodona-edu / dolos

:detective: Source code plagiarism detection
https://dolos.ugent.be
MIT License

Caching tokenization, fingerprinting and indexing details for a growing corpus #732

Open anilgulecha opened 2 years ago

anilgulecha commented 2 years ago

Hi Folks,

I came across dolos while searching for a code plagiarism checker, and you've implemented exactly the approach I'd consider. This is a very good library!

From what I can tell about performance, it seems the input (via CLI or server) is a list of N files, and the checking is roughly O(N^2).

For a very common use case (a growing corpus of incoming code samples for a live coding problem), this can be suboptimal. If the intermediate tokenization/fingerprint values and the index are saved for all previous samples, a new incoming (N+1)th sample only needs to be checked against the previous N, bringing the cost of every renewed run on the corpus down to O(N).

In terms of implementation, the values could be cached in a peer file next to every source file: if there's file.js, then file.js.dolo could hold the cached values, which the CLI can read. Another approach could be a SQLite file in the directory. This could be enabled by a --use-cache flag.
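To make the idea concrete, here is a rough sketch of what such a peer-file cache could look like. The `fingerprintFile` function and the `.dolo` naming are placeholders standing in for Dolos' tokenization/fingerprinting step, not part of its current API; the point is only that the expensive step runs once per file and is skipped on later runs:

```typescript
import { promises as fs } from "fs";

// Hypothetical: stands in for Dolos' tokenization + fingerprinting of one file.
declare function fingerprintFile(path: string): Promise<number[]>;

// Hypothetical peer-file cache: "<source file>.dolo" next to the source file.
async function cachedFingerprints(path: string): Promise<number[]> {
  const cachePath = `${path}.dolo`;
  const [srcStat, cacheStat] = await Promise.all([
    fs.stat(path),
    fs.stat(cachePath).catch(() => null),
  ]);

  // Reuse the cache only if it is at least as new as the source file.
  if (cacheStat && cacheStat.mtimeMs >= srcStat.mtimeMs) {
    return JSON.parse(await fs.readFile(cachePath, "utf8"));
  }

  const fingerprints = await fingerprintFile(path);
  await fs.writeFile(cachePath, JSON.stringify(fingerprints));
  return fingerprints;
}
```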

What is the dolos team's thought on this?

rien commented 2 years ago

Hi @anilgulecha

You have an excellent suggestion, and live code analysis is something we have had in the back of our minds while implementing Dolos. The Index stores the fingerprints together with a reference to the files matching each fingerprint (and the location(s) of the match in those files). This makes it possible, currently only in theory, to perform incremental analysis by re-using the index when a new file arrives: there is no need to redo the tokenization and fingerprinting of the previous N files, because that information is already there.

On a side note: the current implementation should lean towards O(N) since we're using a hash table to query the file tokens. However, as plagiarism increases this will evolve to an O(N²) behavior and collecting the final results tends to be a somewhat slow step as well.
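For illustration, here is a minimal sketch of how such a fingerprint index enables incremental analysis. The types and method names are assumptions for the sake of the example, not Dolos internals; the key point is that adding file N+1 only touches that file's own fingerprints, and the extra work per fingerprint grows with the number of earlier files already sharing it (which is why heavy overlap pushes the behavior back towards O(N²)):

```typescript
type FileId = string;
interface Occurrence { file: FileId; location: number; }

// Hypothetical persistent index: fingerprint hash -> occurrences in earlier files.
class FingerprintIndex {
  private index = new Map<number, Occurrence[]>();

  // Roughly O(F) in the number of fingerprints F of the new file, plus the
  // length of the occurrence lists it hits; with little overlap this stays
  // close to linear per added file.
  addFile(file: FileId, fingerprints: Array<{ hash: number; location: number }>) {
    const matches: Array<[Occurrence, Occurrence]> = [];
    for (const fp of fingerprints) {
      const occurrences = this.index.get(fp.hash) ?? [];
      for (const other of occurrences) {
        matches.push([other, { file, location: fp.location }]);
      }
      occurrences.push({ file, location: fp.location });
      this.index.set(fp.hash, occurrences);
    }
    return matches; // pairs of matching fragments to report
  }
}
```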

So yes: we do intend to support incremental analysis in the future. However, we are currently prioritizing UX/UI improvements and building a persistent server. The latter will also make the incremental analysis easier to implement.

Thank you for your input. If you have a specific use case you would like us to support (for example, integration with a learning environment), please do tell us. We are currently integrating support for our own exercise platform, Dodona.

anilgulecha commented 2 years ago

Thanks @rien. Yes, the specific use case is integration into a learning environment. I can offer engineering help to get this done. Are you able to provide a bit of time towards the approach, code review and merging in the changes?

rien commented 2 years ago

Most definitely. I have sent you an email to discuss further details.

rien commented 2 years ago

I've been thinking about this integration, and a few things need to be discussed or experimented with:

anilgulecha commented 2 years ago

1. We don't need to define it at the user level. As far as the dolos CLI is concerned, it can define a folder or namespace, and comparisons can span a namespace. That way end users of dolos can choose to fill a namespace per their needs.
2. Ideally the server supports these actions: createNamespace, addFile, addFiles, getReport. Internally, every added file is tokenized, indexed and saved, and the report is updated. getReport should be an O(1) operation; addFile is an O(N) operation, where N is the size of the namespace. (A sketch of such an interface follows below.)
3. The above structure will enable many flexible use cases without compromising the simplicity of the design.
4. I think SQLite makes sense as the store for the index.
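A sketch of what such a namespace-oriented API could look like, using the action names above. The report shape and the SQLite-backed store are assumptions for illustration, not an existing Dolos interface:

```typescript
// Hypothetical library-level interface; both the CLI and a server could wrap it.
interface DolosStore {
  createNamespace(name: string): Promise<void>;

  // Tokenize + fingerprint the file, merge it into the namespace's index
  // (stored e.g. in SQLite) and update the cached report: O(N) in the
  // namespace size in the worst case.
  addFile(namespace: string, path: string, content: string): Promise<void>;

  addFiles(namespace: string, files: Array<{ path: string; content: string }>): Promise<void>;

  // Return the precomputed report without re-analysing anything: O(1).
  getReport(namespace: string): Promise<Report>;
}

// Hypothetical report shape for the sake of the example.
interface Report {
  pairs: Array<{ left: string; right: string; similarity: number }>;
}
```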

Happy to discuss more if needed

anilgulecha commented 2 years ago

Also, I think those 4 actions should be added to the library itself. That way both the CLI and the server can benefit from them.
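For example, continuing the hypothetical DolosStore interface sketched above, the CLI and a persistent server would just be thin wrappers around the same library calls:

```typescript
// Hypothetical: whatever implementation backs the DolosStore interface above.
declare const store: DolosStore;

// The same code path serves a CLI invocation and a server endpoint.
async function handleNewSubmission(path: string, content: string) {
  await store.addFile("exercise-42", path, content);
  return store.getReport("exercise-42");
}
```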