dodona-edu / dolos

:detective: Source code plagiarism detection
https://dolos.ugent.be
MIT License

Request programming language support here #1029

Open rien opened 1 year ago

rien commented 1 year ago

If you want to use a programming language that Dolos does not support yet, please ask here! This helps us with prioritizing which programming languages we should focus on first.

We currently ship Dolos with the following programming languages:

If your programming language is not in the list of languages supported out of the box, there is a good chance that a tree-sitter parser already exists for it. If so, adding support for your language should be easy.

In any case, let us know which languages you want to use with Dolos!

BTWS2 commented 1 year ago

Adding support for HTML would be great, because the HTML judge allows for exercises with different solutions (e.g. add a title (doesn't matter which text), add at least 3 items to a list, ...).

rien commented 1 year ago

@BTWS2 a parser for HTML exists, but it wouldn't work well for plagiarism detection because it ignores tag names, attribute names, the exact content itself, ...

This is how tree-sitter converts an example HTML page: [image]

Especially if the underlying structure of the analyzed HTML files is expected to be very similar, using this parser would result in very high similarities. Using this parser, Dolos reports that the homepage of GitHub and the homepage of Dodona have a similarity of 88%.

I would prefer to stick to languages that work well with Dolos. However, if you want, you can try it out yourself by installing tree-sitter-html using npm or yarn. Dolos is able to automatically detect and use this parser if it is available.
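A minimal sketch of that try-out, assuming Dolos and the parser are installed side by side with npm and that the parser version is compatible with the tree-sitter version Dolos uses. The `run` and `try_html_parser` helpers and the file names are hypothetical; setting `DRY_RUN=1` only prints the commands.

```shell
# Print commands instead of running them when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Hypothetical helper: install Dolos and the HTML parser in the current
# project, then analyze the HTML files passed as arguments.
try_html_parser() {
  run npm install @dodona/dolos tree-sitter-html &&
  run npx dolos run -l html "$@"
}

# Usage (hypothetical file names):
#   try_html_parser submissions/*.html
```

Expect high similarity scores for structurally similar pages, for the reasons given above.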

BTWS2 commented 1 year ago

@rien No problem, thank you for the insight.

anilgulecha commented 1 year ago

@rien could there be support for simple text?

Regular assignments (like essays) are the use case I have in mind. Could the tokenizer be as simple as splitting on spaces, or on sentences delimited by "."?

rien commented 1 year ago

@anilgulecha Dolos is specifically made for plagiarism detection on source code. There are tools that should perform better on just text than Dolos.

That said, we do indeed have a tokenizer that splits on spaces, which you can use by passing --language char. However, in that case you might be better off (and faster) using the diff command or another tool that does string matching.
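Both options can be sketched as follows. The helper names and file names are hypothetical, and the Dolos call assumes that `dolos run` accepts the files to compare as arguments.

```shell
# Option 1: Dolos with the tokenizer selected by --language char.
compare_with_dolos() {
  dolos run --language char "$@"
}

# Option 2: plain line-based comparison of two files.
# diff prints the differing lines and exits with 0 when the files are
# identical and 1 when they differ.
compare_with_diff() {
  diff "$1" "$2"
}

# Usage (hypothetical file names):
#   compare_with_dolos essay1.txt essay2.txt
#   compare_with_diff essay1.txt essay2.txt
```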

anilgulecha commented 1 year ago

Thanks for the char recommendation. Will try it out.

yafuerst commented 1 year ago

I need support for the language Modelica. It is not supported by tree-sitter, but there are two parsers on GitHub: https://github.com/OpenModelica/tree-sitter-modelica and https://github.com/mtiller/modelica-tree-sitter

I managed to get them running using tree-sitter directly, but I have had no luck adding them to Dolos yet. Do you have any tips? I am on Linux, if that's important.

rien commented 1 year ago

@yafuerst Dolos will try to find the parser with a fitting name (tree-sitter-${name}) if you add the language with the -l option. It will look in the node_modules accessible to Dolos (local, per user, global).

If you've managed to get the parsers working on their own but Dolos still can't find them, you can try "installing" the parser for your user or globally with npm link or npm link --global. For the modelica-tree-sitter parser to be detected by Dolos, you will have to change its name to tree-sitter-modelica.
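A rough sketch of those steps, assuming an npm version that provides `npm pkg set` (otherwise edit the name in package.json by hand) and that the parser builds with a plain `npm install`. The helpers `parser_package` and `link_parser` are hypothetical names used for illustration.

```shell
# Dolos looks for a package called tree-sitter-<name>.
parser_package() {
  printf 'tree-sitter-%s\n' "$1"
}

# Hypothetical helper: clone a parser repository into a directory with the
# expected name, rename the package, build it and link it.
# The subshell keeps the caller's working directory unchanged.
link_parser() {
  pkg="$(parser_package "$1")"
  (
    git clone "$2" "$pkg" &&
    cd "$pkg" &&
    npm pkg set name="$pkg" &&
    npm install &&
    npm link
  )
}

# Usage:
#   link_parser modelica https://github.com/mtiller/modelica-tree-sitter
#   dolos run -l modelica *.mo
```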

Let me know if it doesn't work and we'll figure it out.

alexey-sh commented 11 months ago

What about Vue and React?

rien commented 11 months ago

@alexey-sh since those frameworks combine multiple languages (template syntax, CSS, HTML, ...), tree-sitter does not handle them out of the box, so some additional work is required for them to work.

In addition, since HTML and CSS often have a lot of common code fragments between submissions, Dolos isn't very good at detecting plagiarism in them (you get a lot of false positives).

However, we do plan on changing Dolos under the hood to support these kinds of languages in the future!

DhruvDh commented 10 months ago

Hi, I am running Dolos with the following versions:

Dolos v2.3.0
Node v18.16.0
Tree-sitter v0.20.1

npm only has tree-sitter-java@0.19.1, and it seems Dolos cannot find it because of this. Is there a workaround? I have tried installing it locally and globally with pnpm and npm.

rien commented 10 months ago

@DhruvDh with the way we currently integrate tree-sitter languages, we will have to wait for tree-sitter-java to publish a new release. Someone recently opened an issue asking the maintainers of that parser to create a new release. Let's hope they publish it soon: https://github.com/tree-sitter/tree-sitter-java/issues/163

As an alternative, you can try cloning this repository and updating the base tree-sitter version manually. However, that can be cumbersome.

We already have some ideas on how to avoid this problem in Dolos in the future (see #1028), but we have not started implementing this solution yet.

DhruvDh commented 10 months ago

I was able to solve this with the following. I am not confident I understand it correctly, so I won't attempt an explanation.

# a fork with package.json version set to 0.20.1
pnpm install DhruvDh/tree-sitter-java
pnpm rebuild
pnpm install @dodona/dolos
pnpm exec dolos run info.csv

nachiket commented 3 months ago

Would you support Verilog? Thanks.

rien commented 3 months ago

Hi @nachiket, thanks for your suggestion! There is an official Verilog parser available, so this is definitely possible.

We'll put it on our schedule and will let you know when support for Verilog has landed.