Implement preprocessing passes over lexed assembly

Project-Fungus / fungus-cli

Command-line plagiarism detection tool for ARMv7 assembly.

MIT License


Implement preprocessing passes over lexed assembly #8

Open Gadiguibou opened 1 year ago

Gadiguibou commented 1 year ago

[x] Whitespace removal
[x] Comment removal
[ ] Register renaming
[ ] Label renaming (?)
[x] ~~Symbol case-insensitivity~~ (included in lexing in #12)

louis-hildebrand commented 1 year ago

I'm not sure if we handle this already, but we probably want to make sure the relative lexer handles the different aliases for registers (R0 vs A1, R4 vs V1, R14 vs LR, etc.). It may be necessary to start with a simple search and replace (e.g., "A1" --> "R0", "LR" --> "R14") so that students can't fool the analyzer by switching between R registers and A/V registers.
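As a rough illustration of that search and replace, here is a minimal Python sketch. It is not code from fungus-cli: the names `ALIASES` and `normalize_registers` are hypothetical, and the table only covers the standard ARMv7 core register aliases from the procedure call standard (a1-a4, v1-v8, sb, sl, fp, ip, sp, lr, pc). It matches whole words only, so a label such as `a1_loop` is left alone.

```python
import re

# Standard ARMv7 core register aliases mapped to their r-number names.
# Hypothetical table for illustration; not taken from fungus-cli.
ALIASES = {
    "a1": "r0", "a2": "r1", "a3": "r2", "a4": "r3",
    "v1": "r4", "v2": "r5", "v3": "r6", "v4": "r7",
    "v5": "r8", "v6": "r9", "v7": "r10", "v8": "r11",
    "sb": "r9", "sl": "r10", "fp": "r11", "ip": "r12",
    "sp": "r13", "lr": "r14", "pc": "r15",
}

# An identifier-like word: a letter, underscore, dot or dollar sign,
# followed by any number of those characters or digits.
_WORD = re.compile(r"[A-Za-z_.$][A-Za-z0-9_.$]*")


def normalize_registers(line: str) -> str:
    """Replace whole-word register aliases with their r-number names.

    The lookup is case-insensitive; words that are not aliases are
    returned unchanged, including their original capitalization.
    """
    def replace(match: re.Match) -> str:
        word = match.group(0)
        return ALIASES.get(word.lower(), word)

    return _WORD.sub(replace, line)


print(normalize_registers("add r1, a1, #2"))  # add r1, r0, #2
print(normalize_registers("bx LR"))           # bx r14
```

One caveat: a plain textual pass like this would also rewrite a label or symbol that happens to be named `fp` or `ip`, and it would touch comments unless it runs after comment removal.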

Gadiguibou commented 1 year ago

Do you mean the naive lexer? The relative one just treats all of those as "symbols" anyways.

louis-hildebrand commented 1 year ago

No, I mean the relative lexer. The naive one already identifies registers by their number and therefore considers A1 and R0 to be the same, right?

Consider the following situation:

student1.s:

mov r0, #1
add r1, r0, #2

student2.s:

mov r0, #1
add r1, a1, #2

In the first case, the second occurrence of r0 will have a positive offset. In the second case, a1 does not occur earlier and so it will have an offset of 0. By mixing A and R registers, student 2 was able to copy student 1's code without being detected.

While we're at it, we should also test that mixing capitalization (e.g., sometimes a1, sometimes A1) doesn't similarly fool the relative lexer.
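The situation above can be reproduced with a small Python sketch. It assumes a simplified model of the relative lexer in which every identifier-like token is a symbol whose value is the distance back to its previous occurrence, or 0 if it has not been seen before. The function `relative_offsets` and the abbreviated `ALIASES` table are hypothetical and are not the fungus-cli implementation.

```python
import re

# Abbreviated alias table, for illustration only.
ALIASES = {"a1": "r0", "a2": "r1", "a3": "r2", "a4": "r3", "lr": "r14"}


def relative_offsets(source: str, canonicalize: bool = False) -> list[int]:
    """Return, for each symbol, the distance back to its previous occurrence.

    A symbol that has not occurred before gets offset 0. Input is
    lowercased first, so capitalization never affects the result.
    """
    last_seen: dict[str, int] = {}
    offsets: list[int] = []
    symbols = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source.lower())
    for index, symbol in enumerate(symbols):
        if canonicalize:
            symbol = ALIASES.get(symbol, symbol)
        offsets.append(index - last_seen[symbol] if symbol in last_seen else 0)
        last_seen[symbol] = index
    return offsets


student1 = "mov r0, #1\nadd r1, r0, #2"
student2 = "mov r0, #1\nadd r1, a1, #2"

print(relative_offsets(student1))                     # [0, 0, 0, 0, 3]
print(relative_offsets(student2))                     # [0, 0, 0, 0, 0]
print(relative_offsets(student2, canonicalize=True))  # [0, 0, 0, 0, 3]
```

Without canonicalization the two submissions produce different offset sequences, so they no longer match. With it, they are identical. Because the sketch lowercases its input, `R0` and `r0` also compare equal, which is the property the capitalization test should check.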

Gadiguibou commented 1 year ago

Capitalization is already handled, but I don't think this is a priority. It requires a special rule for every register alias, and it won't work for FPU registers or for other architectures with more or fewer registers or different register names, such as ARMv8 or Cortex-A.