exyi / anyexec2C

converts anything executable to C, C# or Python code

Feature request: compression #11

Open mvolfik opened 1 year ago

mvolfik commented 1 year ago

ReCodEx file size limits are small and Rust executables (even stripped) are big :(

vakabus commented 1 year ago

Have a look at UPX. This project implements compression really well and it should help. If I remember correctly, it worked with Recodex.

exyi commented 1 year ago

What is the current ReCodEx limit? It should be possible to squeeze into ~0.5 MB, but if it's 50 kB then it's impossible (even a C hello world is a 20 kB binary)

I guess I don't need to tell you, but don't forget to

[profile.release]
opt-level = "z"
codegen-units = 1
lto = true
panic = "abort"

Another major problem is that we need to link musl statically, AFAIK, since something doesn't work in Recodex when we rely on glibc.

mvolfik commented 1 year ago

Maybe the limits are set by the teachers? An exercise I'm trying to solve now is limited at 256 KiB.

Thanks Vašku, that got me below the limit, but now all cases fail with exit code 1. Can't guarantee the issue isn't an issue in my program though :) (EDIT: fixed by linking with musl)

But if I understand correctly, UPX compresses the binary while keeping it working as before, no? What I was thinking was that anyexec2C could take advantage of the fact that it preprocesses the binary before executing it, so it could just pipe everything through GZIP or something (if implementing the decoder doesn't create more code than it saves)
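
To make the trade-off concrete, here is a minimal Python sketch, not anything anyexec2C actually does. It assumes the payload is embedded as base64 text and uses zlib (the DEFLATE algorithm that gzip is built on) from the standard library. `embedded_size` is a hypothetical helper, and the sketch ignores the size of the decoder that would have to ship inside the generated source.

```python
import base64
import zlib


def embedded_size(payload: bytes, compress: bool) -> int:
    """Return how many text characters the payload needs once embedded.

    Hypothetical helper: assumes base64 embedding, which may differ
    from what anyexec2C really emits.
    """
    data = zlib.compress(payload, 9) if compress else payload
    return len(base64.b64encode(data))


if __name__ == "__main__":
    # Stand-in for a real binary: repetitive bytes compress very well,
    # a real executable will compress far less.
    fake_binary = b"\x00\x01\x02\x03" * 50_000
    print("plain:     ", embedded_size(fake_binary, compress=False))
    print("compressed:", embedded_size(fake_binary, compress=True))
```

Running it on a real stripped binary instead of the synthetic bytes would show whether the saving is larger than the decoder it would require.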

(Btw I was doing opt-level = 's', but that helps very little, and forgot about the rest, so thanks for that!)

exyi commented 1 year ago

Yes, the limits are set by teachers, but seems that 256K is the default (https://github.com/ReCodEx/api/blob/036b6a53284f926186e3397c0e5e323e7788edf1/app/config/config.neon#L186)

krulis-martin commented 1 year ago

🤔 maybe it's time to lower the default...

vakabus commented 1 year ago

@mvolfik But if I understand correctly, UPX compresses the binary while keeping it working as before, no?

Kind of. It converts the executable into a self-extracting compressed archive. There is a bootstrap stage, which sort of replaces itself in-place with the decompressed code. I don't think that compressing it ourselves would do much better. We could try to invoke some standard compression tools already present on the system. That would definitely remove the need for the decompression algorithm itself to be stored in the executable. But I think that some really smart people worked on UPX and it will be really hard to beat it.

I think (I haven't measured anything) it should be possible to gain more by finding a way to embed the executable more efficiently into the generated source code. base85 would give us about 5% better storage efficiency. Or maybe there is something even better. I always wanted to make it so that the preprocessor would load the content from a standalone file during compilation. I've never figured out how to do that though. :roll_eyes:
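
The base85 estimate can be checked with a few lines of Python. This is only a sketch: it assumes the current embedding is base64 (4 characters per 3 bytes) and compares it against the standard library's `base64.b85encode` (5 characters per 4 bytes). `encoded_sizes` is a hypothetical helper name, and the real gain also depends on whether the chosen alphabet needs escaping inside a string literal.

```python
import base64


def encoded_sizes(payload: bytes) -> tuple[int, int]:
    """Return (base64_length, base85_length) for the given payload."""
    return len(base64.b64encode(payload)), len(base64.b85encode(payload))


if __name__ == "__main__":
    payload = bytes(range(256)) * 1024  # 256 KiB of sample data
    b64, b85 = encoded_sizes(payload)
    print(f"base64: {b64} chars, base85: {b85} chars")
    print(f"base85 is {100 * (1 - b85 / b64):.2f}% smaller")
```

In the ideal case the ratio is (5/4) / (4/3) = 15/16, so the encoded text shrinks by roughly 6%, in line with the rough figure above.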

Feel free to improve upon our work. PRs are always welcome. :wink:

@krulis-martin :thinking: maybe it's time to lower the default...

Sure, you can definitely do that. That makes me genuinely interested though. Why? What's the point? If the goal of the exercises is to teach programming and computer science, this project actually does help with that, doesn't it? If advanced students play with the solutions like this, it motivates them to learn more advanced topics than they would otherwise. Dynamic linking (e.g. difference between Rust standard release target and musl), existence of debug symbols, self-modifying code with UPX, unpacking of executables without touching the disk, different compiler optimizations. At least I personally have learnt a lot by developing the tool and working with it. And it sure didn't make the tasks easier. Only more fun. :smile:

krulis-martin commented 1 year ago

I am just teasing you...

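However, the real question is, why one should use anyexec in ReCodEx at all. ReCodEx currently supports a wide variety of runtime environments (including two types of Rust projects), so there are only two possibilities I can see:

either you are doing something you are not supposed to be doing (does not make much sense to study C++ course and submit solutions in Rust...),
or the teacher has not created the assignment properly (not allowing the desired runtimes).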

The former case should be prevented (but that is the business of the teacher), and the latter case should be fixed by talking to the teacher, not by hacking the system. MFF ReCodEx is currently harboring roughly 450k solutions. If everybody used anyexec for their solutions, I would estimate that the total size of the files would grow at least by an order of magnitude. Just something to think about...

vakabus commented 1 year ago

However, the real question is, why one should use anyexec in ReCodEx at all. ReCodEx currently supports a wide variety of runtime environments (including two types of Rust projects)

True. And I don't have a single answer to that. We developed it roughly 5 years ago when neither Rust nor D was supported at all. So in our case, it was a teacher willing to let us play with it and a lack of support for desired technologies. I have no idea how it changed since then... :shrug:

MFF ReCodEx is currently harboring roughly 450k solutions. If everybody used anyexec for their solutions, I would estimate that the total size of the files would grow at least by an order of magnitude. Just something to think about...

That's a great reason for low limits. I haven't thought of that. If I calculate correctly, that's at most 100GB of source code. That's... not a little. :sweat_smile:
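
The back-of-the-envelope estimate can be reproduced as follows. This is a sketch under stated assumptions: every one of the roughly 450k solutions is taken to hit the 256 KiB default limit mentioned earlier in the thread, and `total_bytes` is a hypothetical helper. The result lands in the same ballpark as the figure above.

```python
KIB = 1024


def total_bytes(solutions: int, bytes_per_solution: int) -> int:
    """Upper bound on storage if every solution hits the size limit."""
    return solutions * bytes_per_solution


if __name__ == "__main__":
    solutions = 450_000   # "roughly 450k solutions" from the thread
    limit = 256 * KIB     # default limit mentioned earlier (assumed to apply everywhere)
    print(f"{total_bytes(solutions, limit) / 1e9:.1f} GB")
```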

exyi commented 1 year ago

Well, if there is no reason to use it, I wonder why @mvolfik is spending time with it...

either you are doing something you are not supposed to be doing (does not make much sense to study C++ course and submit solutions in Rust...),

If the assignment isn't supposed to be pointless, the teacher has to read the solutions anyway. Even in C# course, I needed to use anyexec, because ReCodex's handling of the .NET runtime was broken (.NET didn't know about memory limits, so it didn't bother with GC)

or the teacher has not created the assignment properly (not allowing the desired runtimes).

Or the desired runtimes don't exist or are broken. There is still no D support, and the last time I tried Rust, it was borderline unusable. It was using an old version of the compiler, without nightly features enabled (it was ~2018/2019 when stable Rust wasn't usable). The second thing is that Rust is nearly unusable if you can't install libraries, given their std-lib policy. You can't even generate random numbers.

MFF ReCodEx is currently harboring roughly 450k solutions. If everybody used anyexec for their solutions, I would estimate that the total size of the files would grow at least by an order of magnitude. Just something to think about...

So it would be half a terabyte if everybody had 1M submissions. What an enormous cost of 4€ per month to store it in cloud. (and you could compress/deduplicate it quite well, if you didn't force us to use UPX :joy:)

krulis-martin commented 1 year ago

If you have any suggestions on how to improve any of the runtimes, you should have passed them to your teachers or directly to recodex@mff.cuni.cz. There are some limitations caused by cost/manpower shortage (for instance, I am not willing to spend too much time on maintaining custom compilations of borderline environments when slightly older but maintained rpm package exists --- referring to Rust in this case), but as long as it is a matter of better configuration, I think that reasonable requests can be accommodated. In fact, having an expert opinion on the more exotic runtimes could really help since I do not consider myself an expert on languages like Rust, Go, Groovy, Scala, or Kotlin, nor do we have anyone else for that.

Furthermore, to my best knowledge, most courses using ReCodEx have pre-defined languages. So I would really like to know where you would use D (and if it stands to reason, we can add it). Similarly, Rust is used only in Rust seminars, where cargo is used for compilation and the environment was set (to the best of my abilities) to the teacher's requirements. If you are referring to Programming I and II courses (which are probably the only ones that allow a variety of languages) I think the time would be better spent on other endeavors for first-year students, if you do not find the assignments challenging (e.g., learning one of the prescribed languages really well instead of using Rust or D).

Storing ReCodEx data in the cloud is a no-go for many reasons, most importantly it would be illegal (technically it could easily cost about 20M a year if I remember the fines correctly). Furthermore, the data are being copied or backed up, so it is not just a matter of storage, but also a matter of network throughput and processing power, which again needs to be on-premises and we do not have an abundance of. (This answer is a shortcut and a half-truth, the situation is much more complex...)

Bottom line: at this point, using anyexec has not caused any problems yet that I am aware of and since it is not a mainstream tool, we do not have to deal with space consumption issues. It just strikes me odd how much time you are willing to spend hacking a system instead of trying to improve it for everyone (besides perhaps you might consider it fun). Until it becomes a problem, I am not going to do anything about it; however, bear in mind that it would be quite easy to detect and prevent anyexec submissions should the teachers request it (not to mention that using it without a priori permission could be interpreted as a form of cheating at some courses).

exyi commented 1 year ago

... I think the time would be better spent on other endeavors for first-year students,

I'd suggest to not stop students from doing whatever they find interesting. We are usually able to find something interesting to learn, given the environment.

I personally much prefer learn something (D and Rust) because I want to, rather than because I have to. You might have different priorities, but I find it better to learn the interesting ideas behind tech X, instead of some 80's tech with only glitches and limitations to learn. (much prefer => I'll actually learn something)

If you have any suggestions on how to improve any of the runtimes, you should have passed them to your teachers

I did that. ...can't replicate it now, since it's no longer possible to submit archived tasks :(


It's not my fault that you aren't doing incremental backups. I know this isn't going to be widely used, for most students it's easier to stick to the taught language. Since there are courses where we were supposed to submit multi-MB data files, I know that the cause of the potential anyexec limitation would be a grudge, not a technical problem.


It just strikes me odd how much time you are willing to spend hacking a system instead of trying to improve it for everyone

Because this is the way to improve the system. I can't ask you to add support for adding support for Rust, Go, D, Zig, Nim, OCaml, Swift, Rust nightly, C++ with some obscure compiler flags, C++ with unstable features, ... After you add D support, I'll ask for adding a D support with the other compiler, because it can optimize for my task better... I understand this is not the way. Submitting executables would be too obvious hack for cheaters evading plagiarism detection, anyexec submits the original source code in a comment for that reason.

If you want us to stop doing this, make the tasks open-data so students can use whatever they feel like. Not to mention that it's just pure evil to let students spend this much time debugging by exit codes :facepalm:

krulis-martin commented 1 year ago

I'd suggest to not stop students from doing whatever they find interesting. We are usually able to find something interesting to learn, given the environment.

Nobody is stopping anybody from learning anything. Moderating what can or cannot be done in ReCodEx does not have that broad impact.

I personally much prefer learn something (D and Rust) because I want to, rather than because I have to. You might have different priorities, but I find it better to learn the interesting ideas behind tech X, instead of some 80's tech with only glitches and limitations to learn. (much prefer => I'll actually learn something)

First, the only "tech" from the '80s that I can see is C (and nobody is teaching you that exactly). Second, I do not believe someone born in the late '90s has any experience with coding in the '80s (I know I don't). Third, I have yet to meet a first-year who can code well, and that is a skill that is worth learning (much more than cutting-edge technologies, which can change so fast).

If you have any suggestions ... I did that.

I am not aware of any requests, so I cannot comment on this.

It's not my fault that you aren't doing incremental backups...

Actually, we do. And much more. Again, please refrain from creating arguments about things you know nothing about (such as how ReCodEx works or is deployed). The point was anyexec is highly inefficient, the data consumption was just an illustration.

Because this is the way to improve the system.

The main thing you have "improved" was that we had to introduce submission size limits. Hacking often leads to escalation, rarely to improvements (from the user's perspective).

I can't ask you to add support for adding support for ...

Yes, you can. And if it was something reasonable (and addable with reasonable effort), it would have been added at some point. Whether the teachers will allow you to use it is another thing. Btw. Rust and Go are already supported for quite some time.

If you want us to stop doing this, make the tasks open-data so students can use whatever they feel like.

Please stop mixing arguments against ReCodEx and against particular assignments. The type (and the allowed runtimes) of the assignments lie solely in the hands of the lecturers. If you have problems with that, you should have taken this up with them. And as I wrote before, I am not going to do anything against anyexec, unless the teachers request such a feature.

Not to mention that it's just pure evil to let students spend this much time debugging by exit codes

You should have taken this up with your teacher who created the assignments. It is possible to let students see the judge logs. It is also possible to write custom judges to provide customized feedback for students. And of course, the assignments should be designed in a way that debugging in ReCodEx should not be necessary (though this is not always entirely possible).


EDIT: removed PII

exyi commented 1 year ago

I'd suggest to not stop students from doing whatever they find interesting. We are usually able to find something interesting to learn, given the environment.

Nobody is stopping anybody from learning anything

Didn't say that, it is explicitly stopping me from implementing the assignment in X and forcing me to use C.


Whatever, to be more on the topic of submission size

It's not my fault that you aren't doing incremental backups...

Actually, we do.

Sorry, I misunderstood your previous comment.

So, since the backups are done incrementally, I don't see how bandwidth or processing power could be a bottleneck. Given the main problem is storage:

To compensate for that I propose I give you a pair of 8TB HDDs, more than enough to accommodate the next 5y even if everybody used anyexec for everything and bundled 10M binaries. My only condition is that the minimum of the limit is set to 10M and there is no further action against anyexec usage (from the ReCodEx side, teachers are free to reject it when doing code review). If the limit is lowered in the foreseeable future (let's say 5y), I request a public explanation why and the return of the HDDs (or equivalent ones).

Additionally, I will not add any compression or upx helper options, nor make it a default. This should allow you to compress/deduplicate the larger submissions well, in the unlikely case that anyexec got used by more than 5 students.

Obviously, this won't feed the drives with electricity and so on, but I think it's well compensated by the fact that it's very unlikely there will be more than a few hundred anyexec solutions in the future. Originally I thought about giving you a pair of 4GB flash drives, enough to store all anyexec solutions to this date and most likely in the future (but it would just be another provocation :grin:). You can obviously use the drives for all the submissions, or whatever else you like.

Deal?

krulis-martin commented 1 year ago

it is explicitly stopping me from implementing the assignment in X and forcing me to use C.

No, it is not. You can implement whatever you want. You just won't be able to submit it to ReCodEx, so you need to have another arrangement with your teacher. If it is warranted that the actual language/technology does not matter for a particular assignment and the teacher has the spare capacity...

Btw. when I was a T.A. in the Programming course, I allowed only languages that I understood so I can give reasonable feedback to the students. Yes, you need to live with the limitation that we, stupid teachers, do not know all the languages and do not have the time to learn a new language or technology every time a student is insulted that mainstream technologies should be used for assignment solutions.

...Deal?

No deal. (assuming that this was not a bad joke)

  1. I cannot and will not affect existing exercises in this way without explicit consent from the teachers (and I am quite sure no one will give me that). If you want a particular limit raised, ask the teacher that assigned the exercise.
  2. Your offer of HDDs is irrelevant and quite ridiculous. Even if you make sure they will be compatible with our array (common SATA 8T HDDs definitely won't be), I see no way how to legally accept it. Making a financial donation to the faculty would be also useless since I do not know how to channel the donation to us. Not to mention this could be construed as an attempt at bribery.
  3. The backups are not the only problem. For instance, the entire database (and the files) needs to be copied to the test environment every night (and that cannot be done incrementally).

Btw. something else for consideration: I keep reading about what "you want". I respect that you express your desires, but you also should understand that there are limits to what we can do (in ReCodEx, at MFF, in general). We are experiencing severe money and personnel shortages while the number of students is rising and we are trying to do the best we can under the given circumstances, so expecting completely individual treatment in courses is simply not feasible. I always encourage the students to study other things than what is being taught in the classes (e.g., new languages), but after they mastered what is demanded of them (i.e., after they show they can solve the assignment in the prescribed language) and they need to do it on their own (at least I do not have the capacity to help).

exyi commented 1 year ago

So I did the work to find out how much "harm" I did by using anyexec2C. I don't think there's anybody else who'd use anyexec for every single Prog 1/2 exercise, plus a few more. I also did use D, which is a bit less space efficient than Rust, and I used anyexec to send some large precomputed tables in one assignment.

If you want to check my methodology: I downloaded all submissions from recodex using #12, then ran:

wc (rg --files-with-matches anyexec2c)
wc **

Result is: 16.18%

After compression it's: 6.87% (lzip archive, default settings)

Even if everyone was as "extreme" as I was, your space consumption would rise by one sixth. Is this really a major problem?
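
For anyone without `rg` at hand, the measurement above can be approximated with a short Python sketch. It is not the script that produced the numbers in this comment: `anyexec_share` is a hypothetical helper that assumes a file counts as an anyexec submission when it contains the case-sensitive marker `anyexec2c`, and it compares byte counts only.

```python
import sys
from pathlib import Path


def anyexec_share(root: Path, marker: bytes = b"anyexec2c") -> float:
    """Fraction of all bytes under root that sit in files containing marker."""
    total = 0
    matched = 0
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        data = path.read_bytes()
        total += len(data)
        if marker in data:
            matched += len(data)
    # Avoid dividing by zero when the directory is empty.
    return matched / total if total else 0.0


if __name__ == "__main__":
    root = Path(sys.argv[1] if len(sys.argv) > 1 else ".")
    print(f"{100 * anyexec_share(root):.2f}%")
```

Pointing it at the directory with the downloaded submissions should give a percentage comparable to the first figure; the compressed figure would need the same comparison on the archive sizes.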


it is explicitly stopping me from implementing the assignment in X and forcing me to use C.

No, it is not. You can implement whatever you want. ... and the teacher has the spare capacity...

Wouldn't it be easier if only there was a way to submit it to ReCodex, so the teacher doesn't have to be bothered too much about it? What you propose has much higher barrier on both sides, so it's effectively a stopper. I tried (unlike you, I suppose).

The backups are not the only problem. For instance, the entire database (and the files) needs to be copied to the test environment every night (and that cannot be done incrementally).

:joy:

Why did you bother with incremental backups, then? Something's telling me that anyexec is not causing you the resource problems, weird engineering practices are. Don't tell me you need all production data in testing, that it cannot be done incrementally or at least done less often... (Not to mention that production data in testing DB are one of the bad ideas when it comes to security/privacy protection.)

I keep reading about what "you want". I respect that you express your desires, but you also should understand that there are limits to what we can do

Man, what did I ask you for?

Literally the only thing I actually want is for you to understand that this isn't hacking, it's saving you work to do and making the ReCodEx assignments more bearable for students with a mindset like mine. How many resources does that take?

I always encourage the students to study other things than what is being taught in the classes (e.g., new languages), but after they mastered what is demanded of them (...) and they need to do it on their own

Can't tell, I have heard different reviews (Arduino interrupts...). How exactly do you think I should try implementing the assignments without the test data nor the test env? How much time and energy do you think students have after fighting with the red :-1: for 2 days? Just enough to do it again voluntarily?

It's like the Czech language teacher: "noo, you shouldn't read those mandatory books in the original language, you need to learn Czech first, English/German second"

When I already must spend time on the assignment, I want to maximize the utility for me. I just don't value being able to navigate all the C (with vector) glitches as much as having a general ability to use the technology available. You might have different priorities, but why force it on everybody else? Plus, doing something at least a bit interesting means that I'll actually learn something, not just make random changes and copy paste from SO, and I don't think I'm the only one with such personality traits.

Does it even matter if students have the ability to code using your favorite tech, not any other? Is this bit of freedom of choice the thing you are actually fighting against?