Organisation of subprojects

YoshiRulz commented 1 year ago

tl;dr: I'm calling for a moratorium on copying third-party code into the repo, and I'm proposing that we move a few dirs around.

We all hate code duplication. So why is it okay for us to needlessly copy entire codebases? ~~snip~~ (Cut out a bunch of back-and-forth here since I feel none of you need to be convinced this is a good idea, only that it's worth the effort to do it. edit: Apparently natt did need convincing. Scroll down for arguments.)

So let's assess the damage:

/ExternalCoreProjects/Virtu, Virtu core, < 11 kLOC, upstream is https://github.com/digital-jellyfish/Virtu/tree/master/Virtu
/ExternalProjects/NLua, Lua host, < 10 kLOC, upstream is https://github.com/NLua/NLua
/ExternalProjects/iso-parser, disc system, < 2 kLOC, upstream was on Google Code
/blip_buf, multiple cores, < 2 kLOC, upstream was on Google Code
/libmupen64plus, Mupen64Plus and plugins, < 340 kLOC, upstreams are under https://github.com/mupen64plus
/lynx, Handy core, < 12 kLOC, upstream is in tarball (or https://github.com/TASEmulators/mednafen/tree/master/src/lynx)
/psx, Octoshock core, < 68 kLOC, upstream is in tarball (or https://github.com/TASEmulators/mednafen/tree/master/src/psx)
~~/quicknes~~ replaced with submodule
/waterbox/ares64/ares, Ares64 core, < 55 kLOC, upstream is https://github.com/ares-emulator/ares
/waterbox/bsnescore/bsnes, (new) BSNES core, < 67 kLOC, upstream is https://github.com/bsnes-emu/bsnes
/waterbox/gpgx, Genplus-gx core, < 88 kLOC, upstream is https://github.com/ekeeke/Genesis-Plus-GX
/waterbox/libsnes, (old) BSNES core, < 63 kLOC, upstream is https://github.com/bsnes-emu/bsnes
/waterbox/nyma/zlib, zlib for Nyma cores, < 19 kLOC, upstream is https://github.com/madler/zlib
/waterbox/picodrive, PicoDrive core, < 79 kLOC, upstream is https://github.com/notaz/picodrive
/waterbox/tic80, TIC-80 core, < 17 kLOC, upstream is https://github.com/nesbox/TIC-80
/waterbox/uzem, Uzem core, < 5 kLOC, upstream is https://github.com/Uzebox/uzebox/tree/master/tools/uzem
/waterbox/virtualjaguar/src, Virtual Jaguar core, < 52 kLOC, upstream was on a self-hosted Git server
/wonderswan, Cygne core, < 8 kLOC, upstream is in tarball (or https://github.com/TASEmulators/mednafen/tree/master/src/wswan)

In total, the upper bound is 917 kLOC of source code, severed from its trunks and waiting to bitrot. (For reference, just the BizHawk solution is ~450 kLOC, and Mesen is ~325 kLOC.) Not listed are several binaries in /Assets/dll which are rundeps for unmanaged cores. I'll be looking at these as part of my Nix experiments. I also discounted dependencies of these projects whenever I noticed e.g. .../vendor near the bottom of the filesize list—but they are nonetheless checked-in to this repo.

I propose that from today we only use Git submodules for this purpose:

if possible, pulling in the upstream and only checking-in build scripts and patches, or else having a mirror/fork in the TASEmulators org (TODO copy and expand https://github.com/TASEmulators/BizHawk/commit/2af6bf3c34f8a31690fe7e9d8500e54bd01c671d#commitcomment-140199736); and
placing submodules in /submodules, and supplemental files like Makefiles or shell scripts in one of /ExternalProjects, /ExternalCoreProjects, or /waterbox when those are necessary.

I also propose we migrate all subprojects, starting with those listed above, to the same scheme. Adding a submodule and removing the checked-in copy can be squished into a single commit.

Moved here from my personal notes #118; see also #2423, my personal notes #11, and #2312.

vadosnaprimer commented 1 year ago

Bonus points if you manage to not lose our change history to those projects when moving them to submodule. Splitting folder history is possible in git, but re-applying changes from a different tree (even if the code matches) may be harder.

CasualPokePlayer commented 1 year ago

For many of those cores/projects, they're checked in as they have been forked and modified, some fairly lightly (e.g. tic80, which mostly just has waterbox interfacing added and minor adjustments and unneeded files excised) and some very heavily (e.g. virtualjaguar, much of which is unrecognizable compared to upstream), not to mention just ones extremely outdated from upstream (e.g. gpgx). Most of them will require forking in any case and can't just have upstream submoduled.

fork in the TASEmulators org

To be clear this is also mainly the reason I don't go immediately fork in TASEmulators org for projects And getting someone to move a repo to the org is annoying. Way eaiser to just clone into the repo and call it a day, and no downsides (besides BizHawk clone size) if upstream is dead.

vadosnaprimer commented 1 year ago

I didn't know lack of permissions was a roadblock for more cool work. I think that can be arranged.

Morilli commented 1 year ago

To be clear this is also mainly the reason I don't go immediately fork in TASEmulators org for projects [image showing insufficient permissions]

This would also be a blocker for me. I do see the advantage in terms of organization if more stuff was in submodules, but it does require all maintainers to have access to all the relevant projects.

Cleaning up the root folder of the bizhawk repo is independent from that though, and I'd definitely approve moving stuff to a unified folder. There is ExternalCoreProjects, why isn't it used more?

vadosnaprimer commented 1 year ago

If people tell me what to fork under tasemus, I can do that and invite you guys as admins to those forks. I may also discuss adding more admins to the org itself...

YoshiRulz commented 1 year ago

Well, most of the repos I linked should be forked, but pragmatically the larger ones are far more important, and as CPP said some projects have had more extensive changes than others. IMO forks can be made "as needed" i.e. the next time someone wants to update a core.

nattthebear commented 1 year ago

What's the objective here; making it easier to merge upstream changes? In that case, the difficulty comes chiefly from how much the code has diverged from upstream, not the mechanism by which they're stored.

It's a tradeoff we've made different ways different times: Heavy modifications to source lets us get off the ground with something that works for us more quickly. Light modifications to source mean we have to do work that's sometimes more awkward and difficult to integrate with the existing code better, but we can take upstream changes more easily.

I don't think there's one clear answer. The cores that sync with upstream seem to be the better long-term bet, and we can see that especially in cases where we've done it both ways and can compare the two (e.g., early Mednafen efforts vs the Nyma system.) But we also have to consider that some of these ported cores might not even exist if we hadn't taken the "easy" way out.

I'm against any specific requirements here, because we're all volunteer and the most important thing is being useful for the person who will actually work on the core. I think core porters should consider the value of various approaches, but if they want to get hacking, and they're the only one who will do it, let them.

YoshiRulz commented 1 year ago

To be clear, this issue is not arguing against the heavily-modified path. I'm just against adding hundreds of files directly to this repo.

nattthebear commented 1 year ago

I don't understand why that's worth discussion. Apart from concerns on implementing the cores, and concerns about keeping the cores synced with upstream, why does it matter?

YoshiRulz commented 1 year ago

For one, having an extra million LOC in the repo drastically increases the time it takes to clone. That affects both humans and CI. It also makes grep -r, find, etc. take longer and give more false positives.

nattthebear commented 1 year ago

I guess; aren't most devs going to check out most sobmodules anyway?

Morilli commented 1 year ago

aren't most devs going to check out most submodules anyway?

I for one only ever checkout the submodules I require. That said, I don't think the C/C++ cores in the repo contribute much to the clone size; it's probably going to be the assets folder for the most part.

YoshiRulz commented 1 year ago

I guess; aren't most devs going to check out most sobmodules anyway?

Seeing as you don't need the submodules cloned to build the solution, and we don't really advertise their existence, I'd imagine no, most devs aren't cloning the submodules. Like Morilli, I never bother to outside of the rare occasions I need to work with them.

That said, I don't think the C/C++ cores in the repo contribute much to the clone size; it's probably going to be the assets folder for the most part.

Downloading edf5f1513 as a zip (not the same as cloning, but similar, and easier to measure): Screenshot_20231112_183427 So your guess was broadly correct. (IMO that is also a problem; Citra and MAME together make up more than half of /Assets' filesize. See #3505.) But if you omit /Assets, the remaining filesize is split at about a quarter /waterbox, a quarter /src, a quarter /libmupen64plus, and the long tail. Notably, some of those largest white segments are .exes, > 18 MiB combined? Those sound like easy wins.

Morilli commented 1 year ago

Well if we're doing this, here's a probably at least somewhat accurate and representative list of what contributes most to the .git size:

Git object size list

| Root folder | Total object size |---|---| |Assets|868.6MB| |output|749.5MB| |BizHawk.MultiClient|524.1MB| |BizHawk.Client.EmuHawk|418.5MB| |src|253.2MB| |BizHawk.Emulation.Cores|172.9MB| |ppsspp|108.4MB| |waterbox|75.3MB| |BizHawk.Emulation|66.3MB| |References|61.4MB| |BizHawk.Client.Common|56.8MB| |libmupen64plus|54.7MB| |output64|48.4MB| |BizHawk.Client.EtoHawk|45.7MB| |mednafen|30.4MB| ||23.2MB| |psx|13.5MB| |yabause|13.4MB| |libgambatte|10.9MB| |genplus-gx|10.6MB| |citra|10.1MB| |Dist|9.7MB| |vbanext|8.5MB| |LuaInterface|8.4MB| |BizHawk.Emulation.Common|7.9MB| |libHawk|7.7MB| |Tools|7.3MB| |BizHawk.Emulation.DiscSystem|5.0MB| |libsnes|4.8MB| |libmednahawk|4.2MB| |Bizware|4.1MB| |ExternalCoreProjects|4.1MB| |BizHawk.Common|3.8MB| |BizHawk.Client.MultiHawk|3.1MB| |ExternalToolProjects|2.1MB| |Version|1.9MB| |genplus-gx32|1.6MB| |CpuCoreGenerator|1.6MB| |attic|1.5MB| |BizHawk.Client.ApiHawk|1.4MB| |quicknes|1.3MB| |lynx|1.2MB| |ExternalProjects|1.2MB| |wonderswan|1.1MB| Initial list was generated using `git rev-list --objects --all | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' | sed -n 's/^blob //p' | sort --numeric-sort --key=2 | cut -c 1-12,41- | $(command -v gnumfmt || echo numfmt) --field=2 --round=nearest > objects.txt` and then grouped by root folder.

nattthebear commented 1 year ago

we don't really advertise their existence

They're in .gitmodules, what other advertising would be needed?

Anyway, that download size is awful, but the problem there is mostly unrelated to submodules and this ticket, as it's about prebuilt binaries. We'll need to get all of our crusty binary build setups into an easily reproducible build system.

CasualPokePlayer commented 11 months ago

placing submodules in /submodules, and supplemental files like Makefiles or shell scripts in one of /ExternalProjects, /ExternalCoreProjects, or /waterbox when those are necessary.

A minor problem with doing this specifically is cmake being used in places, and cmake does not allow including stuff in parent directories, you can only include stuff in the current directory and subdirectories. So unless you end up also having the cmakelists also be within the /submodules folder alongside the submodule, or possibly in the BizHawk root directory itself, you're screwed.

YoshiRulz commented 11 months ago

That's dumb. And I assume you need CMake on Windows too, so you can't just use a symlink, bind mount, etc. to work around it.

TASEmulators / BizHawk

Organisation of subprojects #3818