cmu-pasta / linux-kernel-enriched-corpus

Linux Kernel Fuzzer Corpus
https://www.proquest.com/openview/c349f2bc979f6b0efc013c31e1ceeb10/1?pq-origsite=gscholar&cbl=18750&diss=y
MIT License
124 stars 15 forks

Large repo size due to binary files change history #3

Closed nkiryushin closed 3 months ago

nkiryushin commented 3 months ago

Thank you for your great work on this project; it is a useful tool for improving syzkaller-based bug detection!

However, there is a problem with the initial download of the repo: the history of auto-pushed binary corpora is now quite large. The corpora themselves are not huge, but because they are binary and a robot pushes them to the repo frequently, a naively cloned repo ends up quite large (13 GB, about 2 times larger than the Linux kernel repo). This not only uses a lot of disk space (which may be limited on a syzkaller-based fuzzer machine) but also makes cloning slow. As the robot keeps pushing binary corpora, the problem may get even worse in the future.

My solution was to use the git clone --filter=blob:none option when cloning, which results in a much more manageable repo size of 581 MB.
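For reference, a blobless clone could look roughly like the sketch below. The repository URL is inferred from the repo name, and `run`, `blobless_clone`, and `DRY_RUN` are hypothetical helper names used only for illustration. With `--filter=blob:none`, git fetches commits and trees up front and downloads file contents only when they are needed, for example at checkout.

```shell
# Repository URL inferred from the repo name (an assumption, not taken from the README).
REPO_URL="https://github.com/cmu-pasta/linux-kernel-enriched-corpus.git"

# Run a command, or only print it when DRY_RUN=1 (useful for checking without network access).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Blobless clone: history (commits and trees) is fetched, but historical blobs
# are left on the server until git actually needs them.
blobless_clone() {
    run git clone --filter=blob:none "$1" "${2:-linux-kernel-enriched-corpus}"
}

# Usage (requires network access):
#   blobless_clone "$REPO_URL"
#   du -sh linux-kernel-enriched-corpus
```

Operations that walk old file contents, such as `git log -p` or `git blame`, may trigger additional downloads later, so this trades clone time for occasional on-demand fetches.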

I would suggest adding a blobless-clone recommendation to the DIY section of the README file.

oswalpalash commented 3 months ago

You do not need to clone the repo to use the artifacts; you can download the releases directly. You may also be able to download the latest source from https://github.com/cmu-pasta/linux-kernel-enriched-corpus/archive/refs/heads/main.zip, which is currently 243 MB.
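A minimal sketch of the snapshot download, assuming `curl` and `unzip` are installed; `fetch_snapshot` and the default output file name are hypothetical. The archive contains only the current tree of `main`, so no git history comes with it.

```shell
# Snapshot URL as given above.
ARCHIVE_URL="https://github.com/cmu-pasta/linux-kernel-enriched-corpus/archive/refs/heads/main.zip"

# Download the zip snapshot of main and unpack it into the current directory.
# -L follows GitHub's redirect; --fail makes curl exit non-zero on HTTP errors.
fetch_snapshot() {
    archive="${1:-corpus-main.zip}"
    curl -L --fail -o "$archive" "$ARCHIVE_URL" && unzip -q "$archive"
}

# Usage (requires network access):
#   fetch_snapshot
```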