DavHau / pypi-deps-db

Dependency DB for python packages on pypi
MIT License
66 stars, 40 forks

remove whitespace #9

Closed milahu closed 2 years ago

milahu commented 2 years ago

let's save some space and time (parsing time grows linearly with file size)

$ du -sh pypi-deps-db*
1.2G    pypi-deps-db-master
832M    pypi-deps-db-master-noindent
778M    pypi-deps-db-master-nowhitespace

$ du -sh *.zip
82M pypi-deps-db-master.zip
72M pypi-deps-db-master-noindent.zip
71M pypi-deps-db-master-nowhitespace.zip

after running

#! /bin/sh
# noindent: strip leading whitespace from every line
set -e
find . -name '*.json' | while IFS= read -r f; do
  # quote "$f" so paths with spaces survive
  sed -i -E 's/^\s+//' "$f"
done

and

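#! /bin/sh
# nowhitespace: re-serialize every file in compact form
set -e
find . -name '*.json' | while IFS= read -r f; do
  # sponge (moreutils) buffers all input before overwriting "$f"
  jq -c . "$f" | sponge "$f"
done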

as expected ...

$ head -c100 pypi-deps-db-master/sdist/aa.json
{
  "aa-statistics": {
    "0.2.1": {
      "27": {
        "install_requires": [
          "allianc

$ head -c100 pypi-deps-db-master-noindent/sdist/aa.json 
{
"aa-statistics": {
"0.2.1": {
"27": {
"install_requires": [
"allianceauth>=2.8.0"
]
},
"36": "27",

$ head -c100 pypi-deps-db-master-nowhitespace/sdist/aa.json
{"aa-statistics":{"0.2.1":{"27":{"install_requires":["allianceauth>=2.8.0"]},"36":"27","37":"27","38

to read the files with whitespace

$ jq -r . pypi-deps-db-master-nowhitespace/sdist/aa.json | head -c100
{
  "aa-statistics": {
    "0.2.1": {
      "27": {
        "install_requires": [
          "allianc
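
A rough way to see how many bytes of a single file are pure indentation, as a sketch only: `indent_overhead` is a hypothetical helper name, and it assumes a `sed` that accepts `-E`. It counts the same leading whitespace that the noindent script above strips.

```sh
#!/bin/sh
# Print how many bytes of FILE are leading whitespace (indentation).
# Hypothetical helper for illustration, not part of pypi-deps-db.
indent_overhead() {
  file=$1
  # $(( ... )) normalizes the padded number that some wc implementations print
  total=$(( $(wc -c <"$file") ))
  stripped=$(( $(sed -E 's/^[[:space:]]+//' "$file" | wc -c) ))
  echo "$(( total - stripped ))"
}
```

For example, `indent_overhead pypi-deps-db-master/sdist/aa.json` would print the number of bytes the noindent variant saves for that one file.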

DavHau commented 2 years ago

I think with this change each commit will be large, as it replaces the whole line. Also, git diffability is broken by that.

milahu commented 2 years ago

> each commit will be large as it replaces the whole line

git stores snapshots, not diffs

git packfile deltas are byte-based, not line-based

a benchmark on git compression

```sh
#! /usr/bin/env bash

# compare storage format

multiline=true # store one value per line
# $ head -c 10000 test-git-insert-one-line-multiline-true/a.txt | wc -l
# 15
# size summary:
# 267M .git unpacked
# 2.7M .git after gc
# 2.9M .git after repack
# 2.9M .git after reflog
# 1.3M .git after gc --aggressive --prune=now

#multiline=false # store all data on one line
# $ head -c 10000 test-git-insert-one-line-multiline-false/a.txt | wc -l
# 0
# size summary:
# 268M .git unpacked
# 2.7M .git after gc
# 3.0M .git after repack
# 3.0M .git after reflog
# 1.3M .git after gc --aggressive --prune=now

d="test-git-insert-one-line-multiline-$multiline"
f=a.txt
i_max=1000

# keep headers small
# help with compression
export GIT_AUTHOR_NAME="x"
export GIT_AUTHOR_EMAIL=""
export GIT_AUTHOR_DATE="1970-01-01 00:00:00 +0000"
export GIT_COMMITTER_NAME="x"
export GIT_COMMITTER_EMAIL=""
export GIT_COMMITTER_DATE="1970-01-01 00:00:00 +0000"

rm -rf $d
mkdir $d
(
cd $d
git init >/dev/null
echo hello world >$f
git add $f >/dev/null
git commit -m a >/dev/null

echo add commits 1 to $i_max
# real 10m
time for ((i=0;i<i_max;i++))
do
  # [...] the code that sets $h, $lines and $cut is missing from this copy of the script
  #echo "lines $lines -> cut $cut" >&2
  #printf "%s " "$cut" >&2
  headflag=$(if $multiline; then printf "%s" "-n"; else printf "%s" "-c"; fi)
  start="$(head $headflag $((cut - 1)) $f)";
  end="$(tail $headflag +$cut $f)";
  format="%s$(if $multiline; then printf '\\n'; else printf " "; fi)"
  (
    printf "$format" "$start" ;
    printf "$format" "i=$i h=$h";
    printf "$format" "$end";
  ) >$f;
  git commit -m "a" -a >/dev/null
  # send one byte to pv
  printf .
done \
| pv -s $i_max >/dev/null

set -x

size_summary=""

s=$(du -sh .git)
size_summary+="$s unpacked"
echo "unpacked size: $s"

time git gc # >/dev/null 2>&1
s=$(du -sh .git)
size_summary+=$'\n'"$s after gc"
echo "size after gc: $s"

git config core.looseCompression 0
git config pack.compression 6
time git repack -a -d -F
s=$(du -sh .git)
size_summary+=$'\n'"$s after repack"
echo "size after repack: $s"

time git reflog expire --all --expire=now
s=$(du -sh .git)
size_summary+=$'\n'"$s after reflog"
echo "size after reflog: $s"

time git gc --aggressive --prune=now
s=$(du -sh .git)
size_summary+=$'\n'"$s after gc --aggressive --prune=now"
echo "size after gc --aggressive --prune=now: $s"

echo "size summary:"$'\n'"$size_summary"

) # cd $d

# less -S test-git-insert/$f
```

> diffability is broken

human-readability is less important than runtime performance

milahu commented 2 years ago

nevermind : )

the time (and space) for json whitespace pales in comparison to build time (and space)

DavHau commented 2 years ago

Hey, I'm definitely thankful you looked into this. It's good to have the numbers now. But yes, I agree: I think it's not really worth it right now. The main bottleneck for me is often networking, but a reduction from 82M to 71M isn't enough to sacrifice human readability (readability without requiring extra tools etc.).

milahu commented 2 years ago

we could update the database with git pull

cargo is doing this with the crates.io-index repo
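
A minimal sketch of that update flow, assuming the database lives in a local directory that may or may not already be a git checkout. `db_update_command` is a hypothetical helper, not something pypi-deps-db or cargo provides; it only prints the command it would run, so the choice between a full clone and a cheap incremental pull is visible without touching the network.

```sh
#!/bin/sh
# Print the git command that would bring DIR up to date with URL.
# Hypothetical helper for illustration only.
db_update_command() {
  dir=$1
  url=$2
  if [ -d "$dir/.git" ]; then
    # existing checkout: fetch only the new commits
    printf 'git -C %s pull --ff-only\n' "$dir"
  else
    # no checkout yet: a full clone is unavoidable the first time
    printf 'git clone %s %s\n' "$url" "$dir"
  fi
}
```

Piping the output to `sh` would execute it. The incremental branch only works if `$dir/.git` is a functional repository.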

one problem with fetchGit and friends: none of them returns the commit object, which is needed to restore a functional git repo for git pull

git-update-demo.nix

```nix
{ pkgs ? import <nixpkgs> {} }:

let
  src = pkgs.stdenv.mkDerivation rec {
    name = "git-${rev}";
    url = "https://github.com/milahu/random";
    ref = "master";
    rev = "3a890a9b5d9fdfaa4fee92d8f80c36336a2afd7f"; # initial commit
    passthru = { inherit url ref rev; };

    #outputHash = rev; # this is the goal. use the git commit hash to validate the files
    #outputHashMode = "git"; # TODO implement
    #outputHashAlgo = "sha1"; # default for outputHashMode = "git"; may be "sha256" in future

    # using only the commit hash already works with
    #   builtins.fetchGit { url = "..."; rev = "..."; }
    # but the commit object is missing
    # so we cannot reconstruct a functional git repo
    # from which we could fetch "cheap updates"
    # (cheaper than fetching a full clone)

    # workaround: make this work with current nix ...
    outputHash = "sha256-PyESQWMUQZ/ODHjv1caVMPSgeQ9DKDNN+lLEslFUCnM="; # this i want to avoid
    outputHashMode = "recursive";
    outputHashAlgo = "sha256";

    phases = "buildPhase";
    buildPhase = ''
      (
      set -x # debug
      mkdir $out
      cd $out
      git -c init.defaultBranch=main init
      git remote add origin ${url}
      git fetch origin ${rev}
      git -c advice.detachedHead=false checkout ${rev}

      # extract the commit object
      git cat-file commit ${rev} >/tmp/COMMIT
      rm -rf .git
      mkdir .git
      mv /tmp/COMMIT .git/COMMIT
      # .git/COMMIT is a "pseudo standard" location for the uncompressed commit object
      # when nix wants to verify the files by their commit hash,
      # this is *the* location for the commit object
      # alternative: store the commit object in a nix passthru attribute
      # alternative: use the actual standard git location
      #   .git/objects/xx/xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
      # to store the zlib-compressed commit object
      # -> we waste 2x4KB = 8KB for the two folders .git/objects and .git/objects/xx

      # probably not needed. redundant with nix passthru attributes
      # kind-of useful for build tools, to detect the source version
      # but "git rev-parse HEAD" does not work
      # so mkDerivation's unpackPhase would need to reconstruct a functional .git folder
      echo ${rev} >.git/HEAD
      )
    '';
    buildInputs = [
      pkgs.git
      pkgs.cacert # fix: SSL certificate problem: unable to get local issuer certificate
    ];
  };
in

pkgs.stdenv.mkDerivation rec {
  name = "demo-app";
  inherit src;
  unpackPhase = ''
    echo "restoring git repo from $src"
    cp -r $src source
    cd source
    chmod -R +w .
    mv .git/COMMIT /tmp/COMMIT
    rev=$(cat .git/HEAD)
    rm -rf .git
    git -c init.defaultBranch=main init

    # restore blobs and trees
    git add .
    # git reads the name from GIT_AUTHOR_NAME / GIT_COMMITTER_NAME
    GIT_AUTHOR_NAME=nix GIT_AUTHOR_EMAIL= GIT_COMMITTER_NAME=nix GIT_COMMITTER_EMAIL= git commit -m restore

    # restore commit
    mkdir .git/objects/''${rev:0:2} || true
    (
      printf "commit %s\0" $(stat -c%s /tmp/COMMIT);
      cat /tmp/COMMIT
    ) | zlib-flate -compress=1 >.git/objects/''${rev:0:2}/''${rev:2}
    rm /tmp/COMMIT
    git -c advice.detachedHead=false checkout $rev
    git branch -D main
    git branch main
    git checkout main
    echo "restored git repo at revision $rev"
  '';
  buildPhase = ''
    echo "demo: git repo is working"
    (
    set -x
    git rev-parse HEAD
    git status
    git branch | cat
    #git log # error: cannot run less: No such file or directory
    git log | cat
    git fsck --full
    echo "commit object:"
    git cat-file commit $rev
    echo "all git objects, including the 'restore' commit:"
    git cat-file --batch-check --batch-all-objects

    # fetch commits -> cheap update
    git remote add origin ${src.url}
    # fetch all new commits, then go back
    #git pull --ff-only origin ${src.ref}:main
    #git checkout d6f7df5007365378fa31ce825f378e323e381079
    # fetch some commits
    # initial commit + 3
    git pull --ff-only origin d6f7df5007365378fa31ce825f378e323e381079:main
    git log | cat
    )
  '';
  buildInputs = [
    pkgs.git
    pkgs.cacert # fix: SSL certificate problem: unable to get local issuer certificate
    pkgs.qpdf # zlib-flate
  ];
  # allow "git pull"
  outputHash = "";
  outputHashMode = "recursive";
  outputHashAlgo = "sha256";
}
```

this is kind-of a workaround for my rant https://discourse.nixos.org/t/nix-sha256-is-bug-not-feature-solution-a-global-cas-filesystem/15791, where i suggest storing deep clones of source repos outside of /nix/store → compression + cheap updates

getting rid of sha256 for deep clones should be possible, technically

also related https://github.com/NixOS/nixpkgs/issues/89380