dandi / dandisets

737 Dandisets, 812.2 TB total. DataLad super-dataset of all Dandisets from https://github.com/dandisets

zarrs from 000108 are still without `stats` in their .git/config #276

Closed yarikoptic closed 1 year ago

yarikoptic commented 1 year ago

This is more of a "notes taking" issue.

#272 added storing stats for zarrs within .git/config to speed up computation of entire dandiset sizes. The diff looks kosher to me, and I think we had one full backup run on 000108, but so far we have only 1 such record across all zarrs:

(base) dandi@drogon:/mnt/backup/dandi/dandizarrs$ grep stats */.git/config
7723d02f-1f71-4553-a7b0-47bda1ae8b42/.git/config:       stats = 2a03f2e94462acc111fc28b630560d1d0e603d83,1325,3024625521

which was odd to me since AFAIK all zarrs for a dandiset should get it.... but this one is not from 000108 but 000243! ;)
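For reference, a record like the one above can be split into its three comma-separated fields. This is only a sketch: the thread does not spell out what the fields mean, so the interpretation below (commit hash the stats were computed for, file count, total size in bytes) is an assumption, and `ZarrStats` and `parse_stats_line` are hypothetical names, not part of backups2datalad.

```python
from typing import NamedTuple


class ZarrStats(NamedTuple):
    zarr_id: str  # directory name, i.e. the Zarr's ID
    commit: str   # assumed: commit the stats were computed for
    files: int    # assumed: number of files
    size: int     # assumed: total size in bytes


def parse_stats_line(line: str) -> ZarrStats:
    """Parse one line of `grep stats */.git/config` output."""
    path, _, rest = line.partition(":")
    key, _, value = rest.partition("=")
    if key.strip() != "stats":
        raise ValueError(f"not a stats record: {line!r}")
    commit, files, size = value.strip().split(",")
    return ZarrStats(path.split("/")[0], commit, int(files), int(size))


record = parse_stats_line(
    "7723d02f-1f71-4553-a7b0-47bda1ae8b42/.git/config:"
    "       stats = 2a03f2e94462acc111fc28b630560d1d0e603d83,1325,3024625521"
)
print(record.zarr_id, record.files, record.size)
```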

Current run of the backup for 000108 is still running, so I guess we would need to finish waiting for it to complete first for more conclusive look at the situation.

yarikoptic commented 1 year ago

Even after a complete run -- no updates to stats... I suspect that it simply doesn't do what it is supposed to do: e.g. the last commits

commit 98b24e00db8296260c75ffe1435a9f91877ca721 (HEAD -> draft, github/draft, github/HEAD)
Author: DANDI User <info@dandiarchive.org>
Date:   Tue Oct 4 11:04:22 2022 +0000

    [backups2datalad] 18 files updated

 .dandi/assets-state.json |   2 +-
 .dandi/assets.json       | 396 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-----------------------------------------------------------------------------------------------------------------
 2 files changed, 199 insertions(+), 199 deletions(-)

commit 6cb82de90a3b587954497280eda77764f0c92517
Author: DANDI User <info@dandiarchive.org>
Date:   Fri Sep 30 16:28:31 2022 +0000

    [backups2datalad] 1 file updated

 .dandi/assets-state.json |   2 +-
 .dandi/assets.json       | 450 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-----------------------------------------------------------------------------------------------------------------
 2 files changed, 226 insertions(+), 226 deletions(-)

suggest updates to some (18!) files, but there are no updates to files, only to metadata listings. So most likely subproject states were not updated. I am now running with --verify-timestamps, but I do not think it should change anything. While working on 000108 the server is busy with many git-annex processes working on zarrs, so my suspicion is that updates to them are simply not reflected in the dandiset's submodule states... Actually, while it is still running, I see in the status of the dandiset changes to the state of about 18 .zarrs -- let's see if those get committed when this run finishes.

yarikoptic commented 1 year ago

so indeed -- it finished updating but didn't commit those modified .zarr states! I did git commit --amend to commit them. Any immediate fix ideas, @jwodder? Otherwise I will look into it tomorrow. Meanwhile, stats still aren't populated for any of those zarrs in their .git/config files -- so yet another aspect to debug.

jwodder commented 1 year ago

@yarikoptic I believe this is happening because the code added in #272 only applies to Zarrs stat'ed while stat'ing a containing Dandiset, but this (usually) only happens when running the update-github-metadata command; when running update-from-backup, Zarrs are instead stat'ed here, and the results are cached for use when calculating the stats of the Dandiset. I think the fix will involve passing an argument to get_stats() indicating whether the dataset is a Zarr.
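A rough sketch of the kind of change being proposed, not the actual backups2datalad code: the real signature of get_stats() is not shown in this thread, so every parameter here is an assumption. A plain dict stands in for the repository's .git/config, and the record is only trusted when its commit matches the current HEAD.

```python
from typing import Callable, Dict, Tuple

Stats = Tuple[int, int]  # (file count, total size in bytes)


def get_stats(
    config: Dict[str, str],      # stand-in for the repo's .git/config
    head_commit: str,            # commit the dataset currently points at
    compute: Callable[[], Stats],
    is_zarr: bool,               # hypothetical flag, per the proposal above
) -> Stats:
    """Return stats for a dataset, caching them in `config` for Zarrs."""
    cached = config.get("stats")
    if is_zarr and cached is not None:
        commit, files, size = cached.split(",")
        if commit == head_commit:
            # Cache hit: nothing changed since the stats were stored.
            return int(files), int(size)
    stats = compute()
    if is_zarr:
        config["stats"] = f"{head_commit},{stats[0]},{stats[1]}"
    return stats


calls = []


def slow_count() -> Stats:
    calls.append(1)  # record each expensive recomputation
    return (1325, 3024625521)


config: Dict[str, str] = {}
first = get_stats(config, "2a03f2e", slow_count, is_zarr=True)
second = get_stats(config, "2a03f2e", slow_count, is_zarr=True)
print(first == second, len(calls))
```

The second call is served from the stored record, so the expensive computation runs only once until the commit changes.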

yarikoptic commented 1 year ago

I think the fix will involve passing an argument to get_stats() indicating whether the dataset is a Zarr

why would it matter whether it is a zarr or not -- I think stats could be "cached" in .git/config regardless of the dataset "type", couldn't they?

jwodder commented 1 year ago

@yarikoptic They could be cached for non-Zarrs as well, but they're currently not.

yarikoptic commented 1 year ago

ok, working on a PR to fix the stats situation. Could you analyze why .zarr state updates were not committed and send a PR to fix that aspect?

jwodder commented 1 year ago

@yarikoptic Can you identify the logfile for the run that should have saved updates but didn't?

yarikoptic commented 1 year ago

my bet would be these two -- the largest from yesterday, both mentioning 000108:

(base) dandi@drogon:/mnt/backup/dandi/dandisets/.git/dandi/backups2datalad$ ls -lSa 2022.10.05*log | head -n 2
-rw-r--r-- 1 dandi dandi 573173789 Oct  5 19:02 2022.10.05.00.08.10Z.log
-rw-r--r-- 1 dandi dandi  18977818 Oct  5 01:57 2022.10.05.05.50.07Z.log
(base) dandi@drogon:/mnt/backup/dandi/dandisets/.git/dandi/backups2datalad$ grep -l 000108 2022.10.05.00.08.10Z.log 2022.10.05.05.50.07Z.log
2022.10.05.00.08.10Z.log
2022.10.05.05.50.07Z.log

jwodder commented 1 year ago

@yarikoptic

yarikoptic commented 1 year ago

FWIW, I wonder whether this is some kind of datalad issue: we use datalad.save, and it is not committing changes that are already present in the index...

jwodder commented 1 year ago

@yarikoptic I believe the problem is due to the fact that we only commit a Dandiset backup if any(r["state"] != "clean" for r in self.ds.status()) is true, and I'm guessing that this check doesn't detect changes in uninstalled subdatasets.
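To make the check quoted above concrete, here is a minimal sketch that applies the same expression to hand-written records. The dicts are stand-ins for what `self.ds.status()` yields (only the `state` key is taken from the expression itself; `path` and the sample values are illustrative), and `is_dirty` is a hypothetical helper name. Whether datalad actually omits uninstalled subdatasets from its status output is a guess in this thread, not something the sketch verifies.

```python
from typing import Dict, Iterable


def is_dirty(status_records: Iterable[Dict[str, str]]) -> bool:
    """Same expression as in the comment above, over any iterable of records."""
    return any(r["state"] != "clean" for r in status_records)


# Illustrative records only; real ones come from self.ds.status().
all_clean = [
    {"path": ".dandi/assets.json", "state": "clean"},
    {"path": "sub-01/data.zarr", "state": "clean"},
]
with_modified_zarr = [
    {"path": ".dandi/assets.json", "state": "clean"},
    {"path": "sub-01/data.zarr", "state": "modified"},
]

print(is_dirty(all_clean), is_dirty(with_modified_zarr))
```

If a changed subdataset never shows up in the records at all, the check has nothing to flag, which is the failure mode being hypothesized.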

yarikoptic commented 1 year ago

but AFAIK a commit did happen, it just didn't commit those paths, only the stuff under .dandi/

jwodder commented 1 year ago

@yarikoptic Oh, I was looking at the wrong part of the log file. Your guess above might be it.

yarikoptic commented 1 year ago

If it is, could you file an issue with datalad with a reproducer please?

yarikoptic commented 1 year ago

FWIW, for now I have upgraded datalad from 0.17.2 to 0.17.6 and will run an update so that we can at least possibly populate those stats records.

jwodder commented 1 year ago

@yarikoptic Issue filed: https://github.com/datalad/datalad/issues/7074

yarikoptic commented 1 year ago

I think the original issue is resolved by now:

(dandisets-2) dandi@drogon:/mnt/backup/dandi/dandizarrs$ grep stats */.git/config | nl | tail
  3996  ff89a092-830a-440c-8953-3f868ae9397a/.git/config:       stats = 8541e2c239b44fd8afb9b4eacc32afba919a6bb8,103962,90955848774
  3997  ff8b273a-8f20-470e-b0d5-b316d6279a55/.git/config:       stats = 957130d6a4041efd9d493f7aad2d916752e11b98,51259,12432769841
  3998  ffaee4f2-f307-43cd-a035-c0dbe00b1d51/.git/config:       stats = eadc8eb2341cd0f7ee13aef0be723ae7a7d0f5fc,96936,96917197195
  3999  ffc75449-ff75-4a33-84f9-3ac6eee72d8e/.git/config:       stats = 7184689cca24370c3e00e324b0d6659a37559ef2,52129,50129090934
  4000  ffd7e5bd-7d35-41bd-8e40-36cd9d0cade5/.git/config:       stats = 91c6b31073cb80ecd4fd6472fc3a218d6a1c444b,29026,48203187348
  4001  ffdce32f-6cde-4ddc-ba29-46470f5bf7de/.git/config:       stats = ad1a4934ca2af99ab01259ccfccade581f7e81d4,69094,79471372783
  4002  ffdf2cc5-5e48-4105-b9c3-37ad9c8bcb88/.git/config:       stats = ce44ead23da0d3fccd5360cdd170172ed8b77d05,83735,101481881467
  4003  ffe18d8f-799a-4f92-aae1-700c34d53a66/.git/config:       stats = 08c6badcd444602b1d531fecab94bc06c81e0312,96616,46161959101
  4004  fff0788e-5535-4afa-8058-bb00f5687053/.git/config:       stats = e45b378a233038271d00ac36a161f0ce439a76e8,96020,26907917584
  4005  fff40b1a-e8ea-45af-8be8-4e97d901def6/.git/config:       stats = 2fd47f34532d467fe8fc10f7e27267b8245d3b50,63579,41202122498
(dandisets-2) dandi@drogon:/mnt/backup/dandi/dandizarrs$ ls -ld *-*-*-* | nl | tail -n 1
  4005  drwxr-sr-x 1 dandi dandi 126 Jul  5  2022 fff40b1a-e8ea-45af-8be8-4e97d901def6
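
The comparison above (4005 stats records vs. 4005 zarr directories) could also be scripted. The following is a sketch under assumptions: `zarrs_missing_stats` is a hypothetical helper, it treats every subdirectory of the given root as a zarr repository, and it looks for any `stats = ...` line in `.git/config` without checking which section it lives in. The demo builds a throwaway directory rather than touching the real backup, and the section name `[example]` is made up.

```python
import tempfile
from pathlib import Path
from typing import List


def zarrs_missing_stats(root: Path) -> List[str]:
    """Names of subdirectories whose .git/config has no `stats` key."""
    missing = []
    for repo in sorted(p for p in root.iterdir() if p.is_dir()):
        config = repo / ".git" / "config"
        text = config.read_text() if config.is_file() else ""
        has_stats = any(
            line.split("=", 1)[0].strip() == "stats"
            for line in text.splitlines()
            if "=" in line
        )
        if not has_stats:
            missing.append(repo.name)
    return missing


# Demo on a temporary layout with two fake zarr repositories.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    fake_repos = [
        ("aaaa-zarr", "[example]\n\tstats = abc123,2,10\n"),  # has stats
        ("bbbb-zarr", "[core]\n\tbare = false\n"),            # no stats
    ]
    for name, body in fake_repos:
        (root / name / ".git").mkdir(parents=True)
        (root / name / ".git" / "config").write_text(body)
    missing = zarrs_missing_stats(root)

print(missing)
```

An empty result on the real `dandizarrs` directory would correspond to the matching counts shown above.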