dandi / dandi-archive

DANDI API server and Web app
https://dandiarchive.org

repository field is not filled in #1516

Open yarikoptic opened 1 year ago

yarikoptic commented 1 year ago

In the past I proposed https://github.com/dandi/dandi-archive/pull/1103, which never received a single feedback comment. I later closed it under the assumption (I made no record of the evidence) that the issue was resolved.

While working on https://github.com/dandi/dandi-schema/pull/100#issuecomment-1452421004 now I identified that

although many (96) dandisets have `repository` mentioned among assets metadata records

```shell
(dandisets) dandi@drogon:/mnt/backup/dandi/dandisets$ grep -l repository */.dandi/assets.json | xargs ls -l | nl
 1 -rw-r--r-- 1 dandi dandi 325021 Mar 2 01:50 000003/.dandi/assets.json
 2 -rw-r--r-- 1 dandi dandi 250075 Mar 2 01:50 000004/.dandi/assets.json
 3 -rw-r--r-- 1 dandi dandi 456958 Mar 2 01:50 000005/.dandi/assets.json
 4 -rw-r--r-- 1 dandi dandi 164554 Mar 2 01:50 000006/.dandi/assets.json
 5 -rw-r--r-- 1 dandi dandi 167003 Mar 2 01:50 000007/.dandi/assets.json
 6 -rw-r--r-- 1 dandi dandi 548399 Mar 2 01:50 000009/.dandi/assets.json
 7 -rw-r--r-- 1 dandi dandi 512965 Mar 2 01:50 000010/.dandi/assets.json
 8 -rw-r--r-- 1 dandi dandi 306587 Mar 2 01:50 000011/.dandi/assets.json
 9 -rw-r--r-- 1 dandi dandi 677951 Mar 2 01:50 000012/.dandi/assets.json
10 -rw-r--r-- 1 dandi dandi 165751 Mar 2 01:50 000013/.dandi/assets.json
11 -rw-r--r-- 1 dandi dandi 627372 Mar 2 01:50 000015/.dandi/assets.json
12 -rw-r--r-- 1 dandi dandi 315254 Mar 2 01:50 000016/.dandi/assets.json
13 -rw-r--r-- 1 dandi dandi 118042 Mar 2 01:50 000017/.dandi/assets.json
14 -rw-r--r-- 1 dandi dandi 88286 Mar 2 01:50 000019/.dandi/assets.json
15 -rw-r--r-- 1 dandi dandi 12914541 Mar 2 01:50 000020/.dandi/assets.json
16 -rw-r--r-- 1 dandi dandi 610545 Mar 2 01:50 000021/.dandi/assets.json
17 -rw-r--r-- 1 dandi dandi 481904 Mar 2 01:50 000022/.dandi/assets.json
18 -rw-r--r-- 1 dandi dandi 909895 Mar 2 01:50 000023/.dandi/assets.json
19 -rw-r--r-- 1 dandi dandi 2918 Mar 2 01:50 000025/.dandi/assets.json
20 -rw-r--r-- 1 dandi dandi 95494014 Mar 2 11:04 000026/.dandi/assets.json
21 -rw-r--r-- 1 dandi dandi 2553 Mar 2 01:50 000027/.dandi/assets.json
22 -rw-r--r-- 1 dandi dandi 6781 Mar 2 01:50 000028/.dandi/assets.json
23 -rw-r--r-- 1 dandi dandi 14481 Mar 2 01:50 000029/.dandi/assets.json
24 -rw-r--r-- 1 dandi dandi 15811 Mar 2 01:51 000034/.dandi/assets.json
25 -rw-r--r-- 1 dandi dandi 598609 Mar 2 01:51 000035/.dandi/assets.json
26 -rw-r--r-- 1 dandi dandi 127632 Mar 2 01:51 000036/.dandi/assets.json
27 -rw-r--r-- 1 dandi dandi 348131 Mar 2 01:52 000039/.dandi/assets.json
28 -rw-r--r-- 1 dandi dandi 68648 Mar 2 01:52 000041/.dandi/assets.json
29 -rw-r--r-- 1 dandi dandi 188529 Mar 2 01:52 000043/.dandi/assets.json
30 -rw-r--r-- 1 dandi dandi 26140 Mar 2 01:52 000044/.dandi/assets.json
31 -rw-r--r-- 1 dandi dandi 23144712 Mar 2 01:54 000045/.dandi/assets.json
32 -rw-r--r-- 1 dandi dandi 2537 Mar 2 01:53 000048/.dandi/assets.json
33 -rw-r--r-- 1 dandi dandi 260685 Mar 2 01:54 000049/.dandi/assets.json
34 -rw-r--r-- 1 dandi dandi 160832 Mar 2 01:54 000050/.dandi/assets.json
35 -rw-r--r-- 1 dandi dandi 1878 Mar 2 01:54 000051/.dandi/assets.json
36 -rw-r--r-- 1 dandi dandi 911288 Mar 2 01:54 000052/.dandi/assets.json
37 -rw-r--r-- 1 dandi dandi 1119664 Mar 2 01:54 000053/.dandi/assets.json
38 -rw-r--r-- 1 dandi dandi 265174 Mar 2 01:54 000054/.dandi/assets.json
39 -rw-r--r-- 1 dandi dandi 168730 Mar 2 01:54 000055/.dandi/assets.json
40 -rw-r--r-- 1 dandi dandi 168872 Mar 2 01:55 000056/.dandi/assets.json
41 -rw-r--r-- 1 dandi dandi 29168 Mar 2 01:55 000058/.dandi/assets.json
42 -rw-r--r-- 1 dandi dandi 192063 Mar 2 01:55 000059/.dandi/assets.json
43 -rw-r--r-- 1 dandi dandi 259609 Mar 2 01:55 000060/.dandi/assets.json
44 -rw-r--r-- 1 dandi dandi 165866 Mar 2 01:55 000061/.dandi/assets.json
45 -rw-r--r-- 1 dandi dandi 2198 Mar 2 01:55 000064/.dandi/assets.json
46 -rw-r--r-- 1 dandi dandi 1657 Mar 2 01:55 000065/.dandi/assets.json
47 -rw-r--r-- 1 dandi dandi 6656 Mar 2 01:55 000066/.dandi/assets.json
48 -rw-r--r-- 1 dandi dandi 114436 Mar 2 01:55 000067/.dandi/assets.json
49 -rw-r--r-- 1 dandi dandi 4762 Mar 2 01:55 000068/.dandi/assets.json
50 -rw-r--r-- 1 dandi dandi 2640 Jan 27 2022 000069/.dandi/assets.json
51 -rw-r--r-- 1 dandi dandi 23257 Mar 2 01:55 000070/.dandi/assets.json
52 -rw-r--r-- 1 dandi dandi 3494 Mar 2 01:55 000105/.dandi/assets.json
53 -rw-r--r-- 1 dandi dandi 1690 Mar 2 01:55 000107/.dandi/assets.json
54 -rw-r--r-- 1 dandi dandi 11434044 Feb 23 09:34 000108/.dandi/assets.json
55 -rw-r--r-- 1 dandi dandi 1001388 Mar 2 01:55 000109/.dandi/assets.json
56 -rw-r--r-- 1 dandi dandi 172201 Mar 2 01:55 000115/.dandi/assets.json
57 -rw-r--r-- 1 dandi dandi 411942 Mar 2 01:55 000117/.dandi/assets.json
58 -rw-r--r-- 1 dandi dandi 11652 Mar 2 01:55 000122/.dandi/assets.json
59 -rw-r--r-- 1 dandi dandi 12752 Mar 2 01:55 000126/.dandi/assets.json
60 -rw-r--r-- 1 dandi dandi 6888 Mar 2 01:55 000127/.dandi/assets.json
61 -rw-r--r-- 1 dandi dandi 6476 Mar 2 01:55 000128/.dandi/assets.json
62 -rw-r--r-- 1 dandi dandi 6112 Mar 2 01:55 000129/.dandi/assets.json
63 -rw-r--r-- 1 dandi dandi 6624 Mar 2 01:55 000130/.dandi/assets.json
64 -rw-r--r-- 1 dandi dandi 6481 Mar 2 01:55 000138/.dandi/assets.json
65 -rw-r--r-- 1 dandi dandi 6487 Mar 2 01:55 000139/.dandi/assets.json
66 -rw-r--r-- 1 dandi dandi 6479 Mar 2 01:55 000140/.dandi/assets.json
67 -rw-r--r-- 1 dandi dandi 1915543 Mar 2 01:55 000142/.dandi/assets.json
68 -rw-r--r-- 1 dandi dandi 83240 Mar 2 01:55 000143/.dandi/assets.json
69 -rw-r--r-- 1 dandi dandi 4481 Mar 2 01:55 000144/.dandi/assets.json
70 -rw-r--r-- 1 dandi dandi 29971 Mar 2 01:55 000147/.dandi/assets.json
71 -rw-r--r-- 1 dandi dandi 13357 Mar 2 01:55 000149/.dandi/assets.json
72 -rw-r--r-- 1 dandi dandi 2141451 Mar 2 01:56 000165/.dandi/assets.json
73 -rw-r--r-- 1 dandi dandi 81047 Mar 2 01:55 000166/.dandi/assets.json
74 -rw-r--r-- 1 dandi dandi 15667 Mar 2 01:55 000167/.dandi/assets.json
75 -rw-r--r-- 1 dandi dandi 454401 Mar 2 01:56 000168/.dandi/assets.json
76 -rw-r--r-- 1 dandi dandi 357200 Mar 2 01:56 000173/.dandi/assets.json
77 -rw-r--r-- 1 dandi dandi 3364 Mar 2 01:56 000206/.dandi/assets.json
78 -rw-r--r-- 1 dandi dandi 56896 Mar 2 01:56 000207/.dandi/assets.json
79 -rw-r--r-- 1 dandi dandi 764133 Mar 2 01:56 000209/.dandi/assets.json
80 -rw-r--r-- 1 dandi dandi 2735742 Mar 2 01:56 000212/.dandi/assets.json
81 -rw-r--r-- 1 dandi dandi 293264 Mar 2 01:56 000213/.dandi/assets.json
82 -rw-r--r-- 1 dandi dandi 2924657 Mar 2 01:56 000217/.dandi/assets.json
83 -rw-r--r-- 1 dandi dandi 429693 Mar 2 01:56 000218/.dandi/assets.json
84 -rw-r--r-- 1 dandi dandi 178917 Mar 2 01:56 000219/.dandi/assets.json
85 -rw-r--r-- 1 dandi dandi 95422 Mar 2 01:56 000220/.dandi/assets.json
86 -rw-r--r-- 1 dandi dandi 888600 Mar 2 01:56 000221/.dandi/assets.json
87 -rw-r--r-- 1 dandi dandi 71722 Mar 2 01:56 000223/.dandi/assets.json
88 -rw-r--r-- 1 dandi dandi 263858 Mar 2 01:56 000228/.dandi/assets.json
89 -rw-r--r-- 1 dandi dandi 9565293 Mar 2 01:56 000231/.dandi/assets.json
90 -rw-r--r-- 1 dandi dandi 294890 Mar 2 01:56 000232/.dandi/assets.json
91 -rw-r--r-- 1 dandi dandi 1184124 Mar 2 01:56 000233/.dandi/assets.json
92 -rw-r--r-- 1 dandi dandi 37404 Mar 2 01:56 000235/.dandi/assets.json
93 -rw-r--r-- 1 dandi dandi 31704 Mar 2 01:56 000236/.dandi/assets.json
94 -rw-r--r-- 1 dandi dandi 34586 Mar 2 01:56 000237/.dandi/assets.json
95 -rw-r--r-- 1 dandi dandi 19630 Mar 2 01:56 000238/.dandi/assets.json
96 -rw-r--r-- 1 dandi dandi 1917176 Mar 2 01:56 000239/.dandi/assets.json
```
which have a spectrum of schemaVersions, all the way to 0.6.3

```shell
(dandisets) dandi@drogon:/mnt/backup/dandi/dandisets$ grep -l repository */.dandi/assets.json | xargs ls -l | grep -v '2 ... .. ' | sed -e 's,.* ,,g' | xargs grep -h -A1 schemaVersion | grep '^"0' | sort | uniq -c
   8076 "0.4.4",
    477 "0.5.1",
     65 "0.5.2",
  18784 "0.6.0",
   2911 "0.6.2"
  20123 "0.6.2",
      3 "0.6.3"
  30802 "0.6.3",
```
there are 35 with assets listed and no repository mentioned

```shell
(dandisets) dandi@drogon:/mnt/backup/dandi/dandisets$ grep -L repository */.dandi/assets.json | xargs ls -l | grep -v '2 ... .. ' | nl
 1 -rw-r--r-- 1 dandi dandi 4134879 Mar 2 01:50 000008/.dandi/assets.json
 2 -rw-r--r-- 1 dandi dandi 318311 Mar 2 01:51 000037/.dandi/assets.json
 3 -rw-r--r-- 1 dandi dandi 141404 Mar 2 01:55 000148/.dandi/assets.json
 4 -rw-r--r-- 1 dandi dandi 172636 Mar 2 01:56 000226/.dandi/assets.json
 5 -rw-r--r-- 1 dandi dandi 7700 Mar 2 01:56 000243/.dandi/assets.json
 6 -rw-r--r-- 1 dandi dandi 83110 Mar 2 01:56 000244/.dandi/assets.json
 7 -rw-r--r-- 1 dandi dandi 75174 Mar 2 01:56 000245/.dandi/assets.json
 8 -rw-r--r-- 1 dandi dandi 213757 Mar 2 01:56 000246/.dandi/assets.json
 9 -rw-r--r-- 1 dandi dandi 1959099 Mar 2 01:56 000249/.dandi/assets.json
10 -rw-r--r-- 1 dandi dandi 1248825 Mar 2 01:56 000251/.dandi/assets.json
11 -rw-r--r-- 1 dandi dandi 114066 Mar 2 01:56 000288/.dandi/assets.json
12 -rw-r--r-- 1 dandi dandi 33769 Mar 2 01:56 000292/.dandi/assets.json
13 -rw-r--r-- 1 dandi dandi 381751 Mar 2 01:56 000293/.dandi/assets.json
14 -rw-r--r-- 1 dandi dandi 5826 Mar 2 01:56 000294/.dandi/assets.json
15 -rw-r--r-- 1 dandi dandi 82401 Mar 2 01:56 000295/.dandi/assets.json
16 -rw-r--r-- 1 dandi dandi 3993803 Mar 2 01:56 000296/.dandi/assets.json
17 -rw-r--r-- 1 dandi dandi 317763 Mar 2 01:56 000297/.dandi/assets.json
18 -rw-r--r-- 1 dandi dandi 2889 Mar 2 01:56 000299/.dandi/assets.json
19 -rw-r--r-- 1 dandi dandi 44695 Mar 2 01:56 000301/.dandi/assets.json
20 -rw-r--r-- 1 dandi dandi 59523 Mar 2 01:56 000337/.dandi/assets.json
21 -rw-r--r-- 1 dandi dandi 38987 Mar 2 01:56 000339/.dandi/assets.json
22 -rw-r--r-- 1 dandi dandi 2199860 Mar 2 01:56 000341/.dandi/assets.json
23 -rw-r--r-- 1 dandi dandi 26890 Mar 2 01:56 000347/.dandi/assets.json
24 -rw-r--r-- 1 dandi dandi 37848 Mar 2 01:56 000350/.dandi/assets.json
25 -rw-r--r-- 1 dandi dandi 1062745 Mar 2 01:56 000351/.dandi/assets.json
26 -rw-r--r-- 1 dandi dandi 177409 Mar 2 01:56 000362/.dandi/assets.json
27 -rw-r--r-- 1 dandi dandi 7939 Mar 2 01:56 000397/.dandi/assets.json
28 -rw-r--r-- 1 dandi dandi 70056 Mar 2 01:57 000402/.dandi/assets.json
29 -rw-r--r-- 1 dandi dandi 33985 Mar 2 01:57 000404/.dandi/assets.json
30 -rw-r--r-- 1 dandi dandi 634985 Mar 2 01:57 000405/.dandi/assets.json
31 -rw-r--r-- 1 dandi dandi 2865647 Mar 2 01:57 000409/.dandi/assets.json
32 -rw-r--r-- 1 dandi dandi 68345 Mar 2 01:57 000410/.dandi/assets.json
33 -rw-r--r-- 1 dandi dandi 2704 Mar 2 01:57 000411/.dandi/assets.json
34 -rw-r--r-- 1 dandi dandi 17645 Mar 2 01:57 000447/.dandi/assets.json
35 -rw-r--r-- 1 dandi dandi 2479 Mar 2 01:57 000448/.dandi/assets.json
```
and they all have `schemaVersion` 0.6.3

```shell
(dandisets) dandi@drogon:/mnt/backup/dandi/dandisets$ grep -L repository */.dandi/assets.json | xargs ls -l | grep -v '2 ... .. ' | sed -e 's,.* ,,g' | xargs grep -h -A1 schemaVersion | grep '^"0' | sort | uniq -c
   7453 "0.6.3",
```

and they are more recent. The point is that `repository` seems to be no longer filled in by the archive, although it should be.
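For anyone who wants to repeat this audit without the `grep` pipeline, here is a minimal Python sketch that tallies asset records by `schemaVersion` and by whether a `repository` key appears anywhere in the record. The function names are hypothetical, and the layout of `assets.json` (a JSON list of asset records) is an assumption, not something confirmed in this thread.

```python
import json
from collections import Counter
from pathlib import Path


def find_values(node, key):
    """Yield every value stored under `key` anywhere in a nested JSON structure."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            yield from find_values(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_values(item, key)


def audit_records(records):
    """Count records by (first schemaVersion found, has a `repository` value)."""
    counts = Counter()
    for record in records:
        # Approximation: take the first schemaVersion met in a depth-first walk.
        version = next(find_values(record, "schemaVersion"), None)
        has_repo = next(find_values(record, "repository"), None) is not None
        counts[(version, has_repo)] += 1
    return counts


def audit_tree(root):
    """Aggregate audit_records() over every <dandiset>/.dandi/assets.json under root."""
    counts = Counter()
    for path in sorted(Path(root).glob("*/.dandi/assets.json")):
        with path.open() as f:
            data = json.load(f)
        # Assumption: the file holds a list of asset records.
        records = data if isinstance(data, list) else [data]
        counts.update(audit_records(records))
    return counts
```

Calling `audit_tree("/mnt/backup/dandi/dandisets")` would return a `Counter` keyed by `(schemaVersion, has_repository)`. Unlike the shell pipeline it counts per record rather than per matching line, so the numbers need not match exactly.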

waxlamp commented 1 year ago

The dandiset creation service asks for "normalized" metadata for a new Dandiset: https://github.com/dandi/dandi-archive/blob/bffec7754522dcc7197bf4561b000740052bc7f6/dandiapi/api/services/dandiset/__init__.py#L28-L30

which in turn sets up some default metadata (to fill in things not supplied by the caller): https://github.com/dandi/dandi-archive/blob/bffec7754522dcc7197bf4561b000740052bc7f6/dandiapi/api/services/version/metadata.py#L7-L41

and then defers to dandischema for yet more default values: https://github.com/dandi/dandi-archive/blob/bffec7754522dcc7197bf4561b000740052bc7f6/dandiapi/api/services/version/metadata.py#L42-L46

If I understand correctly, the repository value should be auto-injected into this metadata, and the actual value to use is known to the archive codebase (through its deployment settings), but not to dandischema, is that right?

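If that's correct, then I think the right thing to do is:

  • add a default value for the repository field to _normalize_version_metadata() and allow that to be overridden by the dandiset creation service
  • run a manual backfill to put that same repository value into all dandisets that don't have anything set for that field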

@yarikoptic, does this sound right in terms of what needs to be done?

@AlmightyYakob, @danlamanna, is that a good implementation approach?
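The "default value that the caller can override" idea discussed in this thread can be sketched in plain Python. This is not the archive's actual `_normalize_version_metadata()`: the function name `with_default_repository` and the constant `WEB_APP_URL` are hypothetical stand-ins, and the precedence order is one possible reading of the proposal.

```python
# Hypothetical stand-in for a deployment setting; in the archive this would
# presumably be read from the Django settings rather than hard-coded.
WEB_APP_URL = "https://dandiarchive.org/"


def with_default_repository(version_metadata, repository=None):
    """Return a copy of the metadata with `repository` filled in.

    Precedence (an assumption): an explicit `repository` argument wins,
    then a value already present in the metadata, then the default.
    """
    result = dict(version_metadata)  # shallow copy; the input is not mutated
    if repository is not None:
        result["repository"] = repository
    else:
        result.setdefault("repository", WEB_APP_URL)
    return result
```

With this shape, a creation service that knows a different value passes it explicitly, while every other caller gets the deployment-wide default.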

jjnesbitt commented 1 year ago

If I understand correctly, the repository value should be auto-injected into this metadata, and the actual value to use is known to the archive codebase (through its deployment settings), but not to dandischema, is that right?

If that's correct, then I think the right thing to do is:

  • add a default value for the repository field to _normalize_version_metadata() and allow that to be overridden by the dandiset creation service
  • run a manual backfill to put that same repository value into all dandisets that don't have anything set for that field

  1. I'm out of the loop on this. What is the repository field supposed to represent? The schema says "Name of the repository in which the resource is housed", but it still seems unclear to me. Should that be dandiarchive.org for any asset in dandi?

  2. What is the value that's known to the archive codebase through deployment settings?

yarikoptic commented 1 year ago

@waxlamp

If I understand correctly, the repository value should be auto-injected into this metadata, and the actual value to use is known to the archive codebase (through its deployment settings), but not to dandischema, is that right?

correct

  • add a default value for the repository field to _normalize_version_metadata() and allow that to be overridden by the dandiset creation service

FWIW, in #1103 I did it in `_populate_metadata` and `_strip_metadata`. I haven't researched it again now.

  • run a manual backfill to put that same repository value into all dandisets that don't have anything set for that field

If it is done just in place in the metadata records, good. If it requires minting new assets, as would be needed if done via the API, then we will eventually do it as a result of addressing https://github.com/dandi/dandi-archive/issues/1450

@AlmightyYakob

  1. I'm out of the loop on this. What is the repository field supposed to represent? The schema says "Name of the repository in which the resource is housed", but it still seems unclear to me. Should that be dandiarchive.org for any asset in dandi?

```shell
(base) dandi@drogon:/mnt/backup/dandi/dandisets$ jq . 000003/.dandi/assets.json | grep reposito | uniq -c
    101       "repository": "https://dandiarchive.org/",
```

2. What is the value that's known to the archive codebase through deployment settings?

In #1103 I proposed using `settings.DANDI_WEB_APP_URL`; I assume it would match the value above.
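As a rough illustration of the "in place" backfill option, the sketch below sets `repository` only on metadata records that lack a non-empty value. It works on plain dictionaries; how the records would be loaded from and saved back to the archive database is not shown, and the function name `backfill_repository` is hypothetical.

```python
def backfill_repository(records, repository):
    """Set `repository` on each metadata dict that has no non-empty value.

    Mutates the dicts in place and returns how many were changed.
    Assumption: each record is a plain dict of asset or version metadata.
    """
    changed = 0
    for metadata in records:
        if not metadata.get("repository"):
            metadata["repository"] = repository
            changed += 1
    return changed
```

For example, `backfill_repository(records, "https://dandiarchive.org/")` would leave records that already carry a value untouched, so re-running it is harmless.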