Smithsonian / OpenAccess

Smithsonian Open Access Data Repository
https://www.si.edu/openaccess
Creative Commons Zero v1.0 Universal

Per-unit metadata repositories? #7

Closed straup closed 3 years ago

straup commented 4 years ago

Cloning this repository has gotten progressively more difficult and time-consuming. Three times yesterday I tried to pull from the master branch and each time the update failed with a network error along the lines of:

> git pull origin master
remote: Enumerating objects: 32703, done.
remote: Counting objects: 100% (32652/32652), done.
remote: Compressing objects: 100% (24562/24562), done.
Receiving objects:  24% (7360/29590), 2.86 GiB | 418.00 KiB/s   

Receiving objects:  24% (7361/29590), 2.86 GiB | 395.00 KiB/s

client_loop: send disconnect: Broken pipe46 GiB | 51.00 KiB/s  
fatal: the remote end hung up unexpectedly
fatal: early EOF
fatal: index-pack failed

This is not uncommon with very large repositories on GitHub; it matches the experience with the Who's On First data repositories.

I am currently trying a fresh shallow clone (--depth 1) to see whether that works any better, but realistically I don't expect it to.

Since the data is already broken up into unit-specific subdirectories, would you be open to splitting it into per-SI-unit repositories to make it easier to clone?

This would also make it easier and faster to work with specific collections rather than having to clone all of the Smithsonian from the outset.

straup commented 4 years ago

Just to follow up, as expected:

> git clone --depth 1 git@github.com:Smithsonian/OpenAccess.git
Cloning into 'OpenAccess'...
remote: Enumerating objects: 7163, done.
remote: Counting objects: 100% (7163/7163), done.
remote: Compressing objects: 100% (7161/7161), done.
client_loop: send disconnect: Broken pipe GiB | 1020.00 KiB/s  
fatal: the remote end hung up unexpectedly
fatal: early EOF
fatal: index-pack failed
andrewgunther commented 4 years ago

No plans to divide up the repositories. If GitHub is causing you grief, I suggest you use the files posted to AWS S3 instead. The files are .txt (line-delimited JSON) with the same hash structure: two characters, 0-9a-f. Note that we lower-cased the directory names. Data is updated every week (check the headers). Sample: https://smithsonian-open-access.s3-us-west-2.amazonaws.com/metadata/edan/chndm/0f.txt (Last-Modified: Mon, 21 Sep 2020 10:48:31 GMT). Current directories: aaa acm chndm fs fsg hmsg naa nmaahc nmah nmnhanthro nmnhbotany nmnhento nmnhherps nmnhmammals nmnhpaleo npm saam sia acah cfchfolklife fbr fsa hac hsfa nasm nmafa nmai nmnhbirds nmnheducation nmnhfishes nmnhinv nmnhminsci npg nzp si sil
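The layout described above (per-unit directories of line-delimited JSON shards) can be consumed directly from S3. A minimal sketch in Python, using only the standard library; the base URL comes from the comment above, while `iter_records` and `fetch_shard` are my own illustrative helpers, not part of any official tooling:

```python
import json
import urllib.request

BASE = "https://smithsonian-open-access.s3-us-west-2.amazonaws.com/metadata/edan"

def iter_records(text):
    """Yield one parsed JSON object per non-empty line of a shard file."""
    for line in text.splitlines():
        line = line.strip()
        if line:
            yield json.loads(line)

def fetch_shard(unit, shard):
    """Download one two-hex-character shard for a unit, e.g. ('chndm', '0f')."""
    url = f"{BASE}/{unit}/{shard}.txt"
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8")

if __name__ == "__main__":
    # Inline sample so the demo works without a network call:
    sample = '{"id": "a"}\n{"id": "b"}\n'
    print([r["id"] for r in iter_records(sample)])  # ['a', 'b']
```

In practice you would call `fetch_shard("chndm", "0f")` and feed the result to `iter_records`.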

straup commented 4 years ago

Just to confirm:

The available files in a given unit directory are named with two lowercase hexadecimal characters (each drawn from 0-9a-f) plus a trailing .txt extension?

Is there a public index of all the available subdirectories and all their child files published somewhere? It doesn't seem like the AWS bucket has public directory listings.

andrewgunther commented 4 years ago

I realized this morning that it would be good to have a directory of sorts, or I'll turn on directory listings; I'll follow through on that. In the meantime, every directory will have 256 files, and yes, you are correct about the naming: every two-character lowercase hex combination, 00.txt through ff.txt.
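Since the shard names are just every two-character lowercase hex string, the full list can be generated rather than hard-coded. A quick sketch (my own helper, not part of the repo):

```python
def shard_filenames():
    """All 256 per-unit shard files: '00.txt' through 'ff.txt'."""
    return [f"{i:02x}.txt" for i in range(256)]

names = shard_filenames()
print(len(names), names[0], names[-1])  # 256 00.txt ff.txt
```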

straup commented 4 years ago

Also, is there a specific reason the compressed files aren't also uploaded to S3? For the purposes of saving bandwidth, on both ends, and all that good stuff.

andrewgunther commented 4 years ago

The files submitted to AWS are part of the AWS Open Data program, and I think Amazon Athena wants them uncompressed. We also planned to upload Parquet-formatted files. Since the data set is small, I can upload the compressed versions as well.

thisisaaronland commented 3 years ago

Hi,

Just checking on the status of public directory listings? They don't seem to be enabled yet:

https://smithsonian-open-access.s3-us-west-2.amazonaws.com/metadata/edan/chndm/

andrewgunther commented 3 years ago

At this time we are unable to change ACLs to support directory listings. Instead, an index.txt file was created listing all the files: https://smithsonian-open-access.s3-us-west-2.amazonaws.com/metadata/edan/chndm/index.txt
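With an index.txt in place, per-file URLs for a unit can be derived from it. A sketch under the assumption (not confirmed above) that index.txt is a whitespace/newline-separated list of filenames; `file_urls` is my own illustrative helper:

```python
BASE = "https://smithsonian-open-access.s3-us-west-2.amazonaws.com/metadata/edan"

def file_urls(unit, index_text):
    """Turn the contents of a unit's index.txt into absolute S3 URLs."""
    return [f"{BASE}/{unit}/{name}" for name in index_text.split() if name]

print(file_urls("chndm", "00.txt\n01.txt\n")[0])
```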

thisisaaronland commented 3 years ago

Okay, good to know. Thanks.

thisisaaronland commented 3 years ago

It seems as though most of the unit directories are missing a corresponding index.txt file. I have code to generate the list of 2-character files (the ones listed in index.txt) manually, so it's not a huge problem, but I thought you'd like to know.

> go run -mod vendor cmd/emit/main.go -bucket-uri 's3://smithsonian-open-access?region=us-west-2' metadata
2020/11/18 23:13:39 Failed to open metadata/edan/acah/index.txt
2020/11/18 23:13:39 Failed to open metadata/edan/acm/index.txt
2020/11/18 23:13:39 Failed to open metadata/edan/cfchfolklife/index.txt
2020/11/18 23:13:40 Failed to open metadata/edan/fbr/index.txt
2020/11/18 23:13:40 Failed to open metadata/edan/fsa/index.txt
2020/11/18 23:13:40 Failed to open metadata/edan/fsg/index.txt
2020/11/18 23:13:40 Failed to open metadata/edan/hac/index.txt
2020/11/18 23:13:40 Failed to open metadata/edan/hmsg/index.txt
2020/11/18 23:13:40 Failed to open metadata/edan/hsfa/index.txt
2020/11/18 23:13:40 Failed to open metadata/edan/naa/index.txt
2020/11/18 23:13:40 Failed to open metadata/edan/nasm/index.txt
2020/11/18 23:13:41 Failed to open metadata/edan/nmaahc/index.txt
2020/11/18 23:13:41 Failed to open metadata/edan/nmah/index.txt
2020/11/18 23:13:41 Failed to open metadata/edan/nmai/index.txt
2020/11/18 23:13:41 Failed to open metadata/edan/nmafa/index.txt
2020/11/18 23:13:41 Failed to open metadata/edan/nmnhanthro/index.txt
2020/11/18 23:13:41 Failed to open metadata/edan/nmnhbirds/index.txt
2020/11/18 23:13:41 Failed to open metadata/edan/nmnhbotany/index.txt
2020/11/18 23:13:41 Failed to open metadata/edan/nmnheducation/index.txt
2020/11/18 23:13:41 Failed to open metadata/edan/nmnhento/index.txt
2020/11/18 23:13:42 Failed to open metadata/edan/nmnhfishes/index.txt
2020/11/18 23:13:42 Failed to open metadata/edan/nmnhherps/index.txt
2020/11/18 23:13:42 Failed to open metadata/edan/nmnhinv/index.txt
2020/11/18 23:13:42 Failed to open metadata/edan/nmnhmammals/index.txt
2020/11/18 23:13:42 Failed to open metadata/edan/nmnhminsci/index.txt
2020/11/18 23:13:42 Failed to open metadata/edan/nmnhpaleo/index.txt
2020/11/18 23:13:42 Failed to open metadata/edan/npg/index.txt
2020/11/18 23:13:42 Failed to open metadata/edan/npm/index.txt
2020/11/18 23:13:42 Failed to open metadata/edan/saam/index.txt
2020/11/18 23:13:43 Failed to open metadata/edan/si/index.txt
2020/11/18 23:13:43 Failed to open metadata/edan/sia/index.txt
2020/11/18 23:13:43 Failed to open metadata/edan/sil/index.txt
andrewgunther commented 3 years ago

Yes; you'd been focused on Cooper Hewitt, so I only did that one. The next push (Sunday) will address all the other units.

thisisaaronland commented 3 years ago

Cooper Hewitt was just an example. For good or bad, I am focused on all of it :D :D :D

https://github.com/aaronland/go-smithsonian-openaccess/

Like I said, I have code to generate the list of files on the fly, but an index file is a nice-to-have and probably good practice for this sort of thing.

thisisaaronland commented 3 years ago

FYI, now with support for retrieving data from the smithsonian-open-access S3 bucket:

https://github.com/aaronland/go-smithsonian-openaccess#data-sources

andrewgunther commented 3 years ago

Index files have been added for all unit directories. In addition, an index.txt file has been added at https://smithsonian-open-access.s3-us-west-2.amazonaws.com/metadata/edan/index.txt

thisisaaronland commented 3 years ago

That's super helpful. Thanks!