allenai / satlas


Notes from first-time user #16

Open louisguitton opened 11 months ago

louisguitton commented 11 months ago

After chatting with @favyen2 , I had a look at the repo and started playing around. Here are some notes in case they prove useful to update documentation.

Notes

List of remotes

There are 3 Remotes with downloadable data:

Interacting with the S3 bucket

When deciding what to download, I needed to know the size of each item up front, partly to make sure I download the right thing, and partly to inform my choice of system (i.e. do I work from my Mac or from a cloud box). To see at a glance what you have published (training datasets and model weights) and their respective sizes, I ran:

aws s3 ls s3://ai2-public-datasets/satlas/ --human-readable
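To get a single total rather than a per-entry listing, the size column can be summed. This is a minimal sketch: sum_listing_bytes is a hypothetical helper, and it assumes the usual "date time size key" layout that aws s3 ls prints for objects when --human-readable is not set.

```shell
# Sum the size column (3rd field) of `aws s3 ls --recursive` output on stdin.
# Lines without a numeric 3rd field (for example "PRE prefix/") are skipped.
sum_listing_bytes() {
  awk '$3 ~ /^[0-9]+$/ { total += $3 } END { printf "%.0f\n", total }'
}

# Usage (needs the AWS CLI and network access):
#   aws s3 ls s3://ai2-public-datasets/satlas/ --recursive | sum_listing_bytes
```

aws s3 ls also accepts --summarize, which prints the total object count and total size after the listing.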

Interacting with the R2 bucket

Just like for the S3 bucket, I wanted to list the files present in the bucket with their sizes (datasets and model weights), especially as this Remote is apparently used for the fine-tuning tasks that interest me.

R2 is supposed to expose an S3 API (ref). Unfortunately, I was unable to get anywhere, and I don't get a helpful error message either, so I am stuck:

→ aws s3api list-buckets --endpoint-url https://pub-956f3eb0f5974f37b9228e0a62f449bf.r2.dev/

An error occurred () when calling the ListBuckets operation:

→ aws s3api list-objects-v2 --endpoint-url https://pub-956f3eb0f5974f37b9228e0a62f449bf.r2.dev --bucket satlas_explorer_datasets

An error occurred () when calling the ListObjectsV2 operation:
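If listing is not available, one fallback for learning the size of a known file is an HTTP HEAD request. This is a sketch under assumptions not confirmed in this thread: that the r2.dev URL serves objects over plain HTTP and that the server answers HEAD with a Content-Length header. content_length_from_headers is a hypothetical helper name.

```shell
# Print the Content-Length value found in HTTP response headers on stdin.
# If several header blocks are present (e.g. after redirects), the last wins.
content_length_from_headers() {
  tr -d '\r' | awk 'tolower($1) == "content-length:" { len = $2 }
                    END { if (len != "") print len }'
}

# Usage (needs curl and network access; URL is the archive you care about):
#   curl -sIL "$URL" | content_length_from_headers
```

This only works for objects whose full URL is already known; it does not replace a bucket listing.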

Minor gitignore tweak

Because the docs expect me to populate a models/ and a vis/ folder that are not tracked in git, I added those two folders to my local exclude file (.git/info/exclude). That way they stay untracked without my touching the committed .gitignore.
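The local exclude step can be scripted so that it is safe to run more than once. A minimal sketch, assuming it is run from the repository root; add_local_exclude is a hypothetical helper, not something shipped with the repo.

```shell
# Append each pattern to .git/info/exclude unless that exact line already exists.
# Must be run from the root of the working tree.
add_local_exclude() {
  exclude_file=".git/info/exclude"
  mkdir -p "$(dirname "$exclude_file")"
  touch "$exclude_file"
  for pattern in "$@"; do
    grep -qxF -- "$pattern" "$exclude_file" ||
      printf '%s\n' "$pattern" >> "$exclude_file"
  done
}

# Usage:
#   add_local_exclude models/ vis/
```

Patterns in .git/info/exclude use the same syntax as .gitignore but apply only to the local clone.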

Solar Farm model links?

According to the docs, models/solar_farm/best.pth is one of the artifacts present in the R2 bucket (https://pub-956f3eb0f5974f37b9228e0a62f449bf.r2.dev/satlas_explorer_datasets/satlas_explorer_datasets_2023-07-24.tar). Is there a way to download only that file directly and not the rest of the archive?
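One generic workaround, independent of whatever the maintainers recommend, is to stream the archive through tar and keep only the wanted member. This is a sketch: extract_member_from_stream is a hypothetical helper, and the member path inside the archive is an assumption (the real path may carry a top-level directory prefix, which tar -tf would show).

```shell
# Read a tar stream on stdin and extract only the named member into the
# current directory. The whole stream is still read, so this saves disk
# space rather than bandwidth.
extract_member_from_stream() {
  member="$1"
  tar -xf - "$member"
}

# Usage (needs curl and network access; the member path is assumed):
#   curl -sL "$ARCHIVE_URL" | extract_member_from_stream "models/solar_farm/best.pth"
```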

louisguitton commented 11 months ago

Just saw that my solar farm question was answered in #12.

srinify commented 8 months ago

@louisguitton would it be easier if we just migrated all the files over into the GitHub repo and used either Git LFS or XetHub to host the large files themselves? Then people don't have to juggle interacting with 3 different data sources / hosting providers.

When you run git clone ... or git pull ..., the large files will also appear locally along with the code, while GitHub just sees pointers / hashes. This would also eliminate the need to list models and datasets in the .gitignore file.

Proposed here: https://github.com/allenai/satlas/issues/25