bluesky-social / indigo

Go source code for Bluesky's atproto services.
https://atproto.com
Apache License 2.0

backfill: rev is not loaded from DB #629

Open mrd0ll4r opened 6 months ago

mrd0ll4r commented 6 months ago

I'm currently backfilling all repositories. I know it's possible to apply diffs from the firehose data, but for the moment I've chosen not to do that: I feel like there are a bunch more edge cases to consider, and for some reason my subscription to the firehose is somewhat flaky. So for now I'm downloading the repos using sync.getRepo. To speed things up, I'd like to avoid re-downloading repos for which I already have the latest revision locally. To that end, when listing repos using sync.listRepos, I compare the returned rev against the one stored in the (existing) gorm jobs to decide whether to re-enqueue them.
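
The check described above could look roughly like the following. This is a minimal sketch, not code from indigo: the function name shouldEnqueue and the sample rev strings are hypothetical, and it assumes that an empty rev on either side means "unknown", so the repo is downloaded again.

```go
package main

import "fmt"

// shouldEnqueue reports whether a repo should be (re-)downloaded.
// localRev is the rev stored with the existing job ("" if none is known),
// remoteRev is the rev reported by sync.listRepos.
// Hypothetical helper, not part of indigo.
func shouldEnqueue(localRev, remoteRev string) bool {
	// If either side is unknown we cannot prove the local copy is current.
	if localRev == "" || remoteRev == "" {
		return true
	}
	// Otherwise re-download only when the revisions differ.
	return localRev != remoteRev
}

func main() {
	fmt.Println(shouldEnqueue("3kabc", "3kabc")) // false: already up to date
	fmt.Println(shouldEnqueue("3kabc", "3kabd")) // true: remote has moved on
	fmt.Println(shouldEnqueue("", "3kabd"))      // true: local rev unknown
}
```

Treating an unknown rev as "needs download" is the conservative choice: it costs extra bandwidth but never skips a repo that has changed.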

When loading a job from the database, some fields are filled, but the rev field is not (see https://github.com/bluesky-social/indigo/blob/main/backfill/gormstore.go#L214-L225). This makes it impossible to check whether we already downloaded the repo at that revision without looking at the downloaded data, which is annoying.
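
To illustrate the kind of change I have in mind, here is a sketch with stand-in types. The names jobRow, job, and loadJob are hypothetical and deliberately do not mirror the actual structs in gormstore.go; the only point is that the loader copies the persisted rev along with the other fields.

```go
package main

import "fmt"

// jobRow is a hypothetical stand-in for the persisted job record.
type jobRow struct {
	Repo  string
	State string
	Rev   string
}

// job is a hypothetical stand-in for the in-memory job.
type job struct {
	repo  string
	state string
	rev   string
}

// loadJob builds the in-memory job from the persisted row,
// copying every field, including the rev.
func loadJob(row jobRow) *job {
	return &job{
		repo:  row.Repo,
		state: row.State,
		rev:   row.Rev, // the field reported as not being loaded
	}
}

func main() {
	j := loadJob(jobRow{Repo: "did:example:alice", State: "complete", Rev: "3kabc"})
	fmt.Println(j.repo, j.state, j.rev)
}
```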

To make matters worse, sync.listRepos seems to return empty strings for the rev field. The head field is populated -- would it be a better approach to use that to check whether we already have the latest version of a repo downloaded? See, e.g.:

$ curl -L -X GET 'https://bsky.network/xrpc/com.atproto.sync.listRepos?limit=10&cursor=' -H 'Accept: application/json' | jq '.repos[].rev'
""
""
""
""
""
""
""
""
""
""

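One way to cope with the empty rev values would be to fall back to head when rev is missing. The following is only a sketch under that assumption: repoVersion and sameVersion are hypothetical names, the sample values are made up, and it presumes that a head value is stored locally alongside the job.

```go
package main

import "fmt"

// repoVersion holds the two version markers returned by sync.listRepos.
// Hypothetical type, not part of indigo.
type repoVersion struct {
	Head string // CID of the current repo commit
	Rev  string // revision; observed to be empty in listRepos responses
}

// sameVersion reports whether local and remote refer to the same repo state.
// It prefers rev and falls back to head when rev is missing on either side.
func sameVersion(local, remote repoVersion) bool {
	if local.Rev != "" && remote.Rev != "" {
		return local.Rev == remote.Rev
	}
	if local.Head != "" && remote.Head != "" {
		return local.Head == remote.Head
	}
	// Nothing comparable: assume the versions differ.
	return false
}

func main() {
	local := repoVersion{Head: "bafyexample1", Rev: "3kabc"}
	remote := repoVersion{Head: "bafyexample1", Rev: ""}
	fmt.Println(sameVersion(local, remote)) // true: heads match, rev missing remotely
}
```
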
Also (sorry, this devolved into three issues in one): I just noticed the generated curl sample on the docs website is wrong. It shows as:

curl -L -X GET 'https://bsky.social/xrpc' \
-H 'Accept: application/json'

without a method name in the URL.

Sorry for the triple issue!

bnewbold commented 5 months ago

These all sound like legit issues and developer papercuts. Thanks for reporting!

If you have some small quick-fix PRs with no performance concerns, we might be able to get those in, but we are juggling a bunch of priorities and work streams and it might take a while.

Could you open a separate issue in the docs site repo about the curl examples being incomplete? Thanks!