techman83 commented 8 years ago

Proposal

Similar to #935, we can upload FOSS mods to the Internet Archive. There is example code over at KSP-CKAN/NetKAN-bot for uploading to the IA, the sticking point is the resulting uploaded file.

An Item per Mod Version made sense to me, but open to ideas. As the metadata per version could change.

The big rub is the filename. We have options, that have consequences.

Use similar method to cache, Generate hash from URL and derive the filename
- :+1: We already do it in a similar manner, simple to implement
- :+1: Easy
- :+1: Requires no altering of the metadata
- :-1: Client needs to derive the url
- :-1: Download URL changes, mod gets re-mirrored
- :-1: Major mod hosting site goes offline, a lot of mods get re-mirrored.
- :-1: If a file changes it won't get re-uploaded
Generate a hash from the identifier/version, derive the filename
- :+1: Easy
- :+1: Requires no altering of the metadata
- :+1: Download url changes, file is not re-uploaded.
- :-1: Client needs to derive the url
- :-1: If a file changes it won't get re-uploaded
Generate a hash on the downloaded file and add it to the metadata
- :+1: We have a hash of the file
- :+1: Client doesn't have to derive the filename
- :-1: Requires altering the metadata
- :-1: Requires spec change
- :+1: :-1: If a file changes we could re-upload it and add it to the metadata. This may not always be desirable.

I came up with 2 whilst I was writing this.. It's the option I'm currently leaning towards. Pinging @dbent @pjf being more across internals of the Client code base and @Dazpoet @plague006 @politas being across the metadata.

Implementation

Option 3 was decided on.

Phase 1

[x] Get NetKAN to produce the field
[x] Start uploading things. 95% of the code is already done, just add a webhook shim and watch CKAN-meta for changes.
[x] Scan for any hooks we missed daily with git diff $(git rev-list -n1 --before="yesterday" master) --name-only KSP-CKAN/NetKAN-bot#49

Phase 2

[x] Implement whatever we decide with CKAN-meta/NetKAN amalgamation - #3890
[ ] Rearchitect the indexer (I have some solid ideas).
[ ] Crawling head/api checks all day
[ ] Full inflation daily on NetKAN_repo/NetKAN/
[ ] Continue to inflate on demand (Captures both NetKAN + CKAN-meta)
[ ] ~~Scan NetKAN_repo for any changes we missed git diff $(git rev-list -n1 --before="yesterday" master) --name-only -p NetKAN_repo/CKAN-meta/~~ covered by full inflation as all former ckans are treated the same way as netkans.

Phase 3 (Though technically this can be done anytime)

[ ] CKAN Schema update
[ ] CKAN Client implementation

Phase 4

[x] Slowly crawl the entire NetKAN_repo/CKAN-meta
- 6088 mods mirrored
- 6476 mods with download hashes
- 356 mods which don't allow mirriong
- 16 mods with hashes but dead urls
- 137 mods which failed to inflate (could be dead url or changed zip)

dbent commented 8 years ago

I'm in favor of option 3 for the following reasons:

It's the most elegant technically and solves all the issues related to identifying a particular file not just a URL or identifier+version.
It would be trivial to implement in NetKAN.
BONUS: It's something we want to do eventually anyway.

As far as the downsides:

Requires altering the metadata: No big deal I think, we do it all the time.
Requires spec change: Technically true, but not a backwards-incompatible change. It's a limitation of the way we currently version the spec that we can't represent backwards-compatible and backwards-incompatible changes to the spec. By which I mean, if the spec goes from X -> Y to add a hash it's not like a client that could install X can no longer install Y, they just won't know what to do with the hash. I've thought about how to handle changes like this in my ideas for "CKAN2" but for now the simplest option would be to stick the hash in an x property, i.e. x_hash as "extra" metadata so we don't need a spec change.
If a file changes we could re-upload it and add it to the metadata. This may not always be desirable.: Why wouldn't it be desirable? Right now file changes are totally invisible, at least if we track the hash we would notice.

A downside you didn't mention:

Complicates adding metadata manually, i.e. directly to CKAN-meta: Simple! We stop adding things directly to CKAN-meta, instead we add everything to NetKAN. All .ckan files are also valid .netkan files, you can just stick the manually created .ckan file in the NetKAN repo with a .netkan extension and NetKAN will happily generate a .ckan with the static information... but also with an automatically generated hash value.

mheguy commented 8 years ago

Complicates adding metadata manually

A couple hours ago @techman83 made changes to the indexer such that on merges/commits to the NetKAN repo, inflation of the changes is done automatically and immediately. This all but eliminates the need to touch CKAN-meta (aside from modifying earlier versions of mods).

Further, from a personal workflow perspective I would find it far more efficient if all mods/identifiers were available in the NetKAN directory.

techman83 commented 8 years ago

@dbent Excellent points and I'm pleased you've chimed in!

It would be trivial to implement in NetKAN.

This gets current releases, but that's really cool and not something I'd thought of. I could have the mirrorbot look for the hash and upload the file.It would make noting the hash change cause a re-upload which saves trying to figure out if a Mod needs re-uploading.

If the CKAN client will ignore the extra metadata for now, I'd be all for adding a 'download_hash' field or something of that ilk and worrying about how the client implements it later.

As a minor implementation detail, if you could have NetKAN cache the file with the hash filename it would save us some double handling and also solves the infrequent need for me to login in and remove faulty cached downloads.

Right now file changes are totally invisible, at least if we track the hash we would notice.

Fair point, I was thinking if we end up with a dodgy zip replacing a good one. But the old one will still be on the Archive with a different hash, we could add 'x-netkan-hash-override' for exceptions if we go to having NetKAN do everything. (This also solidifies my Item per version line of thinking).

Complicates adding metadata manually, i.e. directly to CKAN-meta: Simple! We stop adding things directly to CKAN-meta, instead we add everything to NetKAN. All .ckan files are also valid .netkan files, you can just stick the manually created .ckan file in the NetKAN repo with a .netkan extension and NetKAN will happily generate a .ckan with the static information... but also with an automatically generated hash value.

It would require an indexing strategy change as inflating ~6000 mods hourly won't scale. However we could create a CKAN-meta-manual (or put a CKAN-meta folder into the NetKAN repo) and put our current set of CKANs into it. That would take care of our initial run of populating the hash of all the files. We can then just use Webhooks to inflate new stuff pushed into it on demand.

We might want to cancel that Jenkins job though :laughing:

An implementation strategy could be:

Phase 1

Get NetKAN to produce the field
Start uploading things. 95% of the code is already done, just add a webhook shim and watch CKAN-meta for changes.
Scan for any hooks we missed daily with git diff $(git rev-list -n1 --before="yesterday" master) --name-only

Phase 2

Implement whatever we decide with CKAN-meta/NetKAN amalgamation
Rearchitect the indexer (I have some solid ideas).
- Crawling head/api checks all day
- Full inflation daily on NetKAN_repo/NetKAN/
- Continue to inflate on demand (Captures both NetKAN + CKAN-meta)
- Scan NetKAN_repo/CKAN-meta for any changes we missed git diff $(git rev-list -n1 --before="yesterday" master) --name-only -p NetKAN_repo/CKAN-meta/

Phase 3 (Though technically this can be done anytime)

CKAN Schema update
CKAN Client implementation

Phase 4

Slowly crawl the entire NetKAN_repo/CKAN-meta

Can I just say, I <3 your idea @dbent

dbent commented 8 years ago

@techman83

This gets current releases

Yeah, back-filling the data is an exercise left to the reader. (Although at some point I do want to implement NetKAN support for dumping metadata for all releases, not just the latest).

If the CKAN client will ignore the extra metadata for now, I'd be all for adding a 'download_hash' field or something of that ilk and worrying about how the client implements it later.

Yeah, that can be done.

As a minor implementation detail, if you could have NetKAN cache the file with the hash filename it would save us some double handling and also solves the infrequent need for me to login in and remove faulty cached downloads.

Absolutely, something I've wanted to do anyway.

It would require an indexing strategy change as inflating ~6000 mods hourly won't scale.

Is ~6000 the number of total .ckan we have? I'm talking about only having one .netkan for each mod no matter what. So for a "manual" mod, instead of creating a new .ckan directly, we update its .netkan:

Initial AwesomeMod.netkan:

{
  "identifer": "AwesomeMod",
   "version": "1.0",
   "download": "http://awesomemod.example/download/1.0"
}

Which would spit out AwesomeMod-1.0.ckan in CKAN-meta.

After a manual update AwesomeMod.netkan:

{
  "identifer": "AwesomeMod",
   "version": "2.0",
   "download": "http://awesomemod.example/download/2.0"
}

Which would spit out AwesomeMod-2.0.ckan in CKAN-meta. But now we get NetKAN automatically generating the hash for us.

From a rough look there seems to be about ~1400 individual mods in CKAN-meta, versus ~1200 individual mods in NetKAN, so a ~17% increase in NetKAN indexing.

If we can create an endpoint for GitHub webhooks and then have authors authorize the NetKAN application to get webhook events, that plus the existing SpaceDock hooks, we could probably scale down our full NetKAN indexing to daily or semi-daily runs.

Phases seem good.

techman83 commented 8 years ago

Yeah, back-filling the data is an exercise left to the reader. (Although at some point I do want to implement NetKAN support for dumping metadata for all releases, not just the latest).

I have ideas for that. So all good!

Is ~6000 the number of total .ckan we have? I'm talking about only having one .netkan for each mod no matter what. So for a "manual" mod, instead of creating a new .ckan directly, we update its .netkan

Ohhhh I see! That makes more sense and would be ok even with the batch indexer as it sits right now. We could configure jenkins to fail all builds and direct people to NetKAN. This can happen anytime.

5620 total ckans and 1188 orphaned by kerbalstuff, the ~6000 was my rough estimate of how many FOSS ckans there are, I was wildly off by the looks :laughing:

If we can create an endpoint for GitHub webhooks and then have authors authorize the NetKAN application to get webhook events, that plus the existing SpaceDock hooks, we could probably scale down our full NetKAN indexing to daily or semi-daily runs.

The only issues I see are that it would involve authors configuring webhooks and us somehow mapping them and we wouldn't be able to authenticate the hooks.

With 1400 mods, we could crawl 23/minute with a lite scan and have the whole lot checked hourly without belting the endpoints.

techman83 commented 8 years ago

And we have a collection to house them all in now!

https://archive.org/details/kspckanmods

mheguy commented 8 years ago

As an aside if we did end up with too many unique .netkans after combining traditional netkans and the ckans-come-netkans we can use 2 extension types (or a field within the file or whatever) that differentiates between metadata that should be inflated/checked routinely and metadata that only needs to be checked when there's a change to the metadata.

techman83 commented 8 years ago

@plague006 we could easily stick them in a separate Directory, then let the Webhooks sort it out.

dbent commented 8 years ago

The only issues I see are that it would involve authors configuring webhooks...

It would mostly involve us creating an "application" and then providing a link for authors to authorize us and then we'd setup the webhook ourselves. Manual configuration shouldn't be necessary, just click a link, and then click an authorize button.

...and us somehow mapping them...

This could be done by looking at $kref for the associated GitHub repository. How performant this would be and whether or not we'd need to somehow cache the mapping I cannot say.

...and we wouldn't be able to authenticate the hooks.

If we set a secret in the webhook when we create it GitHub will use HMAC to sign any payloads which we could then verify.

techman83 commented 8 years ago

It would mostly involve us creating an "application" and then providing a link for authors to authorize us..

Good point, that wouldn't be super difficult.

How performant this would be and whether or not we'd need to somehow cache the mapping I cannot say.

We already scan all the metadata every test. The JSON::XS library is pretty darn fast, scanning through all the netkans takes pretty little time at all, we could even just store it in memory if we wanted to (1400 items is nothing really).

foreach my $shortname (keys %files) {
    my $metadata = read_netkan($files{$shortname});
    if ( $metadata->{'$kref'} eq '#/ckan/spacedock/234' ) {
      say "Found";
    }
}

Takes no time at all on my old laptop.

leon@ultra ~/git/ckan/NetKAN $ time perl t/scanall.pl 
Found

real    0m0.321s
user    0m0.260s
sys 0m0.051s

If we set a secret in the webhook when we create it

Yeah, we're already verifying the WebHooks we currently receive using HMAC.

I also had a thought about the hash filename, how are we going to get that hash without re-downloading the file every time we inflate? We have 8.6GB cached recent files, redownloading that regularly might be a little unfriendly to our AWS credits and the hosts we're scraping.

If we move entirely to WebHooks/api/head checks and only inflate exceptions this would reduce the impact significantly.

politas commented 8 years ago

I like the idea of only making netkans, so CKAN-meta is only populated automatically. I already use that technique when creating CKANs locally.

techman83 commented 8 years ago

I threw something together to generate hashes, the only thing left to cover off is file extensions. Though looking at the API, we can just loop through the files looking for a matching sha1 (testing shows it matches what GetFileHash generates, except lowercase).

{
    "created": 1461702894,
    "d1": "ia601506.us.archive.org",
    "d2": "ia801506.us.archive.org",
    "dir": "/21/items/AdvancedJetEngine-v2.6",
    "files": [
        {
            "crc32": "77e2c2a7",
            "format": "ZIP",
            "md5": "d3587eb415d144e9468f4b75d468c4d4",
            "mtime": "1461471119",
            "name": "90ADF637-AdvancedJetEngine-v2.6.zip",
            "sha1": "1bc6c5ee2827c33ca3ec0a6659a92be03fd574ae",
            "size": "635744",
            "source": "original"
        },
<snip>
    ],
    "files_count": 5,
<snip>
}

@dbent thoughts? We could write a FileExtTransformer or just hit up the API if we want to look for a download url.

dbent commented 8 years ago

@techman83 So if I'm understanding correctly... a download URL from IA looks like:

https://archive.org/download/AdvancedJetEngine-v2.5.4/DB37AFAF-AdvancedJetEngine-v2.5.4.zip

Which CKAN would determine by using the following values from the metadata:

https://archive.org/download/<identifier>-<version>/<sha1-hash>.zip

Is the plan to have the filename just be the hash, identifier+hash, identifier+version+hash?

So the only other bit we need is what the file extension will be? Am I understanding that correctly?

In that case I'm open to two ideas:

Hit up the API and just search for the matching file as you said above. I'm assuming the API call is on a per "item" basis so it's not too expensive to do, correct?
Instead of storing the file extension in the metadata, store the media type (since its more generally useful) and then choose a canonical extension for each media type. So, ZIP files have the media type application/zip and we'd choose the canonical extension of .zip.

{
  "download_content_type": "application/zip"
}

techman83 commented 8 years ago

Is the plan to have the filename just be the hash, identifier+hash, identifier+version+hash?

Below is how it currently is, though I'm not tied to it. Being the internet archive, having somewhat human readable names seemed logical.

https://archive.org/download/<identifier>-<version>/<shortened-sha1-hash>-<identifier>-<version>.<ext>

Yes you got it. Either option works, it's a single api call to get all the metadata about the item

leon@ultra /tmp $ curl https://archive.org/metadata/AdvancedJetEngine-v2.5.4/
{"created":1461702893,"d1":"ia601501.us.archive.org","d2":"ia801501.us.archive.org","dir":"/29/items/AdvancedJetEngine-v2.5.4","files":[{"name":"DB37AFAF-AdvancedJetEngine-v2.5.4.zip","source":"original","mtime":"1461471083","size":"562133","md5":"b1b8147ef1d5142854484d4a5d14c561","crc32":"7ce98abc","sha1":"57f18ee5abf67614b2401c26ac75ff515dbba659","format":"ZIP"},{"name":"AdvancedJetEngine-v2.5.4_meta.sqlite","source":"original","mtime":"1461471095","size":"9216","md5":"ff60a603549df662d5342dd1c98f0a64","crc32":"b200e9a6","sha1":"bd9175481af97dc7c147e05fc876e5972fd73580","format":"Metadata"},{"name":"AdvancedJetEngine-v2.5.4_meta.xml","source":"original","mtime":"1461702891","size":"1152","format":"Metadata","md5":"c10816051dd5e1a8fa20af607e3de1a7","crc32":"85c53b5f","sha1":"30ac43c19c6df204a9b775acec35182c6a4ea87e"},{"name":"AdvancedJetEngine-v2.5.4_archive.torrent","source":"metadata","btih":"fcabeb836897c7a34da20e3cb2124869dac1a22c","mtime":"1461702892","size":"1799","md5":"846e30be51c4ad2ff82c9ba992cedbf6","crc32":"98362ca3","sha1":"f979e39e5024ad7025380d20fe066ab59e00c7e5","format":"Archive BitTorrent"},{"name":"AdvancedJetEngine-v2.5.4_files.xml","source":"original","format":"Metadata","md5":"3b8de6eff7a0646683d5fd14331e5f84"}],"files_count":5,"item_size":574300,"metadata":{"identifier":"AdvancedJetEngine-v2.5.4","collection":["kspckanmods","software"],"creator":"camlost","description":"Realism for turbojet, turbofans, air-breathing rockets, propellers, and rotors in KSP.<br /><br />Homepage: <a href=\"http://forum.kerbalspaceprogram.com/threads/70008\" rel=\"nofollow\">http://forum.kerbalspaceprogram.com/threads/70008</a><br />Repository: <a href=\"https://github.com/camlost2/AJE\" rel=\"nofollow\">https://github.com/camlost2/AJE</a><br />License(s): LGPL-2.1","licenseurl":"https://www.gnu.org/licenses/old-licenses/lgpl-2.1.html","mediatype":"software","subject":"ksp; kerbal space program; mod","title":"Advanced Jet Engine - v2.5.4","publicdate":"2016-04-24 04:11:24","uploader":"archive.org@ksp-ckan.org","addeddate":"2016-04-24 04:11:24","curation":"[curator]validator@archive.org[/curator][date]20160424041644[/date][comment]checked for malware[/comment]"},"server":"ia801501.us.archive.org","uniq":1973401572,"updated":1461720798,"workable_servers":["ia801501.us.archive.org","ia601501.us.archive.org"]}

Option 2 is worth considering, because we might end up with multiple mirrors one day and the other mirrors might not have a nice api for us to get the required information. Media type is sensible, easy enough to derive extensions from that, the NetKAN code already does it.

techman83 commented 8 years ago

@dbent @pjf @KSP-CKAN/wranglers

The majority of 'Phase 1' is complete. NetKAN produces the Download hashes for the Mirror library to consume. The @KSP-CKAN/wranglers have done an awesome job standardizing our metadata in KSP-CKAN/NetKAN#3890 and as of KSP-CKAN/NetKAN-bot#38 the bots are uploading license compliant CKANs to the Internet Archive.

Current workflow looks like: NetKAN Inflated (via Indexer, Spacedock Webhook or GitHub Webhook) -> NetKAN Commits to CKAN-meta -> CKAN Uploaded after Webhook Notification on CKAN-meta (If license compliant + download_hash exists)

@dbent I've had more time to think, it's likely we'll build a resolver for whichever mirror backend and cycle through them. Using the API allows us to not be dependent on file names, the SHA1 is always going to resolve to the correct file.

techman83 commented 8 years ago

@dbent @pjf @KSP-CKAN/wranglers

Phase 1 is completed and Phase 4 is in progress. I setup a separate bot account called 'kspckan-crawler' so we can easily tell which commits the crawler has performed. You will notice a lot of changes go through over the next few weeks. It's currently checking 2 mods every 5 minutes, so at the current rate somewhere around 17 days to go over the entire collection. Probably sooner as there were a large number of mods with download hashes, but not mirrored and those are being checked separately at the same rate (2 mods every 5 minutes).

dbent commented 8 years ago

@techman83 Excellent, and I see you fixed the bot name. :smile:

We really ought to give the bots the same avatar as the KSP-CKAN org: https://avatars1.githubusercontent.com/u/9023970?v=3&s=200

techman83 commented 8 years ago

@dbent I don't know what you're talking about - maybe @politas does :laughing:

That's an awesome idea. Seems avatar uploads are currently broken :hammer:

techman83 commented 8 years ago

@dbent @KSP-CKAN/wranglers

ckan-notices should be a lot quieter now. The crawler has completed its task.

Ruedii commented 7 years ago

We should probably also add that developers can add several additional download links instead of just one.

This would work in combination with this archiving technique.

politas commented 7 years ago

An interesting case has come up. We've had the wrong licence for a bunch of Snark's mods since around October last year. KSP-CKAN/NetKAN#5512 is the start of fixing up the licence info in CKAN, but we should update the licence info on archive.org. Is that possible? Is there a way to automate that?

techman83 commented 7 years ago

@politas It is definitely possible, automating it would be reasonably straight forward. Knowing when to trigger that update is the trickier part.

KSP-CKAN / CKAN

Upload FOSS mods to the Internet Archive, allow clients to fallback #1682

Proposal

Implementation