Whoops
I have a cron job which runs every second day; I've had this running for a couple of years now. But it seems to have failed, and I hadn't noticed :( I think it takes about 10 hours to run.
If you have any suggestion I would love to hear :smile:
I could probably tell my cron job to send me an email when it fails.
I suppose the biggest problem is rate limiting from GH, since with npm at ~50,000 packages and a request limit of 5,000 per hour, that means at minimum 10 hours... yes, I see the problem :)!
One possible option is to use GitHub's conditional requests + ETags, which apparently don't count towards the rate limit if you receive a 304. So I guess the utility would have to maintain metadata for the GH requests, and then use that to continuously query. Complicated.
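Something like this is what I'm imagining (completely untested sketch; `fetchRepo` and the `etagCache` map are made-up names, and it assumes a runtime with a global `fetch`):

```ts
// Rough sketch of conditional GitHub requests using ETags.
// `etagCache` stands in for whatever metadata store nipster would keep between runs.
const etagCache = new Map<string, { etag: string; body: unknown }>();

async function fetchRepo(fullName: string, token?: string) {
  const url = `https://api.github.com/repos/${fullName}`;
  const cached = etagCache.get(fullName);

  const headers: Record<string, string> = {};
  if (token) headers["Authorization"] = `token ${token}`;
  if (cached) headers["If-None-Match"] = cached.etag;

  const res = await fetch(url, { headers });

  // 304 Not Modified: does not count against the rate limit,
  // so we can reuse the previously stored body.
  if (res.status === 304 && cached) return cached.body;

  const body = await res.json();
  const etag = res.headers.get("etag");
  if (etag) etagCache.set(fullName, { etag, body });
  return body;
}
```

The cached bodies and ETags would have to be persisted between runs for the 304s to pay off, which is the metadata-maintenance part I mean.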
The two-day delay is not only because of the limit; I think it is short enough, and I don't want to push to GitHub too often.
It would be possible to move the JSON file out as well. I have been thinking about S3 with CloudFront, although then it would not be completely gh-pages based and would lose some of the "charm".
I had not thought about conditional requests; that might be a good addition actually, I will look into it, thanks.
The reason my cron job has failed is that node terminates with `std::bad_alloc`. So I guess it's a memory leak, or the JSON files are simply too large. I'm going away for a week now so I won't be able to look at it right now.
I have started nipster on another server, just to see if that would magically help.
Cool!
I bet you could also get some mileage out of a streaming JSON serializer so you don't have to keep all responses in memory at once... but I'm not sure how large that data file actually is (I haven't run nipster locally).
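Even a hand-rolled version would illustrate what I mean (untested sketch; `loadPackages` is a made-up placeholder for however nipster iterates over the registry data; a real streaming serializer library would handle backpressure and edge cases more robustly):

```ts
import { createWriteStream } from "fs";

// Write packages.json one entry at a time instead of JSON.stringify-ing
// one giant object, so only a single package is in memory at once.
async function writePackagesFile(
  path: string,
  loadPackages: () => AsyncIterable<object>
) {
  const out = createWriteStream(path);
  out.write("[\n");
  let first = true;
  for await (const pkg of loadPackages()) {
    if (!first) out.write(",\n");
    out.write(JSON.stringify(pkg)); // serialize just this one entry
    first = false;
  }
  out.write("\n]\n");
  // Wait for the stream to flush before returning.
  await new Promise<void>((resolve, reject) => {
    out.on("error", reject);
    out.end(() => resolve());
  });
}
```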
Streaming JSON? What is this magic, looks very nice :) I was keeping the whole response from GitHub in the object, which is quite large, and as I am saving every 100 packages (just so I can start/stop whenever I want) it became too much data.
Now, keeping only the parts from GitHub that I need, the JSON file for all packages is 28 MB, which is not that much (including all 45,000 packages).
I just found this project today, and it seems really useful. But do you have any ideas on how to keep packages.json up to date automatically? The last update was 25 days ago, and I'm sure it's tedious for you to continually update it.
Perhaps a public Dropbox folder, or a cron job that runs and commits once a night?
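For example, something along these lines in a crontab (the path and script name are made up):

```
# hypothetical crontab entry: rebuild packages.json nightly and push it to gh-pages
0 3 * * * cd /home/me/nipster && node update.js && git commit -am "Update packages.json" && git push
```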