simoniz0r opened this issue 6 years ago
@antony-jr you can hardly call this "web scraping". If we decide to perform web scraping, we should implement it properly. Then we can tell exactly when the HTML format changes and ask users to report it to us. No debugging needed.
@TheAssassin That's the point. I don't want a full-fledged web scraper; I want to make this as inexpensive as possible. An example of the processed regex would look like this -> (AppImageUpdate\-).*(\-\.Appimage\.zsync), which will only give us the filename of the zsync file, which is all we want. Now we have the missing piece of the puzzle: the generic zsync method needs a static filename, and once we have obtained the filename this way, we can use it. In short: we just need the filename of the .zsync file in the releases of the specific tag to derive a URL to the .zsync file.
EDIT:
ZSYNC_FILE_URL = https://github.com/(username)/(repo)/releases/(tag)/ + AppImageUpdate-xxxx-.Appimage.zsync
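A minimal sketch of what that lookup could look like, assuming Python and a regex in the spirit of the one above; the HTML snippet, repository name, and tag are made up, and the URL is assembled according to the scheme proposed here (note that GitHub normally serves release assets from .../releases/download/(tag)/(file)):

```python
import re

# Made-up release page source, as fetched from
# https://github.com/(username)/(repo)/releases/(tag)
release_page_html = '''
<a href="/someuser/somerepo/releases/download/continuous/AppImageUpdate-5f3e2c-x86_64.AppImage.zsync">
  AppImageUpdate-5f3e2c-x86_64.AppImage.zsync
</a>
'''

# Regex in the spirit of the one above: static prefix, anything in the middle,
# static ".AppImage.zsync" suffix (non-greedy, so it stops at the first match).
pattern = re.compile(r"AppImageUpdate-.*?\.AppImage\.zsync")

match = pattern.search(release_page_html)
if match:
    zsync_filename = match.group(0)
    # ZSYNC_FILE_URL as proposed in this thread; GitHub's real asset downloads
    # usually live under .../releases/download/(tag)/(file).
    zsync_file_url = ("https://github.com/someuser/somerepo/releases/continuous/"
                      + zsync_filename)
    print(zsync_file_url)
```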
@antony-jr I already explained in detail why your method is not reliable. Just because you don't want real web scraping, that doesn't make your solution any more viable. The problem is not that the method may fail (i.e., resolve to non-existing URLs); the problem is the risk of false positives, and those are not acceptable for an update solution. Your algorithm is furthermore less reliable than the API method. I like your other suggestion, though, although it's going to be used in a minority of cases.
@TheAssassin Got it, thanks for the review.
@probonopd just because it doesn't fit your use case doesn't mean we shouldn't implement a fix.
That's not what I said. Still I'd like to encourage everyone to consider ways to reduce the number of update checks to what is really needed. If the problem still persists, then of course we should solve it.
@probonopd please don't use this as a place to discuss this "phenomenon" then. This is entirely off topic.
@antony-jr I'd love to see you sending a PR implementing the first method, perhaps with the workflow changes I suggested which make it more efficient.
@TheAssassin Sure, will do.
Not saying there may not be reason to check 61 AppImages for updates at once, but at least for me I never have the urge to...
@probonopd Say you're going to go away for a month and will not have reliable internet access at all. You want to check your 97 AppImages for updates, but 77 of them are hosted on GitHub. After the 60th, you're going to get errors. Then you're going to have to leave with 17 AppImages that might not have gotten updates.
What's even worse here is that GitHub may punish users for attempting to connect after their rate limit has been reached. A random person who isn't aware of GitHub's API at all should not be punished for something the application should be taking care of itself. Any application that uses GitHub's API should be aware of the rate limit and not let its users unknowingly abuse it.
https://developer.github.com/v3/guides/best-practices-for-integrators/#dealing-with-rate-limits
Dealing with rate limits
The GitHub API rate limit ensures that the API is fast and available for everyone.
If you hit a rate limit, it's expected that you back off from making requests and try again later when you're permitted to do so. Failure to do so may result in the banning of your app.
You can always check your rate limit status at any time. Checking your rate limit incurs no cost against your rate limit.
Dealing with abuse rate limits
Abuse rate limits are another way we ensure the API's availability. To avoid hitting this limit, you should ensure your application follows the guidelines below.
- Make authenticated requests, or use your application's client ID and secret. Unauthenticated requests are subject to more aggressive abuse rate limiting.
- Make requests for a single user or client ID serially. Do not make requests for a single user or client ID concurrently.
- If you're making a large number of POST, PATCH, PUT, or DELETE requests for a single user or client ID, wait at least one second between each request.
- When you have been limited, use the Retry-After response header to slow down. The value of the Retry-After header will always be an integer, representing the number of seconds you should wait before making requests again. For example, Retry-After: 30 means you should wait 30 seconds before sending more requests.
- Requests that create content which triggers notifications, such as issues, comments and pull requests, may be further limited and will not include a Retry-After header in the response. Please create this content at a reasonable pace to avoid further limiting.
We reserve the right to change these guidelines as needed to ensure availability.
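To make the quoted guidelines concrete, here is a rough sketch of the kind of back-off a client could implement, assuming Python with the requests library; the endpoint and headers are taken from the documentation linked above, while the function name and control flow are only illustrative:

```python
import time
import requests

API = "https://api.github.com"

def github_get(url):
    """GET a GitHub API URL, backing off when the server asks us to."""
    while True:
        response = requests.get(url)
        retry_after = response.headers.get("Retry-After")
        if response.status_code == 403 and retry_after is not None:
            # Abuse rate limit hit: wait the number of seconds GitHub asks for.
            time.sleep(int(retry_after))
            continue
        return response

# Checking your rate limit status does not count against the limit itself.
status = github_get(API + "/rate_limit").json()
core = status["resources"]["core"]
print("remaining:", core["remaining"], "resets at:", core["reset"])
```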
Okay, I have another idea to fix the issue. The .zsync file is auto-generated by appimagetool, right? And appimagetool is what puts the version number in it, making a wildcard in the update information necessary?
But in the end, we don't need a version number in the .zsync file. Actually, the .zsync file could use a static filename that is the same for every new release. The version number can still be embedded in the AppImage's filename, but there is no technical reason to put it in the .zsync file's name as well.
If we stopped doing so, we would not need to use the GitHub API (except for fetching the latest release), eliminating one API request, as soon as @antony-jr's first solution is implemented.
The version number can still be embedded in the AppImage's filename, but there is no technical reason to put it in the .zsync file's name as well.
@TheAssassin If that's possible, then we don't really need to scrape anything, though I suppose this goes against the AppImage specification (which states that you can use wildcards in the update information for GitHub). Please note that with web scraping we look for a specific section in the source file under the assumption that a link exists there (which can change very frequently), so there is a very high chance of failure too. Instead of that assumption, why don't we assume the direct URL, which could fail sometimes but can be fixed by just changing the URL format (which is not changed frequently by GitHub)?
First method (standard web scraping):

1. Convert the update information to this URL -> https://github.com/(username)/(repo)/releases/(tag)
2. Request the URL (https://github.com/(username)/(repo)/releases/(tag)) and store the response.
3. Parse the <a> and </a> tags in the response.

PROS:
CONS:
Second method:

1. Convert the update information to this URL -> https://github.com/(username)/(repo)/releases/(tag)
2. Request the URL (https://github.com/(username)/(repo)/releases/(tag)) and store the response.
3. Create a regex expression for the response such that all wildcards in the filename of the update information are converted to .* and all the static strings in the filename of the update information are converted to ([string]). This should give a regex something like this -> (String1-).*(String2).*(-.AppImage\.zsync)
4. Applying the above newly created regex to the source file at the URL (https://github.com/(username)/(repo)/releases/(tag)) will give us the filename of the .zsync file which corresponds to the update information.
5. With the zsync filename in place, we assume a URL which should work most of the time, unless GitHub changes the URL format, which is very rare; if that happens, we can fix it by just editing the URL format, which is much easier to maintain than web scraping. (A sketch of steps 3 and 4 follows below.)

ZSYNC_FILE_URL = https://github.com/(username)/(repo)/releases/(tag)/ + (filename of .zsync file)

PROS:
CONS:
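A rough sketch of steps 3 and 4, assuming Python; the function names, the wildcard-splitting approach, and the HTML snippet are all illustrative, not the actual AppImageUpdate implementation:

```python
import re

def wildcard_to_regex(filename_pattern):
    """Step 3: every wildcard in the update information's filename becomes
    '.*' and every static string is kept literally (escaped)."""
    parts = filename_pattern.split("*")
    return re.compile(".*?".join(re.escape(p) for p in parts))

def find_zsync_filename(release_page_html, filename_pattern):
    """Step 4: apply the regex to the release page source; the match is the
    filename of the .zsync file, which then plugs into ZSYNC_FILE_URL above."""
    match = wildcard_to_regex(filename_pattern).search(release_page_html)
    return match.group(0) if match else None

# Hypothetical usage with a made-up page snippet:
html = '<a href="...">AppImageUpdate-1.0-x86_64.AppImage.zsync</a>'
print(find_zsync_filename(html, "AppImageUpdate-*.AppImage.zsync"))
```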
Anyway, I really think you should reconsider using the first method. I will be sending two pull requests: the first one will be the standard web scraping with libtidy-html5, and the second one will be the second method I suggested. You will get a clearer picture when you see the code. It's up to you to choose whichever code seems best.
Do you think GitHub will appreciate being web scraped? I don't think so. It might even violate their TOS? If you want to do that in your clients you're free to, but I don't think we should do it in AppImageUpdate.
You may not scrape GitHub for spamming purposes, including for the purposes of selling GitHub users' personal information, such as to recruiters, headhunters, and job boards.
@probonopd I think we are good, because we only scrape public data for a good purpose. But if you are concerned about this, then we can stick with GitHub tokens. I'm not a big fan of web scraping; I really like to keep everything simple.
You may scrape the website for the following reasons:
Researchers may scrape public, non-personal information from GitHub for research purposes, only if any publications resulting from that research are open access.
Archivists may scrape GitHub for public data for archival purposes.
Which states that you can use wild-cards in the update information for github
We don't remove this functionality, we just reduce the number of times it's used. That's the idea behind it.
@antony-jr thanks for offering to implement web scraping, however it's not very likely to be merged.
No problem. I was pretty concerned about using web scraping myself; it's really a bad idea for an auto-updater. Are you going to use GitHub tokens, then? I guess that's the standard and correct way to tackle this. I think there is more to it (like making the zsync filenames static?).
@antony-jr I still think we should switch to static filenames for the .zsync files, at least for the auto-generated ones appimagetool creates. That'll cut requests by almost half in a reasonable amount of time, because we'd only have to check whether it's a "latest" version or a static tag. In the latter case, we can then build a URL without even making a request to the API.
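A small sketch of that decision, assuming Python; the function names, repository, tag, and static filename are made up, and the direct URL uses GitHub's usual .../releases/download/(tag)/(file) asset path:

```python
def needs_api_request(tag):
    """With static .zsync filenames, only 'latest' still needs the API to
    resolve the actual tag; a pinned tag can be turned into a URL directly."""
    return tag == "latest"

def direct_zsync_url(username, repo, tag, zsync_filename):
    # GitHub serves release assets from .../releases/download/<tag>/<file>.
    return "https://github.com/{}/{}/releases/download/{}/{}".format(
        username, repo, tag, zsync_filename)

# Hypothetical update information: someuser/somerepo, pinned tag v1.2,
# static filename MyApp-x86_64.AppImage.zsync
if not needs_api_request("v1.2"):
    print(direct_zsync_url("someuser", "somerepo", "v1.2",
                           "MyApp-x86_64.AppImage.zsync"))
```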
One key thing to remember here is that the rate limit isn't per session, or just for your tool. So if you're using other tools (brew, cargo, go, etc.) and you happen to do a run of updates, you can quite quickly hit that rate limit. So even if you only have one AppImage, you can still end up affected by this problem.
I actually just hit this. I have a toolchain I run sometimes, as I just did, that runs over a bunch of items and updates them. Somewhere along the way I hit a point where it was too much, and so this time it hit the rate limit. Easy enough for the rest of my tools -- I just made a PAT available and everything else updated fine, but the AppImage updater failed. I was disappointed to see that the token isn't supported.
I understand not making it required -- I wholeheartedly agree with that. However, it should at least give you the option to use it.
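For illustration, an optional token could be as simple as the following sketch, assuming Python with the requests library; the GITHUB_TOKEN environment variable is only a hypothetical opt-in mechanism here, not something AppImageUpdate currently supports:

```python
import os
import requests

def github_headers():
    """Use a personal access token when the user provides one; stay
    unauthenticated (and limited to 60 requests/hour) otherwise."""
    token = os.environ.get("GITHUB_TOKEN")  # hypothetical opt-in variable
    return {"Authorization": "token " + token} if token else {}

response = requests.get(
    "https://api.github.com/repos/AppImage/AppImageUpdate/releases/latest",
    headers=github_headers(),
)
print(response.json().get("tag_name"))
```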
Continued from https://github.com/AppImage/AppImageKit/issues/653#issuecomment-380662974
https://developer.github.com/v3/rate_limit/