EbookFoundation / free-programming-books

:books: Freely available programming books
https://ebookfoundation.github.io/free-programming-books/
Creative Commons Attribution 4.0 International

would like to reopen #115 #692

Closed: borgified closed this issue 10 years ago

borgified commented 10 years ago

#115 was an attempt to check broken URLs.

Creating a new issue for visibility.

I finally got around to writing the URL checker. Actually, I had it finished a while back but I wasn't happy with it then. Now I've had some time to rewrite it and add some improvements, and I think it's ready for use. As it stands:

  1. It scans all of the books (free-programming-books*.md) and goes through each URL (a rough sketch of the check follows this list).
  2. If a URL is bad, it creates a GitHub issue here with that URL. The issue includes the language (i.e. ja, en, it) so you know which file to find the bad URL in, and it also includes the original committer's login so that they are automatically notified (maybe they'll return and give us an updated link in a PR).
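
A minimal sketch of what such a check might look like in Perl with LWP::UserAgent. This is not the actual url_checker code; the timeout, the HEAD-then-GET fallback, and the naive URL extraction regex are all assumptions:

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use LWP::UserAgent;

# Not the actual url_checker code: try HEAD first, fall back to GET for
# servers that reject HEAD, and report the HTTP status when the link fails.
my $ua = LWP::UserAgent->new(timeout => 15, agent => 'url_checker');

sub check_url {
    my ($url) = @_;
    my $res = $ua->head($url);
    $res = $ua->get($url) unless $res->is_success;
    return $res->is_success ? undef : $res->code;    # undef means "looks fine"
}

for my $file (glob 'free-programming-books*.md') {
    open my $fh, '<', $file or die "can't open $file: $!";
    while (my $line = <$fh>) {
        # Naive markdown-link extraction; the parenthesis problem this causes
        # comes up later in this thread.
        while ($line =~ m{\((https?://[^)\s]+)\)}g) {
            my $url    = $1;
            my $status = check_url($url);
            print "$file: $url -> HTTP $status\n" if defined $status;
        }
    }
}
```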

For an example of what it might look like in action, check here:

https://github.com/borgified/test_issue/issues?state=open

I just made a test repo for the script to create issues in when it finds bad URLs.

My URL checker script is here: https://github.com/borgified/url_checker

If all goes well, I plan to run this on Red Hat's OpenShift as a cron job... maybe once a week would be enough?

There are a few other improvements I still need to make, such as:

  1. Don't create a new bad-URL issue if an issue for the same URL already exists.
  2. Figure out how to run it inside OpenShift.

Let me know if you want me to run it once to start things off. I'm holding off because it will create a bunch of issues (there are a few bad URLs), and I don't want to spam up the issues section without your blessing.

borgified commented 10 years ago

I just realized I also need the ability to label issues in order to make full use of all the features already in the script.

vhf commented 10 years ago

This is great, thanks!

I think the first item on your todo list, "detect if the same issue has already been created previously (and avoid creating another)", is very important before we run this script on the whole repo.

Tell me when it's ready and I'll give you the rights to label issues on this repo. Thanks again

borgified commented 10 years ago

Excellent. I've come up with a tentative solution; I'll discuss it here to see if anyone can come up with a better way.

So basically, no software is perfect, and detecting "broken" links is actually a hard job for a computer (but easy for humans). I sometimes get false positives (maybe due to misconfiguration on the server where the book is hosted), and for those cases my initial thought is to leave the issues open, with a comment to let humans know that the link is actually good.

This way, with the issue open, I'll know not to create additional issues for that same URL. We just have to make a point of not closing these issues so that I don't hit those false positives again. The reason I wanted to do it this way is that I wanted to keep things simple on my end and not have to keep a file or a database of checked URLs. That way, anyone else who wants to scan for broken URLs on their own using my program can do so without any additional dependencies.
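
One way the "don't file duplicates" check could work is to pull the open issues up front and skip any URL that already appears in an issue title. A sketch under those assumptions (the repo path and the assumption that the URL appears verbatim in the title are mine, not necessarily how url_checker does it):

```perl
use strict;
use warnings;
use HTTP::Tiny;
use JSON::PP qw(decode_json);

# Collect URLs already mentioned in open issue titles so the checker can skip
# them. Note: this endpoint also returns pull requests, and only the first
# page of results is fetched here.
my $http = HTTP::Tiny->new(agent => 'url_checker');
my $res  = $http->get(
    'https://api.github.com/repos/vhf/free-programming-books/issues?state=open&per_page=100'
);
die "GitHub API request failed: $res->{status}\n" unless $res->{success};

my %already_reported;
for my $issue (@{ decode_json($res->{content}) }) {
    # Assumes the bad URL appears verbatim in the issue title.
    $already_reported{$1} = 1 if $issue->{title} =~ m{(https?://\S+)};
}

# Later, before opening a new issue for $url:
# next if $already_reported{$url};
```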

I'm ready to merge this solution into the main program, probably by sometime tomorrow, but I'll wait to see if there are better solutions.

borgified commented 10 years ago

OK, all set on my side. Toss me the access to label issues (including creating new labels) and I think we're good to go for our first run. I'll monitor the script as it runs in case anything goes wrong (we can abort before it creates a huge problem).

vhf commented 10 years ago

Great, I'll add you shortly.

borgified commented 10 years ago

OK, it's running. I see a ton of false positives; I didn't expect so many, but I guess it can't be helped. When the server gives me a 503 error (Service Unavailable) but still returns the webpage, I don't have a good way to detect that yet.

In light of all the false positives, I'm closing those issues to get them out of the way. But first, I'm labeling them as "false positive" so that the next time the script runs, it won't regenerate new issues for them.
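
One possible heuristic for the 503 case described above (not something url_checker is confirmed to do, and the 512-byte threshold is an arbitrary guess): if the status is 503 but the server still returns a substantial body, flag the link for human review instead of filing a bad-URL issue.

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new(timeout => 15, agent => 'url_checker');

# Treat a 503 that nonetheless serves a sizeable page as "needs a human look"
# rather than as a dead link. The threshold is arbitrary.
sub classify_url {
    my ($url) = @_;
    my $res = $ua->get($url);
    return 'ok'           if $res->is_success;
    return 'needs review' if $res->code == 503
                             && length($res->decoded_content // '') > 512;
    return 'bad';
}
```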

borgified commented 10 years ago

1395 URLs checked: 33 false positives, 8 actual bad URLs.

Actually, my program bombed out due to a bad regexp; there were some '(' and ')' characters in the URL itself for this link: http://en.wikibooks.org/wiki/Clipper_Tutorial:_a_Guide_to_Open_Source_Clipper(s)

So it aborted before it could finish the rest. I'll continue to improve the program so there are far fewer false positives, and then I'll run it again.

esparta commented 10 years ago

Can we (OK, I) see the code? << NEVER MIND, I found it. This exercise is fine as a first-time check: publish the list and do a manual review. But we could also use the script as a test that checks all the links and integrate this repo with Travis CI, so that every pull request gets checked automatically. What do you think about it?

esparta commented 10 years ago

If you prefer the bot path, please use a one-issue approach to post the dead links, perhaps one per week or month.

borgified commented 10 years ago

Sure, yeah... I actually posted a link to it in my first comment. Here it is again: https://github.com/borgified/url_checker

I actually did try the Travis CI approach 4 months ago: https://github.com/vhf/free-programming-books/pull/115

but I forget why I abandoned that idea...

Good suggestion on the one-issue approach; I'll give it a shot.

I'm remembering more about why I didn't want to do Travis CI. I think one of my ideas was to actually fetch the content of each webpage and store its SHA in a database so I could verify whether the site "changed". This would catch those fringe cases where the link to a book is still "good" but there's actually no book there anymore.

Another advantage of storing the results in a database is handling those unreliable links that are "good" the majority of the time but, for some unlucky reason, happen to be bad exactly when they're tested. Having a record of when a link was tested and how often it was good could factor into whether an issue should be created for it (maybe a more reliable link could replace it).

In the end I decided that was way too convoluted and I didn't want to introduce a database, so I opted for this simpler approach that just uses issue labels to store information about each link for me.
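
A minimal sketch of that content-hash idea, assuming Digest::SHA and a flat file of stored fingerprints (the storage format and the helper are hypothetical, not part of url_checker):

```perl
use strict;
use warnings;
use LWP::UserAgent;
use Digest::SHA qw(sha256_hex);

my $ua = LWP::UserAgent->new(timeout => 15, agent => 'url_checker');

# Fingerprint the raw response body; a changed hash may mean the page is
# still reachable but the book itself is gone or replaced.
sub page_fingerprint {
    my ($url) = @_;
    my $res = $ua->get($url);
    return unless $res->is_success;
    return sha256_hex($res->content);
}

# Compare against a previously stored fingerprint (hypothetical helper):
# my %seen = load_fingerprints('fingerprints.txt');
# warn "$url content changed\n"
#     if $seen{$url} && $seen{$url} ne page_fingerprint($url);
```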

esparta commented 10 years ago

I was trying to write a regex for the corner case you mentioned, http://en.wikibooks.org/wiki/Clipper_Tutorial:_a_Guide_to_Open_Source_Clipper(s) , and I thought I had it with this:

`\[(.*?)\]\((\S+(?=\)))\)`

But then you hit other corner cases, for example this line:

* [Computer Networking: Principles, Protocols and Practice, 2nd edition (CNP3bis)](http://cnp3bis.info.ucl.ac.be/) (PDF, EPUB + [sources](https://github.com/obonaventure/cnp3/tree/master/book-2nd)) - O. Bonaventure (in progress)

This is what I was trying: http://regex101.com/r/aR7sS9 (see match 4). Conclusion for the moment: it isn't worth the effort to do this with a regex, so in the meantime I'm back to the basic one (http://regex101.com/r/jX5yG5):

`\[(.*?)\]\((.*?)\)`

This needs a proper parser, and maybe we don't need that kind of complication :)
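
For the record, here is a quick Perl reproduction of that corner case; the shortened test line is mine, but the pattern is the lookahead one posted above. It extracts the first URL cleanly but drags a stray ')' into the "sources" URL:

```perl
use strict;
use warnings;

my $line = '[Computer Networking ... (CNP3bis)](http://cnp3bis.info.ucl.ac.be/) '
         . '(PDF, EPUB + [sources](https://github.com/obonaventure/cnp3/tree/master/book-2nd)) - O. Bonaventure';

# The lookahead pattern handles Clipper(s), but here the greedy \S+ swallows
# one closing parenthesis too many on the second link.
while ($line =~ /\[(.*?)\]\((\S+(?=\)))\)/g) {
    print "captured URL: $2\n";
}
# captured URL: http://cnp3bis.info.ucl.ac.be/
# captured URL: https://github.com/obonaventure/cnp3/tree/master/book-2nd)
```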

borgified commented 10 years ago

Nice :)

Three hours into the regexp ocean, I was completely lost in the fog of recursive regexps without getting any closer to the answer... then I found URI::Find. I guess someone else had already figured it out.

After a couple of tests, I'm convinced that URI::Find will extract every last URL from each book, regardless of how many parentheses you throw at the URL or how many URLs there are per line.

There's still some work to be done, though; it's not perfect. For example, the input http://www.jsoftware.com/books/pdf/brief.pdf)(PDF gets extracted as http://www.jsoftware.com/books/pdf/brief.pdf)(PDF. But I suspect these show up because they are technically valid URLs, and maybe the "right" approach would be to edit the input and add a space, so that http://www.jsoftware.com/books/pdf/brief.pdf)(PDF becomes http://www.jsoftware.com/books/pdf/brief.pdf) (PDF.

Then I'm pretty sure it'll just ignore the (PDF part.
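
For anyone following along, basic URI::Find usage looks roughly like this; the callback and the test line are illustrative, not copied from the url_checker branch:

```perl
use strict;
use warnings;
use URI::Find;

my @urls;
my $finder = URI::Find->new(sub {
    my ($uri, $original_text) = @_;
    push @urls, $original_text;
    return $original_text;    # leave the scanned text unchanged
});

my $line = '* [Clipper Tutorial](http://en.wikibooks.org/wiki/Clipper_Tutorial:_a_Guide_to_Open_Source_Clipper(s))';
$finder->find(\$line);

print "$_\n" for @urls;    # prints whatever URI::Find extracted from the line
```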

See https://gist.github.com/borgified/8996783 for the list of extracted URLs.

I'm working in this branch: https://github.com/borgified/url_checker/tree/regexp_parens_in_url

I've already exorcised any and all issue creation from the script (bad idea); it just prints to STDOUT for now. I'm considering uploading the output to a gist and then maybe creating a single issue linking to that gist, which would contain all the broken URLs.
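
If the single-issue route wins out, posting it through the GitHub API could look something like this; the repo path, token handling, title, and placeholder URLs are all assumptions, not url_checker code:

```perl
use strict;
use warnings;
use HTTP::Tiny;
use JSON::PP qw(encode_json);

# One issue per run, with every broken URL listed in the body.
my $token  = $ENV{GITHUB_TOKEN} or die "set GITHUB_TOKEN first\n";
my @broken = ('http://example.com/dead-book', 'http://example.org/gone.pdf');

my $res = HTTP::Tiny->new(agent => 'url_checker')->post(
    'https://api.github.com/repos/vhf/free-programming-books/issues',
    {
        headers => {
            'Authorization' => "token $token",
            'Content-Type'  => 'application/json',
        },
        content => encode_json({
            title => 'url_checker: broken links found on ' . scalar(localtime),
            body  => join("\n", map { "- $_" } @broken),
        }),
    },
);
die "issue creation failed: $res->{status} $res->{reason}\n" unless $res->{success};
```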

borgified commented 10 years ago

I'm going to copy-paste the STDOUT of my script to create issues for now; there isn't much advantage to outputting to a gist.