Travis Build: add a plagiarism check

davcri commented 7 years ago

As discussed in https://github.com/freeCodeCamp/guides/issues/2503#issuecomment-338450570, we should add a plagiarism check at build time. Options that can be evaluated:

https://www.npmjs.com/package/plagiarism-checker
Plagiarism Checker (seems deprecated, but there seem to be alternatives: https://github.com/architshukla/Plagiarism-Checker/issues/29)
write a custom script

Ethan-Arrowood commented 7 years ago

As @QuincyLarson mentioned in https://github.com/freeCodeCamp/guides/issues/2503#issuecomment-339134886, it might be worth implementing this as a precommit hook to block users from even submitting a plagiarized PR in the first place.

But what if the precommit hook is wrong?

Then there should also be a --not-plagiarism flag that users could add to their commit statement to override the plagiarism checker.
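A minimal sketch of how such an override could be detected, assuming the flag is written into the commit message itself (git has no custom commit options, so that placement is an assumption). The names `has_override` and `should_block` are hypothetical and not part of any existing hook.

```python
import re

# Hypothetical override marker, taken from the flag name proposed above.
OVERRIDE_FLAG = "--not-plagiarism"


def has_override(commit_message: str) -> bool:
    """Return True if the message contains the flag as a standalone token."""
    # Require whitespace (or start/end of text) on both sides of the flag.
    pattern = r"(?<!\S)" + re.escape(OVERRIDE_FLAG) + r"(?!\S)"
    return re.search(pattern, commit_message) is not None


def should_block(commit_message: str, flagged_by_checker: bool) -> bool:
    """Block only when the checker flagged the content and no override is present."""
    return flagged_by_checker and not has_override(commit_message)
```

A hook built on this would still let anyone bypass the check, so any override would need to be visible to reviewers, which is the abuse question raised next.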

And how can we stop users from abusing the flag?

We should outline in our contributing guidelines that if a user submits 3 or more PRs containing blatantly plagiarized content, they will be blocked from contributing to this project. We could of course set up an appeal process for users who plagiarized accidentally or were unaware of what they did.

Bottom line is we need to establish a strict no-plagiarism policy and we need to enforce it. Furthermore, the number of PRs continues to grow like wildfire, and I believe that implementing some precommit hooks would limit the number of bad PRs that land on this project. I know this may stifle some new contributors, but it is still important that we maintain this project with good contributing guidelines.

Like I stated before, I can't take the lead on this development at the moment, but I'd be happy to answer any questions and provide feedback on things other FCC contributors propose.

davcri commented 7 years ago

About the pre-commit hook, @Bouncey stated that:

This check would be better as a Travis check due to the number of PRs coming via the GitHub GUI. Pre-commit hooks only work when committing locally.

dhcodes commented 7 years ago

@davcri re: your question. I'm currently working on a probot-based PR bot that would work as follows:

User submits PR
PR gets tagged as content by reviewer--or auto-tagged based on its place in the pages dir. This label triggers the bot
Bot runs selected parts of diffed content through a Google Search
If a certain percentage is found on another site, the bot automatically adds a comment to the PR that says in nice terms that it's suspected to be plagiarized from <link to original site> and then auto-tags it with a label accordingly.
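The "certain percentage" step above could be sketched roughly as follows. This is not the bot's actual code: it assumes the page text returned by the search has already been fetched, and it measures overlap with five-word shingles, which is a swapped-in technique chosen for illustration. The function names, the 0.5 threshold, and the comment wording are all hypothetical.

```python
import re
from typing import Optional


def shingles(text: str, size: int = 5) -> set:
    """Split text into lowercase words and return the set of size-word runs."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def overlap_fraction(submitted: str, candidate: str, size: int = 5) -> float:
    """Fraction of the submitted text's shingles that also occur in the candidate."""
    sub = shingles(submitted, size)
    if not sub:
        return 0.0  # text too short to compare
    return len(sub & shingles(candidate, size)) / len(sub)


def suspicion_comment(submitted: str, candidate: str, source_url: str,
                      threshold: float = 0.5) -> Optional[str]:
    """Return a polite PR comment if the overlap reaches the threshold, else None."""
    fraction = overlap_fraction(submitted, candidate)
    if fraction < threshold:
        return None
    return (
        f"Thanks for contributing! About {fraction:.0%} of the sampled phrases "
        f"in this PR also appear at {source_url}. "
        "Could you check that the content is original or properly cited?"
    )
```

Returning a comment rather than closing the PR keeps the final call with a human reviewer.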

I need to think about how to exclude articles that use direct quotes or have adequate references.

I'd love to get your thoughts. As for running the existing PRs through it, we could probably temporarily change its criteria to go through existing PRs, but we'd need to pay a one-time cost to extend the Google Custom Search API beyond its 100/day query limit.
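To get a feel for that limit, here is a rough back-of-the-envelope sketch. Only the 100/day figure comes from the comment above; the function name and any PR or per-PR query counts passed to it are hypothetical inputs.

```python
import math

DAILY_QUERY_LIMIT = 100  # the per-day limit quoted above


def days_to_backfill(open_prs: int, queries_per_pr: int,
                     daily_limit: int = DAILY_QUERY_LIMIT) -> int:
    """Days needed to check every existing PR without exceeding the daily limit."""
    total_queries = open_prs * queries_per_pr
    return math.ceil(total_queries / daily_limit)
```

With made-up inputs of 250 PRs and 3 queries each, `days_to_backfill(250, 3)` gives 8 days, which shows why paying for extra quota might be preferable for a one-off backfill.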

Right now the bot is running but has none of the functionality above. I'm currently working on only getting diffed text.

QuincyLarson commented 7 years ago

@dhcodes I think instead of trying to exclude articles that use direct quotes or have adequate references, this should be left to a human reviewer to decide. By raising awareness that the content comes from an external source, it will give PR reviewers a heads-up that they need to make sure things are properly cited. Then they can pass judgement as to whether further citation is necessary themselves.

davcri commented 7 years ago

@QuincyLarson yes, this seems like a perfect balance between the script (Probot in this case) and human effort.

@dhcodes I didn't know about Probot! Is the code for your bot hosted here on GitHub? I could try to help, even though I'm not very experienced with JavaScript.

dhcodes commented 7 years ago

@davcri the code is here: https://aromatic-okra.glitch.me

Sorry for taking so long :(. It's most definitely a learning experience.

jp-sauve commented 7 years ago

Is citation enough? It obviously cures plagiarism, and where the sources have permissive licenses, it's enough. But for copyrighted content, there is a delicate balance between fair use and infringement, and it seems that it would come down to a few factors: whether the FCC guide can be considered commercial, whether the copied material would deprive the copyright owner of profits, and the amount of content copied. For reviewers, the last factor seems to be the important one. Copying a whole page could fall outside of fair use, and so could copying many sections from the same site. In these cases, the method of citation and whether there are quotes seem irrelevant. This could matter in practice, as I've seen multiple PRs copying from mathisfun.com and techopedia.com.

QuincyLarson commented 7 years ago

@jp-sauve Rather than delving into the legal questions you've posed, I think we should follow our best judgement and operate under the principle that articles should be primarily original. We are not interested in cross-posting articles from the MDN, Stack Overflow, etc. The bulk of these articles should be original. I think it's better to have a short article that is just a single, attributed, paragraph-length quote than nothing. But an article shouldn't consist only of multiple quoted paragraphs. The author should try to add some context.

freeCodeCamp / guide

Travis Build: add a plagiarism check #3315