freeCodeCamp / guide

A collection of easy-to-understand guides to programming tools
https://guide.freecodecamp.org

Add citation suggestion to README.md #2503

Closed. dhcodes closed this issue 5 years ago

dhcodes commented 6 years ago

I worry some of the new content directly plagiarizes other sites. IMHO, we should work on a recommended way to cite other sources and discourage unattributed copying/pasting.

Thoughts?

bryanchapel commented 6 years ago

Haha! I tweeted Quincy about this just yesterday and already put in a pull request for one particular article I found. This is how I did it: https://github.com/freeCodeCamp/guides/pull/2337/files?short_path=230a905#diff-230a9052be3f27a5607aea2debfbf534

dhcodes commented 6 years ago

I like it, @bryanchapel; good work! We need to write it into the README, assuming others agree. Would you like to do that once everyone has a chance to comment?

bryanchapel commented 6 years ago

Can do!

bryanchapel commented 6 years ago

Since I'm kind of new to this, about how long should you wait before making a decision on an issue? I made a change to the README and referenced this issue in the commit above to help the discussion along. Let me know how it looks. :)

davcri commented 6 years ago

I agree with all you said!

Only one note: we should also pay attention to licenses and terms of service. For example, someone opened a pull request with copy-pasted text from Quora, which has this long ToS. I don't know if this is legal, but one thing is sure: this is not ethical!

davcri commented 6 years ago

I found another PR, "Algorithm: Add AVL Tree Article", with content copied from tutorialspoint.

It's a sad situation, and it's difficult not to discourage contributors, but we are trying to create community content, not a copy-paste website.

lvcoulter commented 6 years ago

Not sure if this conversation is open, but I'm throwing in my thoughts. I wrote my first article last night. I was also an English teacher in my last incarnation, so I struggled with how to write concise content without it sounding exactly like my favorite resources. I opened several sources on the same topic and read them all, then closed them and wrote all my thoughts down without looking at any. That might be a good suggestion for your README. I did find that my own voice came out very similar to that favorite I mentioned, but I made sure to change up the examples and insert my own examples. Hope this helps.

Ethan-Arrowood commented 6 years ago

College student perspective: My university classes are all governed by very strict no-plagiarism rules. I think the guides should reflect similar values, as other content creators work hard to produce their work, and if we are not giving them proper citation then that is extremely unfair.

We could ask everyone to use something like APA citations (it's already a scientific standard). This would not only make the guide more professional, but also provide valuable SEO link-backs to creators.

bryanchapel commented 6 years ago

@Ethan-Arrowood Agreed. Didn't even think of the SEO benefits of this, haha!

@lvcoulter I think you're highlighting the difference between paraphrasing what you've researched and learned, and directly quoting. I think that's totally appropriate. My only suggestion would be to collect some of those references and stick them in a ### Other Resources section near the end of the doc so readers can further explore the topics.

@davcri I think the most important piece of Quora's ToS is this:

Subject to these Terms, Quora gives you a worldwide, royalty-free, revokable, non-assignable and non-exclusive license to re-post any of the Content on Quora anywhere on the rest of the web provided that the Content was added to the Service after April 22, 2010, and provided that the user who created the content has not explicitly marked the content as not for reproduction, and provided that you: (a) do not modify the Content; (b) attribute Quora by name in readable text and with a human and machine-followable link (an HTML anchor tag) linking back to the page displaying the original source of the content on http://quora.com on every page that contains Quora content; (c) upon request, either by Quora or a user, remove the user's name from Content which the user has subsequently made anonymous; (d) upon request, either by Quora or by a user who contributed to the Content, make a reasonable effort to update a particular piece of Content to the latest version on http://quora.com; and (e) upon request, either by Quora or by a user who contributed to the Content, make a reasonable attempt to delete Content that has been deleted or marked as not for reproduction on quora.com.

Also a good point I didn't think of. Users should be aware of a resource's ToS and the concepts of fair use.

I'll add these suggestions to the README changes I'm proposing in the commit referenced above. Good discussion all!

bryanchapel commented 6 years ago

On APA vs MLA formatting for citations, I agree that we could use APA since it's typically the scientific standard. I do like that the MLA specifies a "Date Accessed" for the citation. I think that's really helpful for possibly spotting references that may have changed or gone out of date since a topic was added to the guide, and we can amend/update these as needed.

Also, listing the link in the citation, as APA recommends, is a bit redundant as we should be creating a link to the resource itself in the markdown. I think that convention applies more to print/non-web citations.

I think the best format for our purposes would look something like this:

Author Last Name, Author First Name (if listed). "Article Title." Publication. Publisher. Date Published (if listed). Date Accessed.

And in the markdown it would look like: [Author Last Name, Author First Name. "Article Title." *Publication.* Publisher. Date Published. Date Accessed.](https://LinkToSource.com)
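To make the proposed format concrete, here is a minimal sketch of a helper that assembles such a markdown citation. The function name `formatCitation` and its field names are hypothetical, chosen only for illustration; nothing like this exists in the guide repository as far as this thread shows.

```javascript
// Hypothetical helper: builds a markdown citation in the format proposed above.
// All field names (lastName, firstName, ...) are assumptions for illustration.
function formatCitation({ lastName, firstName, title, publication, publisher, published, accessed, url }) {
  // Author is "Last, First." or just "Last." when no first name is listed.
  const author = firstName ? `${lastName}, ${firstName}.` : `${lastName}.`;
  const parts = [author, `"${title}."`, `*${publication}.*`];
  if (publisher) parts.push(`${publisher}.`);
  if (published) parts.push(`${published}.`); // optional: "if listed"
  parts.push(`${accessed}.`); // the MLA-style "Date Accessed"
  // The whole citation becomes the link text, so the URL is not repeated in it.
  return `[${parts.join(' ')}](${url})`;
}
```

Called with made-up values such as `lastName: 'Doe'`, `firstName: 'Jane'`, and `url: 'https://example.com/sorting'`, it returns a single markdown link whose text is the full citation.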

Maybe we could do a bit of both? Thoughts?

davcri commented 6 years ago

@QuincyLarson @Bouncey @HKuz @Timo (I'm tagging top contributors, they are more experienced than me): can you please give feedback? I think this is a delicate subject.

dhcodes commented 6 years ago

I think direct copy/paste should generally be discouraged unless directly quoted and integral to the article. Paraphrasing, which is what @lvcoulter is doing, is a-okay by me assuming we cite sources. As mentioned, I like @bryanchapel's approach as it's similar to Wikipedia. We could even model Wikipedia's citation format if we wanted. I'm not sure it truly matters if it's APA or MLA as long as we give credit where credit is due.

QuincyLarson commented 6 years ago

@dhcodes Sorry I'm so late to this thread.

Here are my thoughts on this: by forcing contributors to abide by a style guide, we're making it harder to contribute. Such a style guide should instead be enforced through an automated script. Just like we use ESLint for our JavaScript, we should use a style checker for our citations.

And we should tackle plagiarism the same way: by running a build script.

That way, if the build task detects what might be plagiarism, a human can look at it and make sure it's properly attributed.

Here's a library that does this. It hasn't been touched in a couple years, but we might be able to make it work. It's in Python, so @Ethan-Arrowood might be a good candidate for testing it out and seeing if we can get it running and incorporated into TravisCI: https://github.com/architshukla/Plagiarism-Checker

Again, my sentiment is that we should put up only as many rules and impediments to contributing as absolutely necessary. And those rules should be enforced at the CI level, where enforcement is transparent and consistent.

QuincyLarson commented 6 years ago

@davcri Thanks for spotting that case of clear plagiarism. I've closed that contributor's pull request and also reverted another PR from them that I spotted which had plagiarism. I gave him a stern one-time warning (the notion of plagiarism is less familiar in some parts of the world and I gave him the benefit of the doubt).

If we spot people plagiarizing, we should give them a one-time warning that they will be banned from contributing to the freeCodeCamp GitHub organization if they're caught again, and we should refer them to the Academic Honesty Policy: https://www.freecodecamp.org/academic-honesty

Ethan-Arrowood commented 6 years ago

I like the idea of a plagiarism-checker. The Python module @QuincyLarson linked is now broken due to the Google API it was using being deprecated. Furthermore, it would be easier to run a Node script through the Travis build anyway... So I propose we add a Plagiarism-Checker Node.js script as a down-the-road feature. However, at the moment I am way too busy to start this project. I have a lot on my plate including interview prep, university work, and personal projects (started my own OS project this week). If no one else wants to take the lead on this, I can create a blank repo and begin work in a few months once my life calms down a bit.

In the meantime I think the best course of action would be to write a CONTRIBUTING.md that highlights the basics of contributing as well as some additional details such as our stance on plagiarism and citations. Here is a good resource (includes examples) on how to properly set up a CONTRIBUTING.md file.

bryanchapel commented 6 years ago

I agree about the checker script as well. There is a Node version by Copyleaks (https://www.npmjs.com/package/plagiarism-checker) that we might be able to use. I also think that writing one from scratch might not be that hard.

You could use the request-promise and cheerio libraries to send chunks of the committed text to Google, then parse the first 10 or so results and check the text chunk against them for a fuzzy match. If there's, say, 60% or so similarity, the PR gets flagged as needing review. Everything I've found so far was the first hit returned by Google when I copied and pasted parts of the article. See this article on unconscious bias as an example. This user might also need a warning, as outlined by @QuincyLarson above? I put in a PR to fix their issue with citations already.
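The comparison step of that idea can be sketched roughly as follows. This is only one possible stand-in for "fuzzy match": it measures how many of the submitted chunk's word trigrams also appear in a candidate source. The function names, the trigram size, and the way the 60% threshold is applied are all assumptions, and fetching the search results (the request-promise and cheerio part) is deliberately left out.

```javascript
// Split text into a set of lowercase word n-grams ("shingles").
function shingles(text, size = 3) {
  const words = text.toLowerCase().match(/[a-z0-9']+/g) || [];
  const set = new Set();
  for (let i = 0; i + size <= words.length; i++) {
    set.add(words.slice(i, i + size).join(' '));
  }
  return set;
}

// Fraction (0..1) of the chunk's shingles that also occur in the source text.
function similarity(chunk, source, size = 3) {
  const chunkShingles = shingles(chunk, size);
  if (chunkShingles.size === 0) return 0; // chunk too short to compare
  const sourceShingles = shingles(source, size);
  let shared = 0;
  for (const s of chunkShingles) {
    if (sourceShingles.has(s)) shared++;
  }
  return shared / chunkShingles.size;
}

// Hypothetical policy: flag the chunk if any candidate source reaches the threshold.
function needsReview(chunk, sources, threshold = 0.6) {
  return sources.some((source) => similarity(chunk, source) >= threshold);
}
```

Here `sources` would be the text extracted from the top search results. A flagged chunk only means a human should look at it, since properly quoted and attributed passages would score high as well.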

At any rate, I added a note about the Academic Honesty Policy to my README commit, in addition to the stuff I added about proper attribution. Let me know how this looks, or if it should be pulled out into a separate CONTRIBUTING file as @Ethan-Arrowood suggests. Might even be best to mention it in both places just so it's clear and people don't have an excuse to say "I didn't see that guideline".

davcri commented 6 years ago

@bryanchapel did you already make a PR with your updated README? If not, can you make it? In this way we can discuss it (for me it's almost all right, I have only a doubt about using HTML tags vs markdown).

I vote for writing about the Honesty Policy in both the README and the CONTRIBUTING files.

I also opened a new issue to discuss adding a plagiarism check to the Travis build: https://github.com/freeCodeCamp/guides/issues/3315

bryanchapel commented 6 years ago

Just made the PR. #3371. This is just for the README. Didn't do anything for CONTRIBUTING.

dhcodes commented 6 years ago

I added some small edits.

QuincyLarson commented 6 years ago

@bryanchapel @dhcodes I've merged your edits! Thanks! We should mirror this in CONTRIBUTING to make sure people see it. Then I believe we can close this issue.

QuincyLarson commented 6 years ago

@bryanchapel Nice find on the plagiarism checker! Yes - we would absolutely love your help implementing this.

Seeing that @Ethan-Arrowood is a bit busy at the moment, and has determined that the Python library definitely doesn't work, you're now our only hope on this.

dhcodes commented 6 years ago

I've looked into this a bit more, and assuming we use a comparison search via a search engine (Google or Bing), we may need to limit the test to only files changed in the PR since the free plan for Google Custom Search now limits you to 100 queries/day. I've looked for alternatives, but there aren't many--Bing also has removed their free plan.

I know Jest can run tests only on changed files, but I'm looking at alternatives as well. I'm not sure if this is a setting on Travis. Still researching.
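One way to picture the "changed files only" idea is to feed the checker the list of paths that `git diff --name-only` prints for the PR's commit range and cap it at the daily query budget. The sketch below covers only that filtering step. The function name, the `.md` filter, and the one-query-per-file budget are assumptions made for illustration, not anything decided in this thread.

```javascript
// Hypothetical helper: given the text printed by `git diff --name-only <range>`,
// pick the markdown files to check without exceeding a daily query budget.
// Assumes (for simplicity) that each file costs one search query.
function selectFilesToCheck(diffOutput, dailyBudget = 100) {
  const markdownFiles = diffOutput
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line.endsWith('.md')); // also drops blank lines
  return {
    toCheck: markdownFiles.slice(0, dailyBudget),
    skipped: markdownFiles.slice(dailyBudget), // would need a manual look
  };
}
```

Reporting the `skipped` list rather than silently dropping it keeps the limit visible to reviewers when a large PR exceeds the budget.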

QuincyLarson commented 6 years ago

@dhcodes Yes - I agree. We should only test files changed in the PR.

@Ethan-Arrowood pointed out that we might want this to be part of our pre-commit step, so that we can point out possible plagiarism to the contributor before it even gets committed. Then if the contributor thinks there's a false positive, they could run the commit task again with --not-plagiarism and it would skip this step, but add a note to the commit description like "plagiarism check skipped" so we'd know to eye-ball their contribution for anything suspicious before accepting the PR.
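The override described above could look something like this in a commit task. It is a sketch only: `--not-plagiarism` and the "plagiarism check skipped" note come from the comment, while the function name and the way the note is appended to the commit message are assumptions.

```javascript
const SKIP_FLAG = '--not-plagiarism';
const SKIP_NOTE = 'plagiarism check skipped';

// Hypothetical helper: decide whether to run the check and what the final
// commit message should be, based on the arguments passed to the commit task.
function planCommit(args, message) {
  const skipCheck = args.includes(SKIP_FLAG);
  return {
    runCheck: !skipCheck,
    // Leave a visible trace so reviewers know to eyeball the contribution.
    message: skipCheck ? `${message}\n\n${SKIP_NOTE}` : message,
  };
}
```

In a real task the `args` array would typically come from `process.argv.slice(2)`.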

Bouncey commented 6 years ago

@dhcodes Travis can run any script you give it; we just have to write it. If any check we make fails, you can process.exit(1) and Travis will mark the commit as build failed.
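A minimal sketch of that pattern, assuming some earlier step has produced a list of flagged files: the function name and the injectable `exit` parameter are hypothetical, the latter added only so the behavior can be exercised without ending the process.

```javascript
// Hypothetical helper: report flagged files and fail the build if there are any.
// `exit` defaults to ending the Node process; a non-zero exit code is what
// makes the CI run count as failed.
function failBuildIfFlagged(flaggedFiles, exit = (code) => process.exit(code)) {
  for (const file of flaggedFiles) {
    console.error(`Possible unattributed copy, needs human review: ${file}`);
  }
  if (flaggedFiles.length > 0) {
    exit(1);
  }
}
```

When nothing is flagged the function simply returns, so the script ends normally with exit code 0.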

@QuincyLarson This check would be better as a Travis check due to the number of PRs coming in via the GitHub GUI. Pre-commit hooks only work when committing locally.

dhcodes commented 6 years ago

@Bouncey yeah I think based on what everyone has said, it may be best to go the PR bot route. I'm currently working on making one in probot, but I'm slow so if someone else wants to give it a go, by all means, go for it.

dhcodes commented 6 years ago

I probably wasn't clear enough in my last post. If anyone wants to work on this, consider it open. There's no assurance that I'll get anything working and there are many other skilled programmers out there who could probably whip something up faster than I.

FrancesCoronel commented 6 years ago

Just to note, there are some PRs that reference this issue but in the interest of maintaining positive contributions, I am marking the PRs that have the majority of the content copied and pasted from external websites as invalid and closing them.

To restate what @davcri has said and what I ultimately agree with, "we are trying to create community content, not a copy-paste website".

davcri commented 6 years ago

@dhcodes could you share some insight into the PR bot here: https://github.com/freeCodeCamp/guides/issues/3315? I would like to know more about PR bots.