cc-archive / cccatalog-frontend

[PROJECT TRANSFERRED] CC Search is a search tool for CC-licensed and public domain content across the internet.
https://github.com/WordPress/openverse-frontend

[META] User Flag Functionality #848

Closed: annatuma closed this issue 4 years ago

annatuma commented 4 years ago

This is a meta ticket, for a feature that requires work across the frontend, catalog, and API.

Overview & Concept

We’d like to introduce a feature to CC Search to allow users to report problematic content quickly and easily. This is important for several reasons.

This feature will require an addition to the CC Search UI for users to interact with, as well as backend support for recording user reports and taking action on those. It is important to note that nothing happens automatically to content when it is reported by a user. Any decision to hide or remove content from CC Search or mark it as NSFW in the Catalog will be made by CC Staff.

Frontend functionality

Reporting UI

See the initial wireframes and the Figma mockup to base the flow on.

We will add a link to the top right corner of the Single Results Page, with a Flag icon, and the text “Report”.

Clicking on “Report” pops out a modal, which contains the following:

What is wrong with this {content type:image}?

These are the frontend interactions that happen for each of these selections:

“{content type:Image} infringes copyright.”: in a new tab, load the DMCA form; in the modal, show a Thank You message.

“{content type:Image} is Mature Content.”: in the modal, show a Thank You message.

“Other.”: a new screen loads in the modal, with the header “Please describe the issue for us”, a large text input field, and a [Submit Issue] button. After the user hits [Submit Issue], show a Thank You message.

Note: we’ll substitute the actual content type wherever {content type} appears in curly brackets, based on the content type of the work on the individual result page the button renders on. Ticket: https://github.com/creativecommons/cccatalog-frontend/issues/425
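
As an illustration of this flow (not part of the spec), here is a minimal sketch of the report submission the modal might send; the endpoint URL and payload field names are assumptions, not the actual API contract:

```python
# Hypothetical sketch of the report submission behind the modal; the endpoint
# URL and payload field names are assumptions, not the actual API contract.
import requests

payload = {
    "identifier": "f2105f58-0000-0000-0000-000000000000",  # CC Search unique ID
    "reason": "other",                                     # dmca | mature | other
    "description": "Please describe the issue for us.",    # only for "other"
}
response = requests.post(
    "https://api.example.org/image/report", json=payload, timeout=10
)
response.raise_for_status()  # on success, the modal shows the Thank You message
```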

Filter

We want to add a filter, at the very bottom of the filter list, called “Search Settings”. It should have one option for users to select, called “Show Mature Content”. Ticket: https://github.com/creativecommons/cccatalog-frontend/issues/435

Backend process

Recording User Actions

These are the backend interactions that happen for each of the frontend user selections:

“{content type:Image} infringes copyright.”: add to the table with the “DMCA” flag and an outclick indicator. (Currently, Legal takes action based on responses to the DMCA form, handled via email.)

“{content type:Image} is Mature Content.”: add to the table with the “Mature” flag.

“Other” and text submission: add to the table with the issue description in a field.

All reported content is stored in a table as soon as it is reported by a user. Two things happen when new content is added to the table.

Table Fields

This is the information we should store in the table:

  1. CC Search unique ID
  2. CC Search record URL
  3. Date and time
  4. Report Type: dmca (if we include this), illegal, mature, other
  5. Report Description (applies only to “other” reports)
  6. Reviewed: yes/no
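
As a sketch only, the table above might map to a Django model like the following; the model and field names are assumptions, and the action field is not in the list above but is implied by the Table Interface Options below:

```python
# Hypothetical Django model for the report table; names are illustrative.
from django.db import models


class ContentReport(models.Model):
    REPORT_TYPES = [
        ("dmca", "DMCA"),
        ("illegal", "Illegal"),
        ("mature", "Mature"),
        ("other", "Other"),
    ]

    identifier = models.UUIDField()                        # 1. CC Search unique ID
    record_url = models.URLField()                         # 2. CC Search record URL
    created_at = models.DateTimeField(auto_now_add=True)   # 3. date and time
    report_type = models.CharField(max_length=10, choices=REPORT_TYPES)  # 4.
    description = models.TextField(blank=True)             # 5. only for "other"
    reviewed = models.BooleanField(default=False)          # 6. yes/no
    # Assumed field: records which action a reviewer took ("mark_mature",
    # "deindex", or "none"); implied by the Table Interface Options below.
    action = models.CharField(max_length=20, blank=True)
```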

Table Interface Options

After content is reviewed, we’d like to continue to store it in the table, so we have a record of all user reports. However, any content already reviewed and therefore marked “yes” should by default not be shown in (i.e. be filtered out of) the Django admin interface. A backend user of the interface should be able to select it for inclusion, should they need to review prior actions.

The following actions should be available for each row in the table:

Mark Mature

De-Index Content

Do Nothing: clicking this only changes the status of the record from “no” (not yet reviewed) to “yes” (reviewed).

After any available action is taken, the status of the record is updated from “no” (not yet reviewed) to “yes” (reviewed).
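
A possible Django admin configuration for this behavior, building on the hypothetical ContentReport model sketched under Table Fields (filter, action, and field names remain assumptions):

```python
# Sketch of the admin interface: reviewed rows hidden by default but
# selectable, and every action flips the record to reviewed.
from django.contrib import admin

from .models import ContentReport  # the hypothetical model sketched above


class ReviewedFilter(admin.SimpleListFilter):
    title = "reviewed"
    parameter_name = "reviewed"

    def lookups(self, request, model_admin):
        return [("yes", "Reviewed"), ("all", "All reports")]

    def queryset(self, request, queryset):
        if self.value() == "yes":
            return queryset.filter(reviewed=True)
        if self.value() == "all":
            return queryset
        return queryset.filter(reviewed=False)  # default: not yet reviewed


@admin.register(ContentReport)
class ContentReportAdmin(admin.ModelAdmin):
    list_display = ("identifier", "report_type", "created_at", "reviewed")
    list_filter = (ReviewedFilter,)
    actions = ("mark_mature", "deindex_content", "do_nothing")

    def mark_mature(self, request, queryset):
        queryset.update(action="mark_mature", reviewed=True)

    def deindex_content(self, request, queryset):
        queryset.update(action="deindex", reviewed=True)

    def do_nothing(self, request, queryset):
        # Only changes the status from "no" (not yet reviewed) to "yes".
        queryset.update(action="none", reviewed=True)
```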

Data Processing

As indicated in the Table Interface Options, data processing needs to take place based on the actions taken.

We need to write a script, set to run every 12 hours, that looks for rows where the status of the record has been set to “yes” since the script last ran, and where the action taken in the table is either “Mark Mature” or “De-Index Content”.

Any record with the action “Mark Mature” should have “Mature” added as metadata to the content record in the Catalog.

Any record with the action “De-Index Content” should use the existing endpoint (https://github.com/creativecommons/cccatalog-api/issues/294) for copyright takedown.
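
A minimal sketch of that job, assuming the hypothetical ContentReport model above; mark_mature and deindex are stand-ins for the Catalog metadata update and the existing takedown endpoint:

```python
# Sketch of the periodic job (cron or a Django management command, every 12
# hours), assuming the hypothetical ContentReport model sketched earlier.

def mark_mature(identifier):
    """Stub: add “Mature” metadata to the content record in the Catalog."""

def deindex(identifier):
    """Stub: call the existing copyright-takedown endpoint (cccatalog-api#294)."""

def process_reviewed_reports():
    reports = ContentReport.objects.filter(
        reviewed=True, action__in=("mark_mature", "deindex")
    )
    # A real job would also track which rows were already processed since the
    # last run (e.g. a timestamp or processed flag); omitted for brevity.
    for report in reports:
        if report.action == "mark_mature":
            mark_mature(report.identifier)
        else:
            deindex(report.identifier)
```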

The feature spec, including the Internal Process and Future Iterations, is available to CC Staff here.

annatuma commented 4 years ago

Outstanding action items:

1. @aldenstpage we need a ticket (or tickets) for the backend processes here. Please review the spec, create ticket(s), and update the description to include those.

2. @kgodey @brenoferreira does the backend work need to be complete before any work can be done on the UI? If so, please add "blocked" labels to the frontend tickets.

3. @mathemancer can you confirm that we are indeed pulling in mature/NSFW metadata from some sources, and that these can be referenced when @aldenstpage works on https://github.com/creativecommons/cccatalog-api/issues/339?

aldenstpage commented 4 years ago

Hey @annatuma, I've updated the tickets with the required work for the backend. ~Could you take a look at https://github.com/creativecommons/cccatalog-api/issues/474 and fill out the list of potential reasons users will be able to report images for the NSFW action interface?~ All of the details I need are here, thanks!

kgodey commented 4 years ago

We're going to assign this to everyone on the team for review, in the following order. Once you're done reviewing it, please assign it to the next person in the queue: @aldenstpage @mathemancer @brenoferreira @kgodey

aldenstpage commented 4 years ago

(Deleted my question; I read the DMCA page and see that there's a Google form, and understand the desired DMCA flow now)

mathemancer commented 4 years ago

On the NSFW front, I'm not aware of any source where we're pulling content that we know is mature.

mathemancer commented 4 years ago

I'm quite concerned about potential abuse of the flagging system. This is a common vector for censorship attacks on the internet at the moment. I'd like to make sure we have the necessary data to detect such an attack, and to prevent abusive flagging.

On the other hand, preventing abuse of the flagging system could entail collecting data to identify abusive flaggers, and I am quite uncomfortable with keeping any data on users. I think we should make sure that we don't collect any data without good reason, and fair warning to the user.

One option might be storing an anonymized IP address in the table as well (with a large, bold warning on the submission form that this data will be stored).

The alternative idea I had was to ask for an email address and send a one-time link to the 'real' form to the address they submit (thereby validating the email). We could then store an anonymized version of the validated email address.
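
A small sketch of what such anonymization might look like, for either an IP address or a validated email; the truncation widths and salting scheme are assumptions:

```python
# Hypothetical anonymization helpers: truncate, then hash with a salt so the
# raw identifier is never stored.
import hashlib
import ipaddress


def anonymize_ip(ip: str, salt: str) -> str:
    addr = ipaddress.ip_address(ip)
    prefix = 24 if addr.version == 4 else 48  # assumed truncation widths
    network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    return hashlib.sha256(f"{salt}{network.network_address}".encode()).hexdigest()


def anonymize_email(email: str, salt: str) -> str:
    return hashlib.sha256(f"{salt}{email.strip().lower()}".encode()).hexdigest()
```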

mathemancer commented 4 years ago

> Any record with the action “Mark NSFW” should: Have “NSFW” added as metadata to the content record in the Catalog
>
> Any record with the action “De-Index Content” should: Use the existing endpoint (creativecommons/cccatalog-api#294) for copyright takedown.

To my understanding, the scope of this (meta) issue is limited to getting info into the DB at the service layer. @annatuma Is that correct? If so, do we want further issues to propagate that info back into the data layer, or should it stay in the service DB only? If it stays in the service DB only, we won't be able to use the meta_data field for holding the NSFW tag; we'd need some other plan there.

brenoferreira commented 4 years ago

LGTM.

@annatuma the frontend work can start without the API work having to be complete. The API spec is enough for the work to start.

kgodey commented 4 years ago

> “{content type:Image} is Adult Content.”

I think we need to standardize on a single terminology here. The reporting UI calls it "Adult", the backend flag is "NSFW", and the filter is "Mature". This is going to cause confusion later. We should call it one thing everywhere.

> We need to write a script, set to run every 12 hours, that looks for rows where the status of the record has been set to “yes” since the script last ran, and where the action taken in the table is either “Mark NSFW” or “De-Index Content”.

We're building API endpoints for this. Why do we want to run it every 12 hours instead of immediately? cc @aldenstpage

Additionally, I share @mathemancer's concerns about trying to identify and minimize abuse of the reporting functionality. We should address this in the design. I like the idea of storing anonymized IP addresses, but we need to update the mockup to clearly let people know that we're storing their anonymized IP address.

kgodey commented 4 years ago

@annatuma assigning back to you for review of all the comments and potential updates to the spec.

annatuma commented 4 years ago

> Any record with the action “Mark NSFW” should: Have “NSFW” added as metadata to the content record in the Catalog Any record with the action “De-Index Content” should: Use the existing endpoint (creativecommons/cccatalog-api#294) for copyright takedown.
>
> To my understanding, the scope of this (meta) issue is limited to getting info into the DB at the service layer. @annatuma Is that correct? If so, do we want further issues to propagate that info back into the data layer, or should it stay in the service DB only? If it stays in the service DB only, we won't be able to use the meta_data field for holding the NSFW tag; we'd need some other plan there.

I'd prefer engineering input on what makes sense here. What's the use case for storing this in the data layer? Is there a reason the service layer doesn't suffice? The service layer is clearly enough for the frontend, but if there are reasons this should go back to the data layer, we should evaluate those. @mathemancer @aldenstpage @kgodey please weigh in.

annatuma commented 4 years ago

> “{content type:Image} is Adult Content.”
>
> I think we need to standardize on a single terminology here. The reporting UI calls it "Adult", the backend flag is "NSFW", and the filter is "Mature". This is going to cause confusion later. We should call it one thing everywhere.

Fine by me. Updating it to "Mature" everywhere.

> We need to write a script, set to run every 12 hours, that looks for rows where the status of the record has been set to “yes” since the script last ran, and where the action taken in the table is either “Mark NSFW” or “De-Index Content”.
>
> We're building API endpoints for this. Why do we want to run it every 12 hours instead of immediately? cc @aldenstpage

Immediately would be better; the 12-hour run came out of our earlier conversations about this feature. That said, whatever @aldenstpage says is feasible here (12 hours versus immediate) is fine by me.

> Additionally, I share @mathemancer's concerns about trying to identify and minimize abuse of the reporting functionality. We should address this in the design. I like the idea of storing anonymized IP addresses, but we need to update the mockup to clearly let people know that we're storing their anonymized IP address.

I'll sync with @sarahpearson about the appropriate way to do this.

kgodey commented 4 years ago

@aldenstpage please update the API spec to say "mature" instead of "nsfw" per @annatuma's update above.

aldenstpage commented 4 years ago

> To my understanding, the scope of this (meta) issue is limited to getting info into the DB at the service layer. @annatuma Is that correct? If so, do we want further issues to propagate that info back into the data layer, or should it stay in the service DB only? If it stays in the service DB only, we won't be able to use the meta_data field for holding the NSFW tag; we'd need some other plan there.

I envisioned this as getting propagated back to the data layer through Kafka as described here; in my mind, you would handle these events by setting the appropriate flag in the meta_data field, but you can represent it however you want.
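
As a sketch of that envisioned design only: the topic name, event shape, and the meta_data update helper below are all assumptions for illustration.

```python
# Hypothetical consumer for moderation events flowing back to the data layer;
# the topic name, event shape, and update helper are assumptions.
import json

from kafka import KafkaConsumer  # kafka-python


def set_meta_data_flag(identifier, key, value):
    """Stub: set a flag inside the catalog record's meta_data field."""


consumer = KafkaConsumer("moderation-events", bootstrap_servers="localhost:9092")
for message in consumer:
    event = json.loads(message.value)
    if event.get("action") == "mark_mature":
        set_meta_data_flag(event["identifier"], "mature", True)
```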

An alternative design is to keep all logic related to mature content and DMCAs in the API layer in separate tables. We could perform mature content detection during ingestion (or in the future, a proper ETL pipeline). However, if you are able to detect that something is marked mature upstream (e.g. you find a way to get the NSFW flag from the Flickr API), you should still mark it in the meta_data field (or wherever else) so we can flag those images during ingestion. This would remove all responsibility for handling DMCAs and flagging from the data layer. I'll take this approach instead if you don't have any objections.

> Additionally, I share @mathemancer's concerns about trying to identify and minimize abuse of the reporting functionality. We should address this in the design. I like the idea of storing anonymized IP addresses, but we need to update the mockup to clearly let people know that we're storing their anonymized IP address.

We temporarily store IPs on our servers for rate limiting purposes and store them with the last octet erased in our server logs for up to 3 months; we should have the technical means to catch abuse already. Data retention practices are described in our privacy policy.

> We're building API endpoints for this. Why do we want to run it every 12 hours instead of immediately? cc @aldenstpage

There's no technical reason for this to take 12 hours, particularly if we are able to avoid having to track NSFW status in both the catalog and API layers.

mathemancer commented 4 years ago

Long term, I'd want the flags to be in the data layer, since that would be part of any pipeline where we, say, used ML to figure out if other unflagged pictures might be mature content. Ideally, any metadata that was just about displaying the image would stay in the service layer, and any metadata that was 'inherent' to the image would end up in the data layer eventually. I think whether the image is considered 'Mature' or not falls under the latter.

But, we don't have a specific use for the 'Mature' flag in the data layer at the moment, so maybe I'm just falling prey to YAGNI fallacies.

aldenstpage commented 4 years ago

Your last point is a good one; we can always extract the data later for training. I think it is also debatable whether matureness is inherent since it is highly contextual to culture and personal bias; we're taking a "stance" that an image is mature based on what we think our audience wants, and that's probably going to be a moving target. I'll stick with keeping it internal to the API layer now and we can bridge that to the catalog later if the need arises.

annatuma commented 4 years ago

@brenoferreira we have the text to add to the first screen of the user reporting form regarding collection of IP addresses:

"For security purposes, CC collects and retains anonymized IP addresses of those who complete and submit this form."

@panchovm is updating the mockup right now. Thanks to @sarahpearson for review and language.

fcoveram commented 4 years ago

The mockup is now updated with this text.

annatuma commented 4 years ago

Everything is live in production and looks good.

We have a couple of minor follow-up issues (as is to be expected).