Anonymize data for projects server side

friep commented 11 months ago

for projects where status == published_anon:

organization data

only keep sector and legal form

respectively:

org.short_id = null
org.translations = [] // deletes description, website
org.id = null

project outputs

set to [] (avoid spilling information contained in those)

blog posts

set to [] (ditto)

podcast

set to [] (ditto)

project people

set to []

the last three are just a precaution - we shouldn't end up with a podcast episode for a project that has to be anonymized. but better safe than sorry :)
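A minimal sketch of the rules listed above, for illustration only. The function name `anonymizeProject` is hypothetical, and the field names (`Organizations`, `Projects_Outputs`, `Posts`, `Podcast`, `People`) are assumed to match the project objects used elsewhere in this thread. Note that it copies all other project fields unchanged, so it clears specific fields rather than allowlisting.

```javascript
// Sketch only: applies the rules from the list above to one project object.
// `anonymizeProject` is a hypothetical name, not existing code.
function anonymizeProject(proj) {
  // Projects that are not anonymized are returned untouched.
  if (proj.status !== "published_anon") return proj;

  return {
    ...proj,
    // Organizations: only keep sector and legal form.
    Organizations: proj.Organizations.map((org) => ({
      id: null,
      short_id: null,
      translations: [], // drops name, description, website
      sector: org.sector,
      legal_form: org.legal_form,
    })),
    // Precaution: clear everything that could reveal the organization.
    Projects_Outputs: [],
    Posts: [],
    Podcast: [],
    People: [],
  };
}
```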

follow up to #457

KonradUdoHannes commented 11 months ago

From a general safety perspective I would suggest implementing this such that it whitelists what should not be anonymized, rather than blacklisting what should be. If the default is anonymous, accidental exposure of sensitive information is a little less likely. The downside is of course that a project might more easily be anonymous by mistake, but I would expect this to be less critical.
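To illustrate the whitelisting idea, here is a minimal sketch. The helper `pickAllowed`, the constant `ANON_ORG_FIELDS`, and the sample organization (including its `internal_note` field) are made up for this example; only `sector` and `legal_form` come from the issue description.

```javascript
// Hypothetical helper: copy only explicitly allowed keys, drop everything else.
function pickAllowed(source, allowedKeys) {
  const result = {};
  for (const key of allowedKeys) {
    if (Object.prototype.hasOwnProperty.call(source, key)) {
      result[key] = source[key];
    }
  }
  return result;
}

// Assumed allowlist for an anonymized organization.
const ANON_ORG_FIELDS = ["sector", "legal_form"];

// Made-up sample data.
const org = {
  id: 12,
  short_id: "ABC",
  sector: "health",
  legal_form: "e.V.",
  internal_note: "sensitive",
};

const anonOrg = pickAllowed(org, ANON_ORG_FIELDS);
console.log(Object.keys(anonOrg)); // [ 'sector', 'legal_form' ]
```

With this shape, a field that is later added to the source data stays hidden until someone adds it to the allowlist on purpose.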

friep commented 11 months ago

agree on whitelisting 💯 will specify this tomorrow.

friep commented 10 months ago

here's the construction of the project objects:

  projects = projects.filter(proj => ["published", "published_anon"].includes(proj.status));

  let new_objs = [];
      // for each project
      projects.forEach((proj) => {
        // extract whether project is anonymized
        let is_anon = proj.status == "published_anon";

        // flatten organizations and LCs
        proj.Organizations = [...proj.Organizations.map((o) => o.Organizations_id)];
        proj.Local_Chapters = [
          ...proj.Local_Chapters.map((o) => o.Local_Chapters_id),
        ];
        // filter out not public outputs
        proj.Projects_Outputs = proj.Projects_Outputs.filter(
          (out) => out.is_public
        );

        // anonymize People, Posts, Outputs, Podcast
        proj.Projects_Outputs = is_anon ? [] : proj.Projects_Outputs;
        proj.Posts = is_anon ? [] : proj.Posts;
        proj.People = is_anon ? [] : proj.People;
        proj.Podcast = is_anon ? null : proj.Podcast;

        let orgs = [];
        // organizations
        proj.Organizations.forEach((org) => {
          let reduced_org = new Object();

          reduced_org.id = is_anon ? -99 : org.id;
          reduced_org.short_id = is_anon ? "ANO" : org.short_id;
          reduced_org.legal_form = org.legal_form;
          reduced_org.sector = org.sector;

          if (is_anon) {
            reduced_org.translations = [];
          } else {
            reduced_org.translations = org.translations.map((trans) => {
              return {
                name: trans.name,
                website: trans.website,
                description: trans.description,
              };
            });
          }

          orgs.push(reduced_org);
        });
        proj.Organizations = orgs;

        // local chapters
        proj.Local_Chapters = proj.Local_Chapters.map((lc) => {
          return { short_id: lc.short_id, founded: lc.founded };
        });

        // translations
        if (is_anon) {
          proj.translations = proj.translations.map((trans) => {
            return { title: trans.title, summary: trans.summary };
          });
        } else {
          proj.translations = proj.translations.map((trans) => {
            return { title: trans.title, summary: trans.summary, description: trans.description };
          });
        }

        // returning new object
        let proj_obj = (({
          id,
          status,
          start_date,
          end_date,
          project_status,
          team_size,
          is_internal,
          data,
          type,
          language,
          Organizations,
          Projects_Outputs,
          Podcast,
          People,
          Posts,
          translations
        }) => ({
          id,
          status,
          start_date,
          end_date,
          project_status,
          team_size,
          is_internal,
          data,
          type,
          language,
          Organizations,
          Projects_Outputs,
          Podcast,
          People,
          Posts,
          translations
        }))(proj);

        new_objs.push(proj_obj);
      });

friep commented 10 months ago

please refactor for better code - my javascript skills are super rusty!
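One possible direction for such a refactor, as a rough sketch rather than the final implementation: build a new object per project instead of mutating the input, and keep the allowed fields in one list. The names `PROJECT_FIELDS`, `reduceOrganization`, `reduceProject` and `buildProjects` are hypothetical. It assumes the organization `id` should come from `org.id`, and it leaves out `Local_Chapters` because the snippet above does not include them in the returned object either.

```javascript
// Scalar project fields that are copied as-is (allowlist).
const PROJECT_FIELDS = [
  "id", "status", "start_date", "end_date", "project_status",
  "team_size", "is_internal", "data", "type", "language",
];

// Reduce one organization; anonymized ones keep only sector and legal form.
function reduceOrganization(org, isAnon) {
  return {
    id: isAnon ? -99 : org.id,
    short_id: isAnon ? "ANO" : org.short_id,
    legal_form: org.legal_form,
    sector: org.sector,
    translations: isAnon
      ? []
      : org.translations.map(({ name, website, description }) => ({
          name,
          website,
          description,
        })),
  };
}

// Build a reduced copy of one project without mutating the input.
function reduceProject(proj) {
  const isAnon = proj.status === "published_anon";
  const reduced = {};
  for (const field of PROJECT_FIELDS) {
    reduced[field] = proj[field];
  }
  // Flatten the junction objects, then reduce each organization.
  reduced.Organizations = proj.Organizations.map((o) =>
    reduceOrganization(o.Organizations_id, isAnon)
  );
  // Anonymized projects expose no outputs, posts, people or podcast.
  reduced.Projects_Outputs = isAnon
    ? []
    : proj.Projects_Outputs.filter((out) => out.is_public);
  reduced.Posts = isAnon ? [] : proj.Posts;
  reduced.People = isAnon ? [] : proj.People;
  reduced.Podcast = isAnon ? null : proj.Podcast;
  // Anonymized projects keep title and summary but not the description.
  reduced.translations = proj.translations.map((t) =>
    isAnon
      ? { title: t.title, summary: t.summary }
      : { title: t.title, summary: t.summary, description: t.description }
  );
  return reduced;
}

// Keep only published projects and reduce each of them.
function buildProjects(projects) {
  return projects
    .filter((p) => ["published", "published_anon"].includes(p.status))
    .map(reduceProject);
}
```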

KonradUdoHannes commented 10 months ago

@friep sure, no problem. This will happen automatically when integrating with our existing parsing functionality.

Regarding the issue in general, I see two related tasks: anonymizing projects for displaying project cards and anonymizing projects for displaying project subpages. I'll start with the cards, but I was wondering whether subpages for anonymous projects are actually a required use case. @friep do you know whether this will be the case? If so I'll probably create and link a separate issue, as we would need to adjust the project page design for that case.

When I checked this morning, Directus only had one anonymous project without a subpage, which is a good starting point for testing, but I assume there are more to come.

friep commented 10 months ago

yes, there will be more projects like this (actually probably the majority of projects..)

re design: related is #231 where Jonas and I discussed the separate page. we would then link from the daten-nutzen page to the other page, which would also get its own dropdown menu entry. there we had discussed that each project would get a subpage due to the amount of space needed even for anonymous projects (e.g. the project summary can get quite long).

this issue is definitely only about adapting the server-side code. I assumed you could complete this independently (e.g. by filtering client-side for published) before doing client-side things like design.
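A minimal sketch of that interim client-side filter, assuming the server returns an object with a `projects` array. The function name `onlyPublished` and the sample data are made up for illustration.

```javascript
// Hypothetical helper: hide anonymized projects until the client can render them.
function onlyPublished(response) {
  return response.projects.filter((proj) => proj.status === "published");
}

// Made-up sample response.
const response = {
  meta: { n: 2 },
  projects: [
    { id: "A", status: "published" },
    { id: "B", status: "published_anon" },
  ],
};

const visible = onlyPublished(response);
console.log(visible.map((p) => p.id)); // [ 'A' ]
```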

friep commented 10 months ago

made progress on #470 but would wait on working on it further until you have implemented your solution here.

here's the full code of the relevant directus flow building block:

module.exports = async function(data) {
    // only projects that are published or anonymized published
    let projects = data.read_projects;
    projects = projects.filter(proj => ["published", "published_anon"].includes(proj.status));

    let new_objs = [];
    projects.forEach((proj) => {
        // extract whether project is anonymized
        let is_anon = proj.status == "published_anon";

        // flatten organizations and LCs

        proj.Organizations = [
            ...proj.Organizations.map((o) => o.Organizations_id),
        ];
        proj.Local_Chapters = [
            ...proj.Local_Chapters.map((o) => o.Local_Chapters_id),
        ];
        // filter out not public outputs
        proj.Projects_Outputs = proj.Projects_Outputs.filter(
            (out) => out.is_public
        );

        // anonymize People, Posts, Outputs, Podcast
        proj.Projects_Outputs = is_anon ? [] : proj.Projects_Outputs;
        proj.Posts = is_anon ? [] : proj.Posts;
        proj.People = is_anon ? [] : proj.People;
        proj.Podcast = is_anon ? null : proj.Podcast;

        let orgs = [];
        // outputs
        proj.Projects_Outputs.forEach((output) => {
            output.translations = output.translations.map((trans) => {
                return {
                    language: trans.languages_code.code,
                    description: trans.description,
                };
            });
        });

        // organizations
        proj.Organizations.forEach((org) => {
            let reduced_org = new Object();

            reduced_org.id = is_anon ? -99 : org.id;
            reduced_org.short_id = is_anon ? "ANO" : org.short_id;
            reduced_org.legal_form = org.legal_form;
            reduced_org.sector = org.sector;

            if (is_anon) {
                reduced_org.translations = [];
            } else {
                reduced_org.translations = org.translations.map((trans) => {
                    return {
                        language: trans.languages_code.code,
                        name: trans.name,
                        website: trans.website,
                        description: trans.description,
                    };
                });
            }
            orgs.push(reduced_org);
        });
        proj.Organizations = orgs;

        // local chapters
        proj.Local_Chapters = proj.Local_Chapters.map((lc) => {
            return { short_id: lc.short_id, founded: lc.founded };
        });

        // translations / project description
        if (is_anon) {
            proj.translations = proj.translations.map((trans) => {
                return {
                    language: trans.languages_code.code,
                    title: trans.title,
                    summary: trans.summary,
                };
            });
        } else {
            proj.translations = proj.translations.map((trans) => {
                return {
                    language: trans.languages_code.code,
                    title: trans.title,
                    summary: trans.summary,
                    description: trans.description,
                };
            });
        }

        // returning new object
        let proj_obj = (({
            id,
            status,
            date_updated,
            start_date,
            end_date,
            project_status,
            team_size,
            is_internal,
            data,
            type,
            language,
            Organizations,
            Projects_Outputs,
            Podcast,
            People,
            Posts,
            translations,
        }) => ({
            id,
            status,
            date_updated,
            start_date,
            end_date,
            project_status,
            team_size,
            is_internal,
            data,
            type,
            language,
            Organizations,
            Projects_Outputs,
            Podcast,
            People,
            Posts,
            translations,
        }))(proj);

        new_objs.push(proj_obj);
    });
    let meta = {
        last_published: Date.now(),
        last_updated: new Date(
            Math.max(...new_objs.map((e) => new Date(e.date_updated)))
        ),

        n: new_objs.length,
    };

    return {
        meta: meta,
        projects: new_objs,
    };
}
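One edge case in the `meta` block that may be worth guarding: if no project passes the status filter, `Math.max()` is called without arguments and returns `-Infinity`, so `last_updated` becomes an invalid date. A possible guard, as a sketch (the function name `latestUpdate` is hypothetical):

```javascript
// Hypothetical helper: newest date_updated across projects, or null if none.
function latestUpdate(projects) {
  const timestamps = projects
    .map((p) => new Date(p.date_updated).getTime())
    // Skip missing or unparsable dates, which yield NaN.
    .filter((t) => !Number.isNaN(t));
  return timestamps.length > 0 ? new Date(Math.max(...timestamps)) : null;
}
```

In the flow above this could be used as `last_updated: latestUpdate(new_objs)`.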

in a previous attempt i also wrote a graphql query. maybe that's useful:

query Project {
    Projects(filter: { status: { _in: ["published", "published_anon"] } }) {
        id
        status
        date_updated
        project_status
        start_date
        end_date_predicted
        end_date
        is_internal
        team_size
        data
        type 
        language
        Podcast {
            id
            soundcloud_link
            title
            description
        }

        Projects_Outputs {
            url
            output_type
            is_public
            translations {
                description
                languages_code {
                    code
                }
            }
        }
        Organizations {
            Organizations_id {
                id
                short_id
                legal_form
                sector

                translations {
                    languages_code {
                        code
                    }
                    website
                    name
                    description
                }
            }
        }

        translations {
            title
            description
            summary
            languages_code {
                code
            }
        }
        Local_Chapters {
            Local_Chapters_id{
                id
                short_id
                founded
            }

        }
    }
}

KonradUdoHannes commented 10 months ago

I've merged some functionality into production that so far only focusses on the project cards on the project overview page. The new functionality can be seen when scrolling through the production page: there is a single new project card with an "Anonymous Organization", which is created by the one anonymous published project that we currently have set up.

Project "slug" pages are not implemented yet, but I don't expect this to be too much trouble. And it's probably easier to discuss once we have concrete examples for this as well.

Also projects linked on the LC pages directly don't use anonymization yet (but they also don't fetch anonymous projects).

@friep regarding #470, I'm not sure the issues are too closely related. The website only parses the information that it displays, so some data does not need to be anonymized because it's never extracted to begin with. Nonetheless, if there is anything I'm not aware of where output/design from this issue can help with #470, I'm happy to discuss it further.

friep commented 10 months ago

from what i understand, this lgtm. in the future (#231), we'd need the sector and legal_form of the organization even if anonymized. however i think this is something that can be added later on when working on this.

the slug is actually interesting because so far, they were the project ids (e.g. 2020-03-ERL), with the last three letters typically having some sort of relation to the organization (in the example, ERLassjahr). sometimes this would be very apparent, e.g. if it was Arbeiterwohlfahrt -> AWO. this is why i have left them out so far. but I think i'll just give out new IDs to those projects that focus more on the content, not the org

LCs not fetching those projects yet is ok for me.

from my POV we can close this issue.

KonradUdoHannes commented 10 months ago

Ok, then let's close this for now. I'm also in favor of parsing and potentially anonymizing additional fields once they become relevant, e.g. when implementing #231.

CorrelAid / correlaid_website