Wesleyan-Media-Project / creative_overview

An overview of all repos belonging to the CREATIVE project

[IMPORTANT] Final checks of creative tracking: to be completed #16

Closed Meiqingx closed 4 months ago

Meiqingx commented 9 months ago

Final checks to be completed for CREATIVE deliverables:

For all repos listed on the Creative Tracking Sheet:

  1. A) Check whether any 2020 data files or scripts with _118m in their filenames exist, for example, 04_prepare_118m.R in the ad_goal_classifier repo. (These files are tied to an outdated version of the 2020 dataset, with size N = 118m, or are scripts that use it; the 2020 dataset was later updated to _140m, with size N = 140m, so only the _140m versions are the right ones.) B) If any exist, remove those data files and scripts from our repos altogether. A minimal search sketch follows this item.
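
A minimal sketch of the step 1A check, in R since the project's scripts are R. It assumes you run it from the repo root; the filename pattern is based on the example above.

```r
# Sketch: list every file in this repo whose name contains "_118m".
# Run from the repo root; anything printed is a candidate for removal (1B).
stale_files <- list.files(path = ".", pattern = "_118m", recursive = TRUE)
print(stale_files)
```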

  2. Based on the Google Drive Mentions sheet, for scripts that import any _118m data files: 1) if the script is only intended to produce a _118m table or model, remove the entire script from the repo; 2) if it is not, bring it to us in the work-coordination Slack channel and we can discuss it case by case. A scanning sketch follows this item.
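
A rough sketch of the step 2 scan: it searches the contents of all .R scripts for references to _118m files so each hit can be checked against the Google Drive Mentions sheet. Restricting to .R files is an assumption; adjust the pattern if a repo also contains Python or other scripts.

```r
# Sketch: report every line in every .R script that mentions "_118m".
scripts <- list.files(path = ".", pattern = "\\.R$", recursive = TRUE)
for (s in scripts) {
  hits <- grep("_118m", readLines(s, warn = FALSE), value = TRUE)
  if (length(hits) > 0) {
    cat(s, "\n", paste0("  ", hits, collapse = "\n"), "\n")
  }
}
```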

  3. Clean up mentions of files hosted on Google Drive: based on the Google Drive Mentions sheet (Input Files tab), for any input file previously hosted on Google Drive, as long as it is not a final output table (the eight tables - "text", "var1", "var", and "cid" for 2022 FB/Google - listed in data-post-production), go to the lines in the scripts where it is imported and add a comment indicating the upstream repo where the input file was produced.

    • Use this simple language in the comment, placed on the line above the data import: "output from [insert upstream repo name]" (see the hypothetical example after this item).
    • I have already done this for the data-post-production repo, so you may skip this one.
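
A hypothetical illustration of the step 3 convention; the file path and object name below are placeholders rather than real project files, and only the comment format follows the rule above.

```r
# output from [insert upstream repo name]
input_data <- read.csv("data/some_input_file.csv")  # hypothetical import line
```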
  4. Check whether any mentions of or links to restricted (internal) docs/sheets/files exist (such as a link to this slide deck). If they do, remove them all. If you are not sure, ask us. A rough link-scanning sketch follows this item.
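
For step 4, a rough sketch that flags lines in scripts and markdown files linking to Google Docs/Drive, which are likely internal-only. The file extensions and domain list are assumptions; review each hit manually rather than deleting blindly.

```r
# Sketch: print file names and line numbers that contain Google Docs/Drive links.
files <- list.files(path = ".", pattern = "\\.(R|Rmd|md|txt)$", recursive = TRUE)
link_pattern <- "docs\\.google\\.com|drive\\.google\\.com"
for (f in files) {
  hits <- grep(link_pattern, readLines(f, warn = FALSE))
  if (length(hits) > 0) {
    cat(f, "-> line(s):", hits, "\n")
  }
}
```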

  5. Add the new pipeline diagram to the README in each repo once it is completed.

  6. Edit the Objective section in each README to align with the updated pipeline diagram. The previous four categories in the Objective section were:

    • Data Collection
    • Data Storage & Processing
    • Preliminary Data Classification
    • Final Data Classification

New categories (minor differences):

Hopefully these can all be done together once the new pipeline diagram is generated, if nothing else is pressing. I'm documenting it here. Let me know if you have any questions!

SebastianZimmeck commented 4 months ago

@Meiqingx, is there anything remaining to be done here? If not, let's close this issue.

Meiqingx commented 4 months ago

Yes, it's completed. I'm closing it!