Final checks to be completed for CREATIVE deliverables:

For all repos listed on the Creative Tracking Sheet:

A) Check whether any 2020 data files or scripts with _118m in the filenames exist, for example, 04_prepare_118m.R in the ad_goal_classifier repo. (These files are related to an outdated version of the 2020 dataset -- with size N=118m -- or are scripts that use it. The 2020 dataset was eventually updated to _140m, with size N=140m, so if you encounter any 2020 datasets or scripts, only the _140m versions are the right ones.) B) If any exist, remove the data/scripts altogether from our repos.
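A minimal R sketch of the filename check, run from a repo's root directory (the path and pattern here are assumptions, not a prescribed method):

```r
# List every file in the repo whose name contains "_118m"
# (e.g., 04_prepare_118m.R); these are candidates for removal.
stale_files <- list.files(
  path = ".",
  pattern = "_118m",
  recursive = TRUE,
  full.names = TRUE
)
print(stale_files)
```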
Based on the Google Drive Mentions sheet, for scripts that import any _118m data files: 1) if the script is only intended to produce a _118m table or model, remove the entire script from the repo; 2) if it is not, bring it to us in the work-coordination Slack channel and we can discuss it case by case.
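To cross-check the sheet, a hedged sketch that scans a repo's R scripts for _118m imports (the file extension and pattern are assumptions):

```r
# Flag R scripts whose contents mention "_118m" -- likely importers
# of the outdated dataset; review each against the Mentions sheet.
scripts <- list.files(".", pattern = "\\.R$", recursive = TRUE, full.names = TRUE)
importers <- scripts[vapply(
  scripts,
  function(f) any(grepl("_118m", readLines(f, warn = FALSE), fixed = TRUE)),
  logical(1)
)]
print(importers)
```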
Clean up mentions of "Google Drive"-hosted files: based on the Google Drive Mentions sheet (Input Files tab), for any input file previously hosted on Google Drive, as long as it is not a final output table (the eight tables -- "text", "var1", "var", and "cid" for 2022 FB/Google -- listed in data-post-production), go to the lines in the scripts where it is imported and add a comment indicating the upstream repo of the dataset (where the input file was produced).
Use this simple language in the comment, placed on top of the data import line: "output from [insert upstream repo name]".
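For instance, assuming ad_goal_classifier is the upstream repo (the file path is made up for illustration), the annotated import would look like:

```r
# output from ad_goal_classifier
ads <- readRDS("data/ad_goals_140m.rds")  # hypothetical path, for illustration
```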
I have already done this for the data-post-production repo, so you may skip this one.
Check whether any mentions of or links to restricted (internal) docs/sheets/files exist (such as a link to this slide deck). If they do, remove them all. If you are not sure, ask us.
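A possible R sketch for surfacing such links across a repo's text files (the URL patterns are assumptions about what counts as restricted; broaden as needed):

```r
# Print lines linking to Google docs/sheets/slides so they can be
# reviewed and removed; adjust the pattern for other internal hosts.
files <- list.files(".", pattern = "\\.(R|Rmd|md)$", recursive = TRUE, full.names = TRUE)
for (f in files) {
  hits <- grep("docs\\.google\\.com|drive\\.google\\.com",
               readLines(f, warn = FALSE), value = TRUE)
  if (length(hits) > 0) cat(f, "\n", paste(hits, collapse = "\n"), "\n")
}
```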
Add the new pipeline diagram to the README in each repo once the diagram is completed.
Edit the Objective section in the README to align with the updated pipeline diagram:
Previous four categories in the Objective section:
- Data Collection
- Data Storage & Processing
- Preliminary Data Classification
- Final Data Classification

New categories (minor differences):
- Data Collection
- Data Processing
- Data Classification
- Compiled Final Data
Hopefully these can be done together after the new pipeline diagram is generated, if nothing else is pressing. Documenting it here. Let me know if you have any questions!