Wesleyan-Media-Project / creative_overview

An overview of all repos belonging to the CREATIVE project

Audit repos #4

Closed SebastianZimmeck closed 1 year ago

SebastianZimmeck commented 1 year ago

@IneshV will audit the repos' READMEs. Let's add the links to the repos here.

SebastianZimmeck commented 1 year ago

@IneshV, which repos have you audited so far? Are those all good to go? It would be great if you could maintain a list here in an issue comment with links to the repo and whether they are good to go.

@markusneumann, @jyao01, @pvo70, could you document your repos and post the links here when they are ready for @IneshV 's review?

IneshV commented 1 year ago

I have been looking at issue_classifier, ABSA, race_of_focus, party_classifier, party_classifier_ldid, ad_tone, ad_goal_classifier, entity_linking, and entity_linking_2022. I think I may have rushed through them, and I would like to go through each one again before giving the green light. I have sent a few questions to Markus, and I should be able to finalize these repos by Friday.

SebastianZimmeck commented 1 year ago

Thanks, @IneshV!

Please update this comment on a regular basis to keep a list of the reviewed repos and their current status.

As we discussed, please ask repo-specific questions in issues of the respective repos and broader questions relating to multiple repos here.

IneshV commented 1 year ago

Hi, I just finished looking at ABSA.

In the inference folder (https://github.com/Wesleyan-Media-Project/ABSA/tree/main/inference), I was able to run 01_prepare_fb_2022.R, but 01_prepare_fb_2020.R was not working for me. The error came from loading the file on line 11, but I think the path was right. I thought this might be because the file in ABSA's data folder is actually called "intermediate_separate_generic_absa_training_data.rdata", but it still was not able to load. I haven't used the load function frequently in R, so I'm not sure what the error was. Because this file was not loading, I was unable to run 02_rf_140m.py.
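Not from the repo itself, but a quick way to debug this kind of load failure is to check how the relative path actually resolves from the working directory. A minimal Python sketch (the target path is whatever the failing script expects; the same check works for any file):

```python
from pathlib import Path

def diagnose_path(relative_path: str) -> str:
    """Report how a path resolves from the current working directory."""
    p = Path(relative_path).resolve()
    if p.exists():
        return f"OK: {p}"
    parent = p.parent
    if parent.exists():
        # Listing the parent directory often reveals a renamed file,
        # e.g. a different .rdata filename than the script expects.
        names = ", ".join(sorted(f.name for f in parent.iterdir())[:10])
        return f"missing: {p} (parent contains: {names})"
    return f"missing: {p} (parent directory does not exist)"
```

For example, running `diagnose_path("data/intermediate_separate_generic_absa_training_data.rdata")` from the repo root (path hypothetical) would show whether the .rdata file is where line 11 expects it.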

I was unable to load anything in the train folder. Line 10 of 01_prepare_separate_generic_absa.R (https://github.com/Wesleyan-Media-Project/ABSA/blob/main/train/01_prepare_separate_generic_absa.R) calls a file that is not in the fb_2022 folder. The second script in this folder worked well.

jyao01 commented 1 year ago

@IneshV face_url_scraper_2022 is ready for you to review. Let me know if you encounter any problems. Thanks!

markusneumann commented 1 year ago

@IneshV, first, you should always run the training before the inference.

On the inference, please provide the error message; otherwise, I can't know what the error is. I ran this yesterday on a computer to which I had just freshly cloned the repo, and it worked for me.

On the training, that file is on the Google Drive because it is large.

IneshV commented 1 year ago

Got it, I'll check it again and let you know if I get an error message.

IneshV commented 1 year ago

Line 17 of race_of_focus_140m.R references the file ../entity_linking/facebook/data/entity_linking_results_140m_notext_combined.csv.gz, which is not on GitHub or the drive.

Also, line 18 of race_of_focus_google_2020.R states path_wmpent <- "../datasets/wmp_entity_files/Google/2020/wmp_google_entities_v040521.dta". This file was not on GitHub, and I could not find it in the datasets folder on the drive.

markusneumann commented 1 year ago

Line 17 of race_of_focus_140m.R references the file ../entity_linking/facebook/data/entity_linking_results_140m_notext_combined.csv.gz, which is not on GitHub or the drive.

Also, line 18 of race_of_focus_google_2020.R states path_wmpent <- "../datasets/wmp_entity_files/Google/2020/wmp_google_entities_v040521.dta". This file was not on GitHub, and I could not find it in the datasets folder on the drive.

Both of those files are, and already were, on Github. Did you pull the most recent version?

atlasharry commented 1 year ago

I have run party_classifier this week. I haven't encountered any issues running the code, and everything in this repo is well organized.

requirements_py.txt

requirements_r.txt

candace-walker commented 1 year ago

I have looked over face_scrapper_2022 and checked all file dependencies.

Warnings (these don't halt the code, but I'd like to point them out):

Questions:

IneshV commented 1 year ago

Auditing issue_classifier: https://github.com/Wesleyan-Media-Project/issue_classifier

Line 15 of 11_fb18.R calls load("data/all_features_fb.RData"), but this file is not in the data folder.

Line 12 of 13_combine_18_20 says fb20 <- fread("data/fbel_prepared_issues.csv", encoding = "UTF-8"), but this file does not exist.

In line 35 of 32_issue_clf_inference_binary_rf.py, No such file or directory: 'models/binary_rf_v1/issues_rf_ISSUE10.joblib'

In line 15 of 52_abortion_inflation_keyword_vs_randomforest.py, No such file or directory: 'data/tv_asr_keyword_clf.csv'

In line 16 of 53_issue_clf_inference_binary_rf_abortion_inflation.py, No such file or directory: 'data/inference/fb2022.csv.gz'

inference_multilabel_trf_v1.ipynb failed because torch was not installed, and then: No such file or directory: 'models/multilabel_trf_v1/pytorch_model.bin'

R packages: data.table, haven, dplyr, stringr, tidyr, stringi

Py packages: sklearn, pandas, numpy, joblib
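To avoid re-discovering these missing files one at a time, a small helper can check every expected dependency up front. A sketch (the file list is copied from the errors above; the exact repo layout is an assumption):

```python
from pathlib import Path

# Hypothetical map of scripts to the data files they load, taken from the
# error messages above; adjust the paths to match a fresh clone.
REQUIRED = {
    "11_fb18.R": ["data/all_features_fb.RData"],
    "32_issue_clf_inference_binary_rf.py": ["models/binary_rf_v1/issues_rf_ISSUE10.joblib"],
    "52_abortion_inflation_keyword_vs_randomforest.py": ["data/tv_asr_keyword_clf.csv"],
}

def missing_files(required: dict, root: str = ".") -> dict:
    """Return, per script, the required files that do not exist under root."""
    root_path = Path(root)
    out = {}
    for script, files in required.items():
        gone = [f for f in files if not (root_path / f).exists()]
        if gone:
            out[script] = gone
    return out
```

Running `missing_files(REQUIRED)` from the repo root would print exactly which dependencies still need to be downloaded from the drive.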

IneshV commented 1 year ago

fb_2022 audit: I was able to run most files smoothly, but a few files were missing. Here are my notes:

In cell 2 of 01_prepare_media_data/01_gen_data_file_info.ipynb, it reads video_path = '/data/1/wesmediafowler/projects/AdMedia/FB/video', but this folder is not on the drive.

In 01_prepare_media_data/02_prepare_vid_asr_extract_audio_b03202023.py, it reads path_mp4 = "../data/video/b03202023/mp4_uni_c/" and path_wav = "../data/video/b03202023/wav_uni_c/", but these are not on the drive.

01_prepare_media_data/04_prepare_vid_face_trim_general.py says os.chdir("../data/video/general/mp4_uni"), but this folder does not exist on the drive.

In 02/master_files/fb2022_master_09052022_11082022, I got the error "Failed to connect to database: Error: Can't connect to local MySQL server through socket '/tmp/mysql.sock' (2)". I am not sure if this is a personal issue or a code issue.

In 03_process_asr/02_clean_asr_fb2022_b03202023.ipynb, it has to call the files path_mp4 = '/home/jyao01/github/fb_2022/data/video/b03202023/mp4_uni_c/' and path_wav = '/home/jyao01/github/fb_2022/data/video/b03202023/wav_uni_c/', but these are not on the Google Drive.

In cell 16 of 03_process_asr/03_assemble_asr_09052022_11082022.ipynb, it reads asr = pd.read_csv("../../datasets/facebook/asr_all/result_asr_checksum_fb2022_03202023.csv"), but this file is not on GitHub, and the Google Drive folder is empty: https://drive.google.com/drive/u/0/folders/11IqHiUJh8Oe4obJBlG9_TfDhIKKTH3hk

In cell 13 of 04_process_aws/ocr_label/01_aws_vid_ocr_fb2022_09052022_11082022.ipynb, it reads ocr = pd.read_csv("../../../datasets/facebook/ocr_all/result_ocr_video_checksum_fb2022_03222023.csv"), but this folder is empty on the Google Drive: https://drive.google.com/drive/u/0/folders/1RBn_EemG1jKRBdDs9mMzfTHyMeeE88WF

Cell 2 of 04_process_aws/ocr_label/01_aws_vid_ocr_fb2022_cleanup.ipynb reads r = pd.read_csv("./output_combined_fb2022_b03222023_text/ocr_vid_fb2022_b03222023.csv"), but this file is not on the drive.
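On the mysql.sock error: that message means the MySQL client tried a local Unix socket and found no server listening, so it is most likely an environment issue (no local MySQL running) rather than a code bug. A quick, hypothetical way to check whether a MySQL server is reachable at all:

```python
import socket

def tcp_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# MySQL listens on port 3306 by default; if this returns False, no server
# is running locally and the notebook cannot connect regardless of the code.
# tcp_reachable("127.0.0.1", 3306)
```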

candace-walker commented 1 year ago

I have looked over google_2022 and checked file dependencies:

Notes:

File: 01_gen_file_info.ipynb missing -> /data/1/wesmediafowler/projects/AdMedia/Google/video, google2022_video_info.csv, /data/1/wesmediafowler/projects/AdMedia/Google/image

File: 02_masterfile_01062021_11082022.ipynb missing -> ../data/master/g2022, g2022_archive_01062021_11082022.csv

atlasharry commented 1 year ago

I have looked through fb_2020 repo.

1: In cell [14] of '01_fb_2020_118m_adid_text.ipynb', the code reads the file 'pacss_ads_080421.csv' from the path '../datasets/master_files/pacss_ads_080421.csv'. There are no 'datasets' or 'master_files' folders in this repo. To run the code, manually create a datasets/master_files folder, download 'pacss_ads_080421.csv' from Google Drive, and put it in that folder.

2: To run the file '02_fb_2020_118m_adid_var.ipynb'

P.S. This file is around 4 GB, and it takes about 42 seconds to run the line archive = pd.read_csv("../datasets/master_files/fb_2020_archive_asr.csv") with an i7-11800H CPU and 16 GB RAM.

3: I have added wiki links for all the variables in the README. Some variables have no wiki page, so I left them unlinked.
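On the 4 GB read: if memory rather than time becomes the constraint, pandas can stream the file in chunks instead of loading it all at once. A sketch (the filename is the one from the note above; the chunk size is arbitrary):

```python
import pandas as pd

def count_rows_chunked(path: str, chunksize: int = 100_000) -> int:
    """Stream a large CSV in chunks so peak memory stays bounded."""
    total = 0
    for chunk in pd.read_csv(path, chunksize=chunksize):
        total += len(chunk)
    return total

# e.g. count_rows_chunked("../datasets/master_files/fb_2020_archive_asr.csv")
```

Any per-chunk aggregation (row counts, column sums, filtering to a smaller output file) works the same way inside the loop.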

atlasharry commented 1 year ago

I looked through the "fb_ad_scraper" repo. The script "fb_ad_media_scrape.py" requires libraries such as "sqlalchemy", "selenium", and "pandas", which can be installed with "pip install". I encountered some unknown issues during the installation that took me a long time to figure out. They seem to have been caused by my active VPN: the issues disappeared after I disconnected the VPN and used my home internet. However, the download speed was then significantly slow (I think because I am in China). The average speed was only about 5 KB/s, and it took around 2 hours to download the libraries, a significant deviation from the typical installation duration.

The script needs access to an SQL database and other services like BigQuery (e.g., line 42: db_connection = create_engine(db_connection_str).connect()), so I was not able to run the entire code. But reading through all of it, I found the code well organized. Furthermore, the README is also clear and comprehensive.
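For what it's worth, the create_engine pattern can be smoke-tested without the real credentials by swapping in an in-memory SQLite URL. This is a stand-in, not the project's actual connection string:

```python
from sqlalchemy import create_engine, text

# Stand-in URL; the real db_connection_str presumably looks something like
# "mysql+pymysql://user:password@host/dbname" (hypothetical).
db_connection_str = "sqlite:///:memory:"

# Same .connect() pattern as line 42 of fb_ad_media_scrape.py.
db_connection = create_engine(db_connection_str).connect()
result = db_connection.execute(text("SELECT 1")).scalar()
db_connection.close()
```

This at least verifies the SQLAlchemy installation and the surrounding code path before pointing it at the real database.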

SebastianZimmeck commented 1 year ago

The progress is tracked in a Google Sheet, which is the authoritative source of truth.

SebastianZimmeck commented 1 year ago

The repos that we currently have are audited.

candace-walker commented 1 year ago

[UPDATED] Reopening this issue to add comments for the audit of the fb_2022 repo.

Notes: fb_2022 needs these repositories in order to run: race_of_focus, ad_tone, ad_goal_classifier, party_classifier_pdid, datasets, and entity_linking_2022. It also needs the Google Drive downloaded to the computer under the path "/content/drive/Shareddrives/Delta Lab/github/..."; the needed files are in /fb_2022/data and in /fb_2022.

Below I have outlined missing files with their paths in each section of the fb_2022 folders:

01- "/datasets/facebook/asr_all/result_asr_checksum_fb2022.csv", "/datasets/facebook/ocr_vid_all/result_ocr_video_checksum_fb2022.csv", "/04_process_aws/ocr_label/ocr_fb2022_b03222023_master.csv", "/data/image/img_fb2022_general_master.csv", "/data/image/img_fb2020_general_uni_upload_aws.sh", "04_process_aws/face_recognition/vid/face_vid_fb2022_general_master.csv", "/data/image/img_fb2022_general_master.csv"

03- "/datasets/facebook/asr_all/result_asr_checksum_fb2022.csv", "/datasets/facebook/asr_all/result_asr_checksum_fb2022_03202023.csv", "/home/jyao01/github/fb_2022/03_process_asr/result_asr_fb2022_09052022_11082022.csv"

04- "ocr_vid_fb2022_b03222023.csv", "output_combined_fb2022_b03222023_text/ocr_vid_fb2022_b03222023.csv", "datasets/facebook/ocr_all/result_ocr_video_checksum_fb2022.csv", "/datasets/facebook/ocr_all/result_ocr_video_checksum_fb2022_03222023.csv", "/04_process_aws/result_ocr_vid_fb2022_09052022_11082022.csv"

05- "/04_process_aws/face_recognition_ocr_img/vid_face/result_face_vid_fb2022_general.csv", "/data/image/img_fb2022_general_master.csv"

06- "similarity_lsh_fb2022.csv"

candace-walker commented 1 year ago

Screenshots: "Screen Shot 2023-09-18 at 2 58 04 PM" and "Screen Shot 2023-09-18 at 2 58 18 PM"

These are the files on the drive that are needed in the repo.

SebastianZimmeck commented 1 year ago

@pvo70 will look into finding these files.

SebastianZimmeck commented 1 year ago

Closing this issue as we have #9 now (let's move away from these omnibus issues in the future and open more specific issues).

candace-walker commented 1 year ago

I will close this and open this issue in fb_2022.