@IneshV, which repos have you audited so far? Are those all good to go? It would be great if you could maintain a list here in an issue comment with links to the repo and whether they are good to go.
@markusneumann, @jyao01, @pvo70, could you document your repos and post the links here when they are ready for @IneshV 's review?
I have been looking at issue_classifier, ABSA, race_of_focus, party_classifier, party_classifier_ldid, ad_tone, ad_goal_classifier, entity_linking, and entity_linking_2022. I think I may have rushed through them, and I would like to go through each one again before giving the green light. I have sent a few questions to Markus, and I should be able to finalize these repos by Friday.
Thanks, @IneshV!
Please update this comment on a regular basis to keep a list of the reviewed repos and their current status.
As we discussed, please ask repo-specific questions in issues of the respective repos and broader questions relating to multiple repos here.
Hi, I just finished looking at ABSA.
In the inference folder (https://github.com/Wesleyan-Media-Project/ABSA/tree/main/inference), I was able to run 01_prepare_fb_2022.R, but 01_prepare_fb_2020.R was not working for me. The error came from loading the file on line 11, but I think the path was right. I thought this might be because the file in ABSA's data folder is actually called "intermediate_separate_generic_absa_training_data.rdata", but it still was not able to load. I haven't used the load function frequently in R, so I'm not sure what the error was. Because this file was not working, I was unable to run 02_rf_140m.py.
I was unable to run anything in the train folder. Line 10 of 01_prepare_separate_generic_absa.R (https://github.com/Wesleyan-Media-Project/ABSA/blob/main/train/01_prepare_separate_generic_absa.R) calls a file that is not in the fb_2022 folder. The second script in this folder worked well.
@IneshV face_url_scraper_2022 is ready for you to review. Let me know if you encounter any problems. Thanks!
@IneshV, first, you should always run the training before the inference.
On the inference, please provide the error message, otherwise I can't know what the error is. I ran this yesterday on a computer I had just freshly cloned the repo to, and it worked for me.
On the training, that file is on the Google Drive because it is large.
Got it, I'll check it again and let you know if I get an error message.
In line 17 of race_of_focus_140m.R, it references the file ../entity_linking/facebook/data/entity_linking_results_140m_notext_combined.csv.gz, which is not on GitHub or the drive.
Also, line 18 of race_of_focus_google_2020.R reads path_wmpent <- "../datasets/wmp_entity_files/Google/2020/wmp_google_entities_v040521.dta". This file was not on GitHub, and I could not find it in the datasets folder on the drive.
Both of those files are, and already were, on Github. Did you pull the most recent version?
I have run party_classifier this week. I haven't encountered any issues running the code, and everything in this repo is well organized.
Dependencies: requirements_py.txt, requirements_r.txt
Some scripts require files from the Delta Lab Google Drive to run:
- fb_2020_140m_adid_text_clean.csv.gz
- fb_2020_140m_adid_var1.csv.gz
I have looked over face_url_scraper_2022 and checked all file dependencies.
Warnings (these don't halt the code, but I would like to point them out):
- 08_face_url_final_selection.ipynb, 02_ballotpedia_scaper_house.ipynb, 06_sitting_senator_scrapers.ipynb
- 01_ballotpedia_scaper_senate.ipynb, 03_ballotpedia_scaper_gov.ipynb, 06_sitting_senator_scrapers.ipynb
- 03_ballotpedia_scaper_gov.ipynb
Questions:
- 02_ballotpedia_scaper_house_cleanup.ipynb: small documentation mistake ("Load the data cleaned by by RA Jasmine" => "Load the data cleaned by RA Jasmine")?
- 03_ballotpedia_scaper_gov.ipynb: only uses ./"file name"

issue_classifier auditing: https://github.com/Wesleyan-Media-Project/issue_classifier
In line 15 of 11_fb18.R, it calls load("data/all_features_fb.RData"), but this file is not in the data folder. In line 12 of 13_combine_18_20, it says fb20 <- fread("data/fbel_prepared_issues.csv", encoding = "UTF-8"), but this file does not exist.
In line 35 of 32_issue_clf_inference_binary_rf.py: No such file or directory: 'models/binary_rf_v1/issues_rf_ISSUE10.joblib'
In line 15 of 52_abortion_inflation_keyword_vs_randomforest.py: No such file or directory: 'data/tv_asr_keyword_clf.csv'
In line 16 of 53_issue_clf_inference_binary_rf_abortion_inflation.py: No such file or directory: 'data/inference/fb2022.csv.gz'
In inference_multilabel_trf_v1.ipynb, torch was not installed, and there is a No such file or directory: 'models/multilabel_trf_v1/pytorch_model.bin'
R packages: data.table, haven, dplyr, stringr, tidyr, stringi
Python packages: sklearn, pandas, numpy, joblib
fb_2022 audit: I was able to run most files smoothly, but a few files were missing. Here are my notes:
In cell 2 of 01_prepare_media_data/01_gen_data_file_info.ipynb, it reads video_path = '/data/1/wesmediafowler/projects/AdMedia/FB/video', but this folder is not on the drive.
01_prepare_media_data/02_prepare_vid_asr_extract_audio_b03202023.py reads path_mp4 = "../data/video/b03202023/mp4_uni_c/" and path_wav = "../data/video/b03202023/wav_uni_c/", but these are not on the drive.
01_prepare_media_data/04_prepare_vid_face_trim_general.py says os.chdir("../data/video/general/mp4_uni"), but this directory does not exist on the drive.
In 02/master_files/fb2022_master_09052022_11082022, I got the error "Failed to connect to database: Error: Can't connect to local MySQL server through socket '/tmp/mysql.sock' (2)". I am not sure if this is a personal issue or a code issue.
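One way to narrow this down is to check whether a MySQL server is reachable at all. Below is a minimal connectivity sketch, assuming the notebook connects through sqlalchemy with the pymysql driver; the host, port, credentials, and database name are placeholders, not the project's actual values:

```python
# Minimal MySQL connectivity check (sketch, not project code).
# Assumes sqlalchemy + pymysql are installed; all credentials below are
# placeholders. Connecting over TCP (127.0.0.1) instead of the default
# UNIX socket (/tmp/mysql.sock) often clears the "Can't connect to local
# MySQL server through socket" error when the server is running but
# listening elsewhere.
from sqlalchemy import create_engine, text

engine = create_engine("mysql+pymysql://user:password@127.0.0.1:3306/dbname")

with engine.connect() as conn:
    # Prints 1 if a server is reachable; an OperationalError here means the
    # problem is the local MySQL setup, not the repo's code.
    print(conn.execute(text("SELECT 1")).scalar())
```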
In 03_process_asr/02_clean_asr_fb2022_b03202023.ipynb, it has to call the files path_mp4 = '/home/jyao01/github/fb_2022/data/video/b03202023/mp4_uni_c/' and path_wav = '/home/jyao01/github/fb_2022/data/video/b03202023/wav_uni_c/', but these are not on the Google Drive.
In cell 16 of 03_process_asr/03_assemble_asr_09052022_11082022.ipynb, it reads asr = pd.read_csv("../../datasets/facebook/asr_all/result_asr_checksum_fb2022_03202023.csv"), but this file is not on GitHub, and the gdrive folder is empty: https://drive.google.com/drive/u/0/folders/11IqHiUJh8Oe4obJBlG9_TfDhIKKTH3hk
In cell 13 of 04_process_aws/ocr_label/01_aws_vid_ocr_fb2022_09052022_11082022.ipynb, it reads ocr = pd.read_csv("../../../datasets/facebook/ocr_all/result_ocr_video_checksum_fb2022_03222023.csv"), but this folder is empty in the gdrive: https://drive.google.com/drive/u/0/folders/1RBn_EemG1jKRBdDs9mMzfTHyMeeE88WF
Cell 2 of 04_process_aws/ocr_label/01_aws_vid_ocr_fb2022_cleanup.ipynb reads r = pd.read_csv("./output_combined_fb2022_b03222023_text/ocr_vid_fb2022_b03222023.csv"), but this file is not on the drive.
I have looked over google_2022 and checked file dependencies:
Notes:
- 01_gen_file_info.ipynb: missing /data/1/wesmediafowler/projects/AdMedia/Google/video, google2022_video_info.csv, and /data/1/wesmediafowler/projects/AdMedia/Google/image
- 02_masterfile_01062021_11082022.ipynb: missing ../data/master/g2022 and g2022_archive_01062021_11082022.csv
I have looked through the fb_2020 repo.
1: In cell [14] of the file '01_fb_2020_118m_adid_text.ipynb', the code reads the file 'pacss_ads_080421.csv' from the path '../datasets/master_files/pacss_ads_080421.csv'. There are no 'datasets' and 'master_files' folders in this repo. To run the code, manually create a datasets/master_files folder, download 'pacss_ads_080421.csv' from Google Drive, and put it in that folder.
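A minimal sketch of that manual setup, assuming the notebook is run from the repo root so the relative path matches the one in the code:

```python
# Sketch: create the folder the notebook expects and check for the file.
# The relative path mirrors the one in cell [14]; adjust it if your working
# directory differs.
from pathlib import Path

target_dir = Path("../datasets/master_files")
target_dir.mkdir(parents=True, exist_ok=True)  # create datasets/master_files if missing

csv_path = target_dir / "pacss_ads_080421.csv"
if not csv_path.exists():
    print(f"Download pacss_ads_080421.csv from Google Drive into {target_dir}/")
```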
2: To run the file '02_fb_2020_118m_adid_var.ipynb':
(P.S. The file fb_2020_archive_asr.csv is around 4 GB, and it takes about 42 seconds to run the line archive = pd.read_csv("../datasets/master_files/fb_2020_archive_asr.csv") on an i7-11800H CPU with 16 GB RAM.)
Download 'fb_2020_aws.csv' from the Drive and put it at the path (creating it if necessary) '../datasets/aws_results/fb_2020_aws.csv'.
If you encounter a FileNotFoundError in the entity linking result part, try changing the code from
el = pd.read_csv("../entity_linking/facebook/data/entity_linking_results_118m_v3_500_notext_combined.csv.gz")
to
el = pd.read_csv("../../entity_linking/facebook/data/entity_linking_results_118m_v3_500_notext_combined.csv.gz")
If you encounter a FileNotFoundError in the ad goal part, try changing the code from
goal = pd.read_csv("../ad_goal_classifier/data/ad_goal_rf_fb_118m.csv.gz", compression='gzip')
to
goal = pd.read_csv("../../ad_goal_classifier/data/ad_goal_rf_fb_118m.csv.gz", compression='gzip')
Download the file 'party_all_fb_118m.csv' from the drive. If you still encounter a FileNotFoundError, try changing the code from
pclf = pd.read_csv('../party_classifier/data/facebook/party_all_fb_118m.csv')
to
pclf = pd.read_csv('../../party_classifier/data/facebook/party_all_fb_118m.csv')
Download the 'party_classifier_pdid' file from the drive and create the path '../party_classifier_pdid/party_clf_entity_fb_118m.csv'.
If you encounter a FileNotFoundError in the WMP party_all part, try changing the code from
wmp = pd.read_csv("../datasets/wmp_entity_files/Facebook/2020/wmp_fb_entities_v090622.csv", usecols=['pd_id', 'party_all'])
to
wmp = pd.read_csv("../../datasets/wmp_entity_files/Facebook/2020/wmp_fb_entities_v090622.csv", usecols=['pd_id', 'party_all'])
If you encounter a FileNotFoundError in the ad tone part, try changing the code from
tone_c = pd.read_csv('../ad_tone/data/ad_tone_constructed_fb118m.csv.gz', compression='gzip')
to
tone_c = pd.read_csv('../../ad_tone/data/ad_tone_constructed_fb118m.csv.gz', compression='gzip')
and from
tone_m = pd.read_csv('../ad_tone/data/ad_tone_mentionbased_fb118m.csv')
to
tone_m = pd.read_csv('../../ad_tone/data/ad_tone_mentionbased_fb118m.csv')
Since the same one-level-up to two-levels-up change recurs in all of these fixes, see the sketch right after these notes for a small helper that tries both locations.
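A hedged sketch of such a helper (resolve_sibling is my own name, not something in the repo): it checks both one and two levels up, so the notebooks work from either working directory.

```python
# Sketch of a path helper for the recurring ../ vs ../../ FileNotFoundError
# fixes above. resolve_sibling is a hypothetical helper, not part of the repo.
from pathlib import Path

def resolve_sibling(relpath: str) -> str:
    """Return relpath found under '../' or '../../', whichever exists."""
    for prefix in ("..", "../.."):
        candidate = Path(prefix) / relpath
        if candidate.exists():
            return str(candidate)
    raise FileNotFoundError(f"{relpath} not found one or two levels up")

# Example with one of the files above:
# import pandas as pd
# el = pd.read_csv(resolve_sibling(
#     "entity_linking/facebook/data/entity_linking_results_118m_v3_500_notext_combined.csv.gz"))
```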
The file 'fb_2020_attacklike.csv' is missing from the repo attack_like. There is no such file on Google Drive either. When I searched for this file locally on my computer with the Everything search tool, I found it in the recycle bin. Therefore, I think this file may exist in an earlier branch of attack_like. It is possible that the new file for 'fb_2020_attacklike.csv' is 'attacklike_fb2020_118m.csv'.
Maybe you can run this cell by modifying the code from
attacklike = pd.read_csv("../attack_like/fb_2020_attacklike.csv")
to
attacklike = pd.read_csv("../../attack_like/attacklike_fb2020_118m.csv")
Download the file 'race_of_focus.csv' from Google Drive and put it in the path 'race_of_focus/data/race_of_focus.csv'.
The file 'AllCands_092022.csv' is missing, and I could not locate it anywhere.
3: I have added wiki links to all the variables in the readme. Some variables have no wiki, so I left them unlinked.
I looked through the repo fb_ad_scraper. The code fb_ad_media_scrape.py requires libraries like sqlalchemy, selenium, pandas, etc., which can be installed through the pip install command. I encountered some unknown issues during the installation that took me a long time to figure out. It seems the issues were due to my VPN servers being active; they disappeared after closing the VPN and using my home internet. But the download speed was significantly slow (I think this might be because I am in China). The average speed was only about 5 KB/s, and it took around 2 hours to download the libraries, a significant deviation from the typical installation duration one would expect.
The script needs access to an SQL database and other services like BigQuery (e.g., line 42: db_connection = create_engine(db_connection_str).connect()), so I am not able to run the entire code.
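For reference, the pattern that line presumably follows, as a sketch; the dialect, credentials, and database name below are placeholders, and without the project's real connection string the script cannot run, which matches my experience:

```python
# Sketch of the sqlalchemy connection pattern from line 42. Everything in
# db_connection_str is a placeholder; the real values live with the project.
from sqlalchemy import create_engine

db_connection_str = "mysql+pymysql://user:password@host:3306/ads_db"
db_connection = create_engine(db_connection_str).connect()
```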
But reading through all the code, I found it well organized. The README is also clear and comprehensive.
The progress is kept in a Google Sheet. That is the authoritative source of truth.
The repos that we currently have are audited.
[UPDATED] Reopening this issue to add comments for the auditing of the fb_2022 repo.
Notes: fb_2022 needs these repositories in order to run: race_of_focus, ad_tone, ad_goal_classifier, party_classifier_pdid, datasets, and entity_linking_2022. It also needs the Google Drive downloaded to the computer with this path: "/content/drive/Shareddrives/Delta Lab/github/...". The drive has the needed files in /fb_2022/data and in /fb_2022.
Below I have outlined missing files with their paths in each section of the fb_2022 folders:
01:
- /datasets/facebook/asr_all/result_asr_checksum_fb2022.csv
- /datasets/facebook/ocr_vid_all/result_ocr_video_checksum_fb2022.csv
- /04_process_aws/ocr_label/ocr_fb2022_b03222023_master.csv
- /data/image/img_fb2022_general_master.csv
- /data/image/img_fb2020_general_uni_upload_aws.sh
- 04_process_aws/face_recognition/vid/face_vid_fb2022_general_master.csv

03:
- /datasets/facebook/asr_all/result_asr_checksum_fb2022.csv
- /datasets/facebook/asr_all/result_asr_checksum_fb2022_03202023.csv
- /home/jyao01/github/fb_2022/03_process_asr/result_asr_fb2022_09052022_11082022.csv

04:
- ocr_vid_fb2022_b03222023.csv
- output_combined_fb2022_b03222023_text/ocr_vid_fb2022_b03222023.csv
- datasets/facebook/ocr_all/result_ocr_video_checksum_fb2022.csv
- /datasets/facebook/ocr_all/result_ocr_video_checksum_fb2022_03222023.csv
- /04_process_aws/result_ocr_vid_fb2022_09052022_11082022.csv

05:
- /04_process_aws/face_recognition_ocr_img/vid_face/result_face_vid_fb2022_general.csv
- /data/image/img_fb2022_general_master.csv

06:
- similarity_lsh_fb2022.csv
These are the files from the drive that the repo needs.
@pvo70 will look into finding these files.
Closing this issue as we have #9 now (let's go away from these omnibus issues in the future and open more specific issues).
I will close this issue and open it in fb_2022.
@IneshV will audit repos' readmes. Let's add the links to the repos here.