Closed furkmak closed 4 months ago
Hi Furkan @furkmak, here is an update on the repo Party Classifier with Unique ID; I hope it helps. In this repo, it seems the user needs to download these files from the drive:
- `party_clf_pdid_mnb.joblib` (the trained model weights, too large to upload to GitHub; located at `/content/drive/Shareddrives/Delta Lab/github/party_classifier_pdid/party_clf_pdid_mnb.joblib`)
- `fb_2020_140m_adid_text_clean.csv.gz` (the code indicates this data should be located under repo fb2020)
- `fb_2020_140m_adid_var1.csv.gz` (the code indicates this data should be located under repo fb2020)
- `fb_2022_adid_text_clean.csv.gz` (the code indicates this data should be located under repo fb2022)
- `fb_2022_adid_var1.csv.gz` (the code indicates this data should be located under repo fb2022)
Thank you @atlasharry. Could you also put these into the spreadsheet? You can create a new row for each script and list the mentions. You can follow my build and ask if you have any questions.
Yes! I will follow the template and fill it in.
Here is the list of repos we will work on:
Hi Harry @atlasharry,
You may replace the Facebook 2022 and Google 2022 repos with the repos below:
- [ ] image-video-data-preparation
- [ ] Automatic speech recognition
- [ ] AWS-rekognition-image-video-processing
- [ ] Data post-production
Thank you!
Got it! Thanks!
Currently, with fb_2022 and google_2022 being split up into the new repo data-post-production, some files that previously relied on fb_2022 and google_2022 don't yet have a valid directory path in data-post-production. I added the parent path `data-post-production/` before each of those files to identify them. For example, the line in the code `path_input_data <- "../google_2022/google2022_adidlevel_text.csv"` will now be `path_input_data <- "../data-post-production/google_2022/google2022_adidlevel_text.csv"`.
Later, when all the files are successfully transferred and the Figshare files are available, we can change the file paths accordingly.
Thanks Harry, good point! Regarding changes to file paths, though, here is what I replied in a group chat with Harry and Aleks after they reached out: users will have different file paths than we do unless they git clone all our repos, and even then some files, such as `../google_2022/google2022_adidlevel_text.csv`, will be on Figshare/Google Drive and not in the specified GitHub repo. Therefore, even after we change the directory to `data-post-production`, I would still consider these our local paths, which won't apply to the users.
Alternatively, I suggest we add a comment above the data import lines indicating where the tables come from (their upstream repos and scripts) and remove the (local) directory information.
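As a rough illustration of that suggestion, here is a minimal Python sketch that rewrites one R-style import line so it keeps only the file name and gains a provenance comment. The function name `annotate_import`, the comment wording, and the rule that the first non-`..` directory in the path is the upstream repo are all assumptions made for this example, not an agreed convention. The upstream script name cannot be inferred from the path and would still need to be filled in by hand.

```python
import re
from pathlib import PurePosixPath

# Matches an R-style assignment of a quoted path, e.g. x <- "../repo/file.csv"
ASSIGN_RE = re.compile(r'^(?P<lhs>\s*\w+\s*<-\s*)"(?P<path>[^"]+)"\s*$')


def annotate_import(line):
    """Drop the local directory from an import line and add a source comment.

    Lines that do not look like a quoted-path assignment are returned unchanged.
    """
    match = ASSIGN_RE.match(line)
    if match is None:
        return line
    path = PurePosixPath(match.group("path"))
    # Assumption: the first directory that is not "." or ".." names the upstream repo.
    dirs = [part for part in path.parent.parts if part not in (".", "..")]
    source = dirs[0] if dirs else "unknown"
    comment = f"# Input: {path.name} (from upstream repo: {source})"
    return f'{comment}\n{match.group("lhs")}"{path.name}"'


if __name__ == "__main__":
    print(annotate_import(
        'path_input_data <- "../google_2022/google2022_adidlevel_text.csv"'
    ))
```

The `#` comment syntax is valid in both R and Python, so the same output format would work for scripts in either language.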
In addition to Harry's changes, as we discussed in the meeting yesterday, I will add a tab to the Google Drive Mentions sheet that lays out old file paths/names and their updated ones, where applicable. Updating the sheet has been very confusing because repos and scripts use a mix of outdated and updated files and file paths, so the idea is to have a reference that can be checked when this confusion occurs.
This issue is to help us track all Google Drive mentions in scripts and Readme files so that we can replace them later with their Figshare equivalent. Here is a link to a spreadsheet that we can use to track: https://docs.google.com/spreadsheets/d/1blCmGz7mBOIrch1gL4z0xMcEGlZHOsdc09PmgLW9qZg/edit?usp=sharing
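To help fill in the tracking sheet, a first pass over a cloned repo could be automated. The following is a minimal sketch, not the process used in this issue: the function name `find_drive_mentions`, the search patterns, and the list of file extensions are assumptions and would need adjusting to each repo. It reports the file, line number, and line text of every line that looks like a Google Drive reference.

```python
import re
from pathlib import Path

# Patterns that suggest a Google Drive dependency (assumed; extend as needed).
DRIVE_PATTERNS = re.compile(
    r"drive\.google\.com|docs\.google\.com|/content/drive/|Shareddrives",
    re.IGNORECASE,
)

# File types to scan (assumed): scripts, notebooks, and Readme files.
SCAN_SUFFIXES = {".py", ".r", ".rmd", ".ipynb", ".md", ".sh"}


def find_drive_mentions(repo_root):
    """Return (relative path, line number, line text) for each matching line."""
    root = Path(repo_root)
    mentions = []
    for path in sorted(root.rglob("*")):
        relative = path.relative_to(root)
        # Skip version-control internals and anything that is not a scanned file type.
        if ".git" in relative.parts or not path.is_file():
            continue
        if path.suffix.lower() not in SCAN_SUFFIXES:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if DRIVE_PATTERNS.search(line):
                mentions.append((relative.as_posix(), lineno, line.strip()))
    return mentions


if __name__ == "__main__":
    for rel_path, lineno, text in find_drive_mentions("."):
        print(f"{rel_path}:{lineno}: {text}")
```

Each reported line could then become one entry under the row for that script in the spreadsheet. Relative paths such as `../google_2022/...` will not match these patterns, so data files that live on the drive but are referenced by local path still need a manual check.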