Open galen211 opened 7 years ago
Hey @galen211, I've finished this part and uploaded 14 data files under the 'output' folder. Although I already checked all the datasets I generated, it would still be good to have a double check in case there's any problem with them.
Just checked @amandazhang. It looks good to me. Thanks!
Follow-up: the output data now looks exactly the same as shown in the wk10 tutorial, which means I deleted the column named Author$PaperID. So main.rmd needs a minor change: ids should be replaced by ids = nrow(AKumar).
@galen211
New issue spotted: As shown in the image (https://cloud.githubusercontent.com/assets/10666254/24804930/06ca5e12-1b7e-11e7-816e-f60c816fddfd.png), S Y Cheung and Shun Yan Cheung refer to the same person, and so do R Hull and Richard Hull. Maybe a better way to represent authors and coauthors is to use their first initial and last name? @VirgileACM @galen211
Actually, I think for the TF-IDF, the best thing to do is collapse their name into one string without spaces.
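To make the two suggestions concrete, here is a minimal sketch in R that reduces a name to first initial plus last name and then collapses it into one token without spaces. The helpers normalize_name() and collapse_name() are hypothetical names for illustration and are not part of the project code. One caveat under this scheme: two different people who share a first initial and last name would be merged into the same token.

```r
# Sketch only: normalize_name() and collapse_name() are hypothetical helpers,
# not functions from the project.

# Reduce a name to "<first initial> <last name>",
# e.g. "Shun Yan Cheung" and "S Y Cheung" both become "S Cheung".
normalize_name <- function(name) {
  parts <- strsplit(trimws(name), "[[:space:]]+")[[1]]
  if (length(parts) < 2) return(name)  # single token: leave unchanged
  first_initial <- toupper(substr(parts[1], 1, 1))
  last_name <- parts[length(parts)]
  paste(first_initial, last_name)
}

# Collapse the normalized name into one string without spaces,
# so that a tokenizer used for TF-IDF keeps the name as a single term.
collapse_name <- function(name) {
  gsub("[[:space:]]+", "", normalize_name(name))
}

authors <- c("S Y Cheung", "Shun Yan Cheung", "R Hull", "Richard Hull")
collapsed <- vapply(authors, collapse_name, character(1), USE.NAMES = FALSE)
print(collapsed)
```

With this, both spellings of each author map to the same token, so they would count as the same term in a document-term matrix.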
Data part is done!
For each record in the dataset, we want to extract the following information and store it in a regular form: canonical author id, coauthors, paper title, and publication venue title. You may need to find regular-expression matches in the input string vectors by using regex in R. Here is a tutorial on regular expressions in R that might help: https://rstudio-pubs-static.s3.amazonaws.com/74603_76cd14d5983f47408fdf0b323550b846.html
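As a rough illustration of the extraction described above, the sketch below parses one record with base R string functions. The record layout is an assumption made for this example and is not confirmed anywhere in this thread: an "authorID_paperID" prefix, a semicolon-separated coauthor list, then the paper title and venue separated by "<>". The function name parse_record() and the sample record are made up as well, so adjust the patterns to the actual files.

```r
# Assumed (hypothetical) record layout, not confirmed by this thread:
#   "<authorID>_<paperID> <coauthor>;<coauthor>;...<>Paper title<>Venue title"
parse_record <- function(record) {
  # Split on the literal "<>" separator (fixed = TRUE disables regex parsing).
  fields <- strsplit(record, "<>", fixed = TRUE)[[1]]
  head_part <- fields[1]

  # Capture the two ids and the remaining coauthor string.
  pattern <- "^([0-9]+)_([0-9]+)[[:space:]]+(.*)$"
  m <- regmatches(head_part, regexec(pattern, head_part))[[1]]

  list(
    author_id = as.integer(m[2]),
    paper_id  = as.integer(m[3]),
    coauthors = trimws(strsplit(m[4], ";", fixed = TRUE)[[1]]),
    title     = trimws(fields[2]),
    venue     = trimws(fields[3])
  )
}

# Made-up record, for illustration only.
rec <- "3_12 S Y Cheung;R Hull<>A made-up paper title<>A made-up venue"
parsed <- parse_record(rec)
str(parsed)
```

Applying parse_record() over all lines with lapply() and binding the results would give one row per record; whether the coauthors stay as a list column or get pasted back into one string depends on what the later steps expect.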