TZstatsADS / Spr2017-proj4-team6

Spr2017-proj4-team6 created by GitHub Classroom

Load and process the data #2

Open galen211 opened 7 years ago

galen211 commented 7 years ago

For each record in the dataset, we want to extract several fields and store them in a regular form: canonical author id, coauthors, paper title, and publication venue title. You may need to find regular-expression matches in the input string vectors using regex in R. Here is a tutorial on regular expressions in R that might help: https://rstudio-pubs-static.s3.amazonaws.com/74603_76cd14d5983f47408fdf0b323550b846.html
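As a rough illustration of that extraction step, here is a minimal sketch in Python (not the project's R code). The record layout, the `<>` and `;` separators, and the names `RECORD_PATTERN` and `parse_record` are all assumptions made for the example; the real input files may be laid out differently.

```python
import re

# Hypothetical record layout (an assumption, not taken from the dataset):
#   "<author_id>_<paper_no> <coauthor>;<coauthor>;...<>paper title<>venue title"
RECORD_PATTERN = re.compile(
    r"^(?P<author_id>\d+)_(?P<paper_no>\d+)\s+"
    r"(?P<coauthors>.*?)<>(?P<title>.*?)<>(?P<venue>.*)$"
)


def parse_record(line):
    """Split one raw record into the four fields named in the issue.

    Returns None when the line does not follow the assumed layout,
    so malformed records can be inspected separately.
    """
    match = RECORD_PATTERN.match(line.strip())
    if match is None:
        return None
    # Coauthors are assumed to be separated by semicolons.
    coauthors = [
        name.strip()
        for name in match.group("coauthors").split(";")
        if name.strip()
    ]
    return {
        "author_id": int(match.group("author_id")),
        "paper_no": int(match.group("paper_no")),
        "coauthors": coauthors,
        "title": match.group("title").strip(),
        "venue": match.group("venue").strip(),
    }
```

Returning `None` for lines that do not match keeps bad records visible instead of silently producing half-filled rows.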

amandazhang commented 7 years ago

Hey @galen211, I've finished this part and uploaded 14 data files to the 'output' folder. Although I already checked all the datasets I generated, it would still be good to have someone double-check them for problems.

galen211 commented 7 years ago

Just checked @amandazhang. It looks good to me. Thanks!

amandazhang commented 7 years ago

Follow-up: the output data now looks exactly the same as shown in the wk10 tutorial (image), which means I deleted the column named Author$PaperID. So in the main.rmd file there should be a minor change at (image), where ids is replaced by ids = nrow(AKumar). @galen211

amandazhang commented 7 years ago

New issue spotted: (image) As shown in the image, S Y Cheung and Shun Yan Cheung should both refer to the same person, and the same goes for R Hull and Richard Hull. Maybe a better way to represent authors and coauthors is to use their first initial and last name? @VirgileACM @galen211
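A minimal Python sketch of the "first initial and last name" idea, using the names from this comment. The function name `initial_lastname_key` is hypothetical, and the sketch assumes the last whitespace-separated token is the family name, which will not hold for every name.

```python
def initial_lastname_key(name):
    """Reduce a name to 'first initial + last name', lowercased.

    Assumes the last whitespace-separated token is the family name.
    """
    parts = name.split()
    if not parts:
        return ""
    if len(parts) == 1:
        # Only one token: nothing to abbreviate.
        return parts[0].lower()
    return (parts[0][0] + " " + parts[-1]).lower()


names = ["S Y Cheung", "Shun Yan Cheung", "R Hull", "Richard Hull"]
keys = {name: initial_lastname_key(name) for name in names}
```

With this key, both spellings of each author map to the same string. The trade-off is that two different people who share a first initial and last name would also be merged.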

galen211 commented 7 years ago

Actually, I think for the TF-IDF, the best thing to do is collapse their name into one string without spaces.
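A small Python sketch of what collapsing could look like before the names are handed to a TF-IDF step. The helper names `collapse_name` and `coauthor_document` are hypothetical, and plain whitespace splitting stands in for whatever tokenizer the TF-IDF code actually uses.

```python
def collapse_name(name):
    """Join all parts of a name into one token with no spaces."""
    return "".join(name.split())


def coauthor_document(coauthors):
    """Build one whitespace-separated 'document' of collapsed names.

    Each coauthor becomes a single term, so a whitespace tokenizer
    no longer counts 'S', 'Y' and 'Cheung' as three unrelated words.
    """
    return " ".join(collapse_name(name) for name in coauthors)


doc = coauthor_document(["S Y Cheung", "R Hull"])
tokens = doc.split()
```

Collapsing alone does not merge different spellings: "S Y Cheung" becomes "SYCheung" while "Shun Yan Cheung" becomes "ShunYanCheung", so some normalization of the name is still needed if those should count as the same term.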

On Apr 7, 2017, at 10:37 AM, Qingyuan Zhang notifications@github.com wrote:

New issue spotted: https://cloud.githubusercontent.com/assets/10666254/24804930/06ca5e12-1b7e-11e7-816e-f60c816fddfd.png As is shown in the image, S Y Cheung and Shun Yan Cheung should all refer to the same person, maybe a better way to represent authors and coauthors is to use their initials?

amandazhang commented 7 years ago

Data part is done!