kstapelfeldt opened this issue 4 years ago
@RaiyanRahman will provide sample output from the domain crawler based on what he already has. @danhuacai will work with the Twitter output to start.
Notes on a matching algorithm:
https://drive.google.com/file/d/1bzsWzckV03JtGM7QT1WO6fhNA2X6jDsq/view?usp=sharing
One big Twitter output file.
@danhuacai has added a column for URL and opened a pull request.
Went through a tutorial with @danhuacai on the post-processing; she will go through the code to understand it first, and then we will split up tasks.
https://databricks.com/glossary/pyspark
PySpark might be helpful for handling the huge amount of data once we get output from the post-processor.
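For reference, a minimal PySpark sketch along these lines, assuming the post-processor output is a set of CSVs with columns named `domain` and `url` (the path and column names are illustrative, not the actual schema):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("post-processor-aggregation").getOrCreate()

# Read all post-processor output CSVs at once; Spark handles the partitioning.
df = spark.read.csv("output/*.csv", header=True, inferSchema=True)

# Example aggregation: count links per domain and sort by volume,
# similar to the "sort by most links" idea for the interest output.
counts = (
    df.groupBy("domain")
      .agg(F.count("url").alias("link_count"))
      .orderBy(F.desc("link_count"))
)
counts.write.csv("aggregated_output", header=True, mode="overwrite")
```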
Item 2 is done. The logic for item 1 is there but needs to be tested against real data. The final code is currently on the post-processor branch; we will wait until we can test against crawled Twitter data to confirm before pushing.
@amygaoo will make a change to accept the small .csvs, and then @danhuacai will test it.
@amygaoo has passed the code along and @danhuacai is running it in a virtual machine, but it has not finished yet. We will know more once it returns output. It has been running for ~30 hours on a partial data set. @danhuacai needs to provide the number of files processed in this period, as well as how the VM is provisioned, so that we can benchmark approximately how long post-processing takes.
We did not get this benchmarking done; it still needs to be completed. Amy needs to add more logic to the 'interest output' to sort it by the most links.
@jacqueline-chan and @amygaoo met last night and are trying to find mini .csvs and run the post-processor. They are not at today's meeting, so this process is pending.
Amy ran the post-processor on 10 users; it finished in 2 days. We ran into an issue running the full output because we are still encountering poorly formed .csvs even after running them through Danhua's mini processor. For now, Amy ran this while skipping all malformed records (only about 20). The post-processor is still running and has gone through 600,000 of 5 million records since Saturday/Sunday.
@amygaoo will continue to try to find out where the errors in .csv creation are being introduced so we can resolve the issue.
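A small sketch of one way to locate where the malformed rows appear, so they can be traced back to where they are introduced rather than silently skipped. The directory name and expected column count are assumptions for illustration:

```python
import csv
import glob

EXPECTED_COLUMNS = 6  # assumption: replace with the real schema's column count

for path in glob.glob("twitter_output/*.csv"):
    with open(path, newline="", encoding="utf-8", errors="replace") as f:
        reader = csv.reader(f)
        for row_no, row in enumerate(reader, start=1):
            if len(row) != EXPECTED_COLUMNS:
                # Report the file and row so the source of the bad record
                # can be investigated upstream in the .csv creation step.
                print(f"{path}: row {row_no} has {len(row)} fields: {row[:3]}...")
```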
The last time she checked, it was at 1,000,000 of 5,000,000; it took a week to run one million records, but then there was a connection issue. There is a problem with speed, but also with the inability to pick up after the process is terminated (by things like connection problems).
Suggestions:
Top priority: Make the process more robust (pick up after a break). Second priority: Make it faster.
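A minimal sketch of what "pick up after a break" could look like, assuming records are read line by line from a single input file; the file name, checkpoint path, and `process_record()` are placeholders, not the actual post-processor code:

```python
import os

CHECKPOINT = "postprocess.checkpoint"

def load_checkpoint():
    # Return the index of the next record to process (0 on a fresh run).
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return int(f.read().strip() or 0)
    return 0

def save_checkpoint(next_index):
    # Write atomically so a crash mid-write cannot corrupt the checkpoint.
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "w") as f:
        f.write(str(next_index))
    os.replace(tmp, CHECKPOINT)

def process_record(line):
    pass  # placeholder for the real post-processing logic

start = load_checkpoint()
with open("twitter_output.csv") as f:
    for i, line in enumerate(f):
        if i < start:
            continue  # skip records already handled before the interruption
        process_record(line)
        if (i + 1) % 10_000 == 0:
            save_checkpoint(i + 1)  # resume from the next unprocessed record
```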
TODO: All tweets by @Marianhouk get a hit if someone mentions @Marianhouk, so many tweets from a single Twitter handle end up with the same number of hits. Proposed solution:
@amygaoo completed the refactor for the modified output. It was tested on small test data and appears to work. Text aliases are in list format. In-code documentation is complete. Right now, for newly created nodes
Modifying the post-processor framework to operate more quickly (optimize)
Close to complete, but it still needs to be tested.
Made modifications to periodically remove duplicates from the referrals list during execution, rather than only at the end, to save memory. This ran successfully on a smaller scope and is currently running with the full scope.
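A rough sketch of the periodic-dedup idea, with illustrative names rather than the actual post-processor structures (the compaction interval and the record stream are assumptions):

```python
DEDUP_EVERY = 50_000  # assumption: how often to compact the list

def stream_of_referrals():
    # Placeholder: yield referral URLs from the crawler output.
    yield from ["https://a.example", "https://b.example", "https://a.example"]

referrals = []
for i, url in enumerate(stream_of_referrals()):
    referrals.append(url)
    # Compact periodically instead of once at the end, to keep memory bounded.
    if (i + 1) % DEDUP_EVERY == 0:
        referrals = list(dict.fromkeys(referrals))  # order-preserving dedup

referrals = list(dict.fromkeys(referrals))  # final pass after the stream ends
```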
Given the output files of the domain-crawler and the anticipated output files of the twitter-crawler, how do we parse/transform the data into our output format (JSON/CSV)? This needs to be done in such a way that we can continue to add rules or modifications to the framework as needed, to address things like filtering out non-news or homepage content. One possible shape for such a rule-based layer is sketched below.
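The rule functions and record fields (`url`, `title`) here are assumptions for illustration; the point is that new filters can be appended without touching the parsing code:

```python
import json
from urllib.parse import urlparse

def is_not_homepage(record):
    # Drop records whose URL path is just "/" (likely a homepage, not an article).
    return urlparse(record["url"]).path not in ("", "/")

def has_title(record):
    return bool(record.get("title"))

RULES = [is_not_homepage, has_title]  # append new rules here as they are written

def transform(records):
    for record in records:
        if all(rule(record) for rule in RULES):
            yield record

if __name__ == "__main__":
    sample = [
        {"url": "https://news.example.com/", "title": "Home"},
        {"url": "https://news.example.com/2021/01/story", "title": "A story"},
    ]
    # Only the article record survives the filters; output could equally be CSV.
    print(json.dumps(list(transform(sample)), indent=2))
```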