kstapelfeldt opened this issue 4 years ago
@RaiyanRahman will provide sample output from the domain crawler based on what he already has. @danhuacai will work with the Twitter output to start.
Notes on a matching algorithm:
https://drive.google.com/file/d/1bzsWzckV03JtGM7QT1WO6fhNA2X6jDsq/view?usp=sharing
One big Twitter output file.
@danhuacai has added a column for URL and opened a pull request.
Went through a tutorial with @danhuacai on the post-processing; she will go through the code to understand it first, and then we will split up tasks.
https://databricks.com/glossary/pyspark
PySpark might be helpful for handling the huge amount of data once we get output from the post-processor.
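For reference, a minimal PySpark sketch along these lines, assuming the post-processor output is a set of CSVs with columns named `domain` and `url` (the path and column names are illustrative, not the actual schema):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("post-processor-aggregation").getOrCreate()

# Read all post-processor output CSVs at once; Spark handles the partitioning.
df = spark.read.csv("output/*.csv", header=True, inferSchema=True)

# Example aggregation: count links per domain and sort by volume,
# similar to the "sort by most links" idea for the interest output.
counts = (
    df.groupBy("domain")
      .agg(F.count("url").alias("link_count"))
      .orderBy(F.desc("link_count"))
)
counts.write.csv("aggregated_output", header=True, mode="overwrite")
```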
Item 2 is done. The logic for item 1 is there but needs to be tested against real data. The final code is currently on the post-processor branch; we will wait until we can test against crawled Twitter data to confirm before pushing.
@amygaoo will make a change to accept the small .csvs, and then @danhuacai will test it.
@amygaoo has passed the code along and @danhuacai is running it in a virtual machine, but it has not finished yet. We will know more once it returns output. It has been running for ~30 hours on a partial data set. @danhuacai needs to provide the number of files processed in this period, as well as how the VM is provisioned, so that we can benchmark approximately how long post-processing takes.
We did not get this benchmarking done; it still needs to be completed. Amy needs to add more logic to the 'interest output' to sort it by the most links.
@jacqueline-chan and @amygaoo met last night and are trying to find mini .csvs and run the post-processor. They are not at today's meeting, so this process is pending.
Amy ran the post-processor on 10 users; it finished in 2 days. We ran into an issue running the full output because we are still encountering poorly formed .csvs even after running them through Danhua's mini processor. For now, Amy ran this while skipping all malformed records (only about 20). The post-processor is still running and has gone through 600,000 of 5 million records since Saturday/Sunday.
@amygaoo will continue to try to find out where the errors in .csv creation are being introduced so we can resolve the issue.
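A small sketch of one way to locate where the malformed rows appear, so they can be traced back to where they are introduced rather than silently skipped. The directory name and expected column count are assumptions for illustration:

```python
import csv
import glob

EXPECTED_COLUMNS = 6  # assumption: replace with the real schema's column count

for path in glob.glob("twitter_output/*.csv"):
    with open(path, newline="", encoding="utf-8", errors="replace") as f:
        reader = csv.reader(f)
        for row_no, row in enumerate(reader, start=1):
            if len(row) != EXPECTED_COLUMNS:
                # Report the file and row so the source of the bad record
                # can be investigated upstream in the .csv creation step.
                print(f"{path}: row {row_no} has {len(row)} fields: {row[:3]}...")
```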
The last time she checked, it was at 1,000,000 of 5,000,000; it took a week to run one million records, but then there was a connection issue. There is a problem with speed, but also with the inability to pick up after the process is terminated (by things like connection problems).
Suggestions:
Top priority: Make the process more robust (pick up after a break). Second priority: Make it faster.
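A minimal sketch of what "pick up after a break" could look like, assuming records are read line by line from a single input file; the file name, checkpoint path, and `process_record()` are placeholders, not the actual post-processor code:

```python
import os

CHECKPOINT = "postprocess.checkpoint"

def load_checkpoint():
    # Return the index of the next record to process (0 on a fresh run).
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return int(f.read().strip() or 0)
    return 0

def save_checkpoint(next_index):
    # Write atomically so a crash mid-write cannot corrupt the checkpoint.
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "w") as f:
        f.write(str(next_index))
    os.replace(tmp, CHECKPOINT)

def process_record(line):
    pass  # placeholder for the real post-processing logic

start = load_checkpoint()
with open("twitter_output.csv") as f:
    for i, line in enumerate(f):
        if i < start:
            continue  # skip records already handled before the interruption
        process_record(line)
        if (i + 1) % 10_000 == 0:
            save_checkpoint(i + 1)  # resume from the next unprocessed record
```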
TODO: All tweets by @Marianhouk get a hit if someone mentions @Marianhouk, so many tweets from a single Twitter handle end up with the same number of hits. Proposed solution:
@amygaoo completed the refactor for the modified output. It was tested on small test data and appears to work. Text aliases are in list format. In-code documentation is complete. Right now, for newly created nodes
Modifying the post-processor framework to operate more quickly (optimize)
Close to complete, but it still needs to be tested.
Made modifications to periodically remove duplicates from the referrals list during execution, rather than only at the end, to save memory. This ran successfully on a smaller scope and is currently running with the full scope.
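A rough sketch of the periodic-dedup idea, with illustrative names rather than the actual post-processor structures (the compaction interval and the record stream are assumptions):

```python
DEDUP_EVERY = 50_000  # assumption: how often to compact the list

def stream_of_referrals():
    # Placeholder: yield referral URLs from the crawler output.
    yield from ["https://a.example", "https://b.example", "https://a.example"]

referrals = []
for i, url in enumerate(stream_of_referrals()):
    referrals.append(url)
    # Compact periodically instead of once at the end, to keep memory bounded.
    if (i + 1) % DEDUP_EVERY == 0:
        referrals = list(dict.fromkeys(referrals))  # order-preserving dedup

referrals = list(dict.fromkeys(referrals))  # final pass after the stream ends
```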
Given the output files of the domain-crawler and the anticipated output files of the twitter-crawler, how do we parse/transform the data into our output format (JSON/CSV)? This needs to be done in such a way that we can continue to add rules or modifications to the framework as needed, to address things like filtering out non-news or homepage content. One possible shape for such a rule-based layer is sketched below.
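The rule functions and record fields (`url`, `title`) here are assumptions for illustration; the point is that new filters can be appended without touching the parsing code:

```python
import json
from urllib.parse import urlparse

def is_not_homepage(record):
    # Drop records whose URL path is just "/" (likely a homepage, not an article).
    return urlparse(record["url"]).path not in ("", "/")

def has_title(record):
    return bool(record.get("title"))

RULES = [is_not_homepage, has_title]  # append new rules here as they are written

def transform(records):
    for record in records:
        if all(rule(record) for rule in RULES):
            yield record

if __name__ == "__main__":
    sample = [
        {"url": "https://news.example.com/", "title": "Home"},
        {"url": "https://news.example.com/2021/01/story", "title": "A story"},
    ]
    # Only the article record survives the filters; output could equally be CSV.
    print(json.dumps(list(transform(sample)), indent=2))
```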