glynnbird opened this issue 5 years ago
https://wiki.dbpedia.org/develop/datasets/dbpedia-dataset-2019-08-30-pre-release

Use the dataset above to get a list of entities and add them to the dictionary that powers the anagram engine.
Stats from this exercise:

- 15.2 million articles in the data dump
- We boiled it down to 8.3 million
We could try to figure out the most popular pages from https://dumps.wikimedia.org/other/pageviews/2020/2020-01/ and only use those.
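A minimal sketch of filtering one of those pageviews dumps, assuming the documented line format of `project page_title view_count bytes` (space-separated, underscores for spaces in titles); the sample lines, project code, and threshold here are illustrative:

```python
def parse_pageviews(lines, project="en", min_views=50):
    """Yield (page, views) for pages on `project` with at least `min_views`."""
    for line in lines:
        parts = line.split(" ")
        if len(parts) != 4:
            continue  # skip malformed lines
        proj, page, views, _bytes = parts
        if proj == project and int(views) >= min_views:
            # Titles in the dump use underscores instead of spaces
            yield page.replace("_", " "), int(views)

sample = [
    "en Taylor_Swift 56 0",
    "en Obscure_Page 3 0",
    "de Taylor_Swift 12 0",
]
print(list(parse_pageviews(sample)))  # [('Taylor Swift', 56)]
```

In practice the dump files are gzipped hourly snapshots, so a real run would stream them through `gzip.open` rather than hold lines in memory.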
```sql
CREATE TABLE stats (
  page VARCHAR(255) PRIMARY KEY,
  views INTEGER NOT NULL
);

-- Upsert: insert a new page, or add to its view count if it already exists
INSERT INTO stats (page, views)
VALUES ('Taylor Swift', 56)
ON CONFLICT (page) DO UPDATE SET views = views + 56;
```
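A sketch of driving that upsert from Python with the built-in `sqlite3` module (SQLite 3.24+ supports `ON CONFLICT ... DO UPDATE`); the table and column names match the issue, everything else is illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stats (page VARCHAR(255) PRIMARY KEY, views INTEGER NOT NULL)"
)

def add_views(page, views):
    # excluded.views is the value from the attempted insert, so repeated
    # calls for the same page accumulate its view count.
    conn.execute(
        "INSERT INTO stats (page, views) VALUES (?, ?) "
        "ON CONFLICT (page) DO UPDATE SET views = views + excluded.views",
        (page, views),
    )

add_views("Taylor Swift", 56)
add_views("Taylor Swift", 44)
print(conn.execute(
    "SELECT views FROM stats WHERE page = ?", ("Taylor Swift",)
).fetchone()[0])  # 100
```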
TO DO

- Put the pageviews code on windermere
- Run it for yesterday's date automatically, in a cron job
- Combine the output of the above with combined.txt and de-dupe (sort -u)
- Commit the new file to git and deploy it (which then rebuilds the anagram dictionary)
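The merge/de-dupe step above (the `sort -u` equivalent) can be sketched in Python; the file names follow the issue, and one entity name per line is assumed:

```python
def merge_unique(existing_lines, new_lines):
    """Return the union of two line lists, sorted and de-duplicated,
    mirroring what `sort -u` does on the concatenated files."""
    return sorted(set(existing_lines) | set(new_lines))

combined = ["Adam Gilchrist", "Taylor Swift"]   # e.g. from combined.txt
todays = ["Taylor Swift", "Windermere"]         # e.g. today's pageviews output
print(merge_unique(combined, todays))
# ['Adam Gilchrist', 'Taylor Swift', 'Windermere']
```

The git commit and deploy steps are left to the cron script itself.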
We want to save each file with a predictable name so that a follow-up project can analyse successive files and find new entrants.
We need to write all this stuff up!
Do we have enough data in our anagrams database?
Solution: ADAM GILCHRIST