Pinafore / qb

QANTA Quiz Bowl AI
MIT License

Storing Updated Wikipedia Redirects and Category Links in S3 #86

Closed: Nonameentered closed this issue 3 years ago

Nonameentered commented 3 years ago

@nhatsmrt and I have generated updated files for the Wikipedia redirect mappings and category links, as well as an updated parsed-wiki tar (and files). What's the best way for us to get them updated in the S3 bucket?

The missing files causing #85 should also be addressed by adding this updated parsed-wiki info.

EntilZha commented 3 years ago

For the moment, let's store this on the UMIACS object store, since both of you should already have access to it. The only other, perhaps non-obvious, things to do are:

  1. Download all the other Wikipedia files for that dump date (in the past, it has bitten us not to have grabbed everything)
  2. Store all the original Wikipedia files and the outputs in a date-versioned path, e.g., umiacs-root:2017-04-17/stuff
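
A minimal sketch of the two steps above, in Python. It assumes the public dump layout at dumps.wikimedia.org and mirrors the `umiacs-root:2017-04-17/stuff` example for the destination; the helper names `dump_url` and `versioned_path` are hypothetical and are not part of the qb codebase.

```python
from datetime import date

# Public Wikimedia dump host; each dump date has its own directory.
DUMP_BASE_URL = "https://dumps.wikimedia.org/enwiki"


def dump_url(dump_date: date, suffix: str) -> str:
    """Build the download URL for one file of a given enwiki dump date."""
    stamp = dump_date.strftime("%Y%m%d")
    return f"{DUMP_BASE_URL}/{stamp}/enwiki-{stamp}-{suffix}"


def versioned_path(root: str, dump_date: date, filename: str) -> str:
    """Build a date-versioned destination, following the root:YYYY-MM-DD/name example."""
    return f"{root}:{dump_date.isoformat()}/{filename}"


if __name__ == "__main__":
    d = date(2021, 3, 1)
    for suffix in ("redirect.sql.gz", "categorylinks.sql.gz"):
        name = f"enwiki-{d.strftime('%Y%m%d')}-{suffix}"
        print(dump_url(d, suffix), "->", versioned_path("umiacs-root", d, name))
```

Keeping the dump date in the path means the original files and anything derived from them stay together under a single prefix.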

nhatsmrt commented 3 years ago

@EntilZha By all Wikipedia files, do you mean all the files on this site? That looks massive...

EntilZha commented 3 years ago

These are the main files:

> aws s3 ls s3://pinafore-us-west-2/qanta-jmlr-datasets/wikipedia/
2018-11-07 13:53:06   78078721 enwiki-20180420-all-titles-in-ns0.gz
2018-11-07 13:53:17  248402506 enwiki-20180420-all-titles.gz
2018-11-07 13:54:51   22725972 enwiki-20180420-category.sql.gz
2018-11-07 14:03:37 2384041757 enwiki-20180420-categorylinks.sql.gz
2018-11-07 13:55:03 3161777934 enwiki-20180420-externallinks.sql.gz
2018-11-07 13:56:07   35054023 enwiki-20180420-geo_tags.sql.gz
2018-11-07 13:56:18 1627693553 enwiki-20180420-page.sql.gz
2018-11-07 13:56:36  240350728 enwiki-20180420-page_props.sql.gz
2018-11-07 13:56:47 5848856395 enwiki-20180420-pagelinks.sql.gz
2018-11-07 13:59:19 15921773645 enwiki-20180420-pages-articles-multistream.xml.bz2
2018-11-07 13:57:36  122311739 enwiki-20180420-redirect.sql.gz
2018-11-07 17:52:47 4477266026 parsed-wiki.tar.gz
2018-09-10 21:17:10     633267 vital_articles.json
2018-12-02 23:29:36  483844605 wiki_lookup.2018.04.18.json
2018-11-07 14:21:11  483844605 wiki_lookup.json
2018-12-02 23:35:52  124472056 wikipedia-titles.2018.04.18.json

So you shouldn't need files like these: enwiki-20210301-pages-articles-multistream1.xml-p1p41242.bz2, enwiki-20210301-pages-meta-history1.xml-p1p873.7z, or enwiki-20210301-pages-meta-history1.xml-p1p873.bz2
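
One way to apply that rule when picking files from a dump directory is sketched below in Python. The function name `is_main_dump_file` and the filtering rule are assumptions inferred from the example names above (split page-range chunks and `pages-meta-history` files are skipped); it is not an official list of what the pipeline needs.

```python
import re

# Split dump chunks end in a page range such as ".xml-p1p41242.bz2".
_PAGE_RANGE_CHUNK = re.compile(r"\.xml-p\d+p\d+\.(bz2|7z)$")


def is_main_dump_file(name: str) -> bool:
    """Return True for whole-dump files, False for split chunks and full-history files."""
    if "pages-meta-history" in name:
        return False
    return _PAGE_RANGE_CHUNK.search(name) is None


if __name__ == "__main__":
    candidates = [
        "enwiki-20180420-redirect.sql.gz",
        "enwiki-20180420-pages-articles-multistream.xml.bz2",
        "enwiki-20210301-pages-articles-multistream1.xml-p1p41242.bz2",
        "enwiki-20210301-pages-meta-history1.xml-p1p873.7z",
    ]
    for name in candidates:
        print(name, is_main_dump_file(name))
```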