Nonameentered closed this issue 3 years ago
For the moment, let's store this on the UMIACS object store, since both of you should already have access to it. The only other, perhaps non-obvious, things to do are:
umiacs-root:2017-04-17/stuff
@EntilZha By "all Wikipedia files", did you mean all the files on this site? That looks massive...
These are the main files:
> aws s3 ls s3://pinafore-us-west-2/qanta-jmlr-datasets/wikipedia/
2018-11-07 13:53:06 78078721 enwiki-20180420-all-titles-in-ns0.gz
2018-11-07 13:53:17 248402506 enwiki-20180420-all-titles.gz
2018-11-07 13:54:51 22725972 enwiki-20180420-category.sql.gz
2018-11-07 14:03:37 2384041757 enwiki-20180420-categorylinks.sql.gz
2018-11-07 13:55:03 3161777934 enwiki-20180420-externallinks.sql.gz
2018-11-07 13:56:07 35054023 enwiki-20180420-geo_tags.sql.gz
2018-11-07 13:56:18 1627693553 enwiki-20180420-page.sql.gz
2018-11-07 13:56:36 240350728 enwiki-20180420-page_props.sql.gz
2018-11-07 13:56:47 5848856395 enwiki-20180420-pagelinks.sql.gz
2018-11-07 13:59:19 15921773645 enwiki-20180420-pages-articles-multistream.xml.bz2
2018-11-07 13:57:36 122311739 enwiki-20180420-redirect.sql.gz
2018-11-07 17:52:47 4477266026 parsed-wiki.tar.gz
2018-09-10 21:17:10 633267 vital_articles.json
2018-12-02 23:29:36 483844605 wiki_lookup.2018.04.18.json
2018-11-07 14:21:11 483844605 wiki_lookup.json
2018-12-02 23:35:52 124472056 wikipedia-titles.2018.04.18.json
So you shouldn't need files like these: enwiki-20210301-pages-articles-multistream1.xml-p1p41242.bz2, enwiki-20210301-pages-meta-history1.xml-p1p873.7z, or enwiki-20210301-pages-meta-history1.xml-p1p873.bz2
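To get a feel for how much data the main files in the listing add up to, the `aws s3 ls` output can be totalled with a short script. This is only a minimal sketch: the function name `parse_s3_listing` and the `SAMPLE` string are made up for illustration, and it assumes each object line has the form `date time size name`, as in the listing above. The sample holds just three of the listed objects; pasting in the full listing works the same way.

```python
from typing import Dict

# Three lines copied from the listing above (illustrative subset only).
SAMPLE = """\
2018-11-07 13:59:19 15921773645 enwiki-20180420-pages-articles-multistream.xml.bz2
2018-11-07 17:52:47 4477266026 parsed-wiki.tar.gz
2018-09-10 21:17:10 633267 vital_articles.json
"""


def parse_s3_listing(listing: str) -> Dict[str, int]:
    """Map each object name to its size in bytes, given `aws s3 ls` output."""
    sizes: Dict[str, int] = {}
    for line in listing.splitlines():
        # Expected fields: date, time, size in bytes, object name.
        parts = line.split(None, 3)
        if len(parts) != 4 or not parts[2].isdigit():
            # Skip anything that is not an object line, e.g. the
            # command itself, blank lines, or "PRE <prefix>/" entries.
            continue
        sizes[parts[3]] = int(parts[2])
    return sizes


sizes = parse_s3_listing(SAMPLE)
total = sum(sizes.values())
# Print the largest objects first, then the total in decimal gigabytes.
for name, size in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{size / 1e9:8.2f} GB  {name}")
print(f"{total / 1e9:8.2f} GB  total")
```

Sorting by size makes it easy to see that the multistream XML dump and the parsed-wiki tarball dominate the total.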
@nhatsmrt and I have generated updated files for the Wikipedia redirect mappings and category links, as well as an updated parsed-wiki tar (and files). What's the best way for us to get them updated in the S3 bucket?
The missing files causing #85 should also be addressed by adding this updated parsed-wiki info.