bene-ges / nemo_compatible

useful things that work with NVIDIA NeMo library
Apache License 2.0

Wiki article download #7

Closed thomaschhh closed 9 months ago

thomaschhh commented 9 months ago

Since we need the Wikipedia articles to replicate your steps, I suggest changing https://github.com/bene-ges/nemo_compatible/blob/2b1ca5934d57256006a0a9f66c467587ba07df05/scripts/nlp/en_spellmapper/dataset_preparation/preprocess_yago.sh#L34

to

WIKIPEDIA_FOLDER=./yago_wikipedia
mkdir -p "$WIKIPEDIA_FOLDER"

awk 'BEGIN {FS="\t"; print "#!/usr/bin/env bash"} {print "wget \"https://en.wikipedia.org/w/api.php?format=xml&action=query&prop=extracts&titles=" $1 "&redirects=true&format=json&explaintext=1&exsectionformat=plain\" -O \"'"$WIKIPEDIA_FOLDER"'/" $2 ".txt\"\nsleep 0.1"}' < yago.uniq2 > run_wget.sh
bash ./run_wget.sh

This is based on what is needed later in https://github.com/bene-ges/nemo_compatible/blob/2b1ca5934d57256006a0a9f66c467587ba07df05/scripts/nlp/en_spellmapper/dataset_preparation/build_training_data.sh#L20
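
To check the generated script before hitting the live API, something like the following dry run may help. It is only a sketch: the two sample rows, the temporary working directory, and the `dir` awk variable are made up for illustration, and it assumes `yago.uniq2` is tab-separated with the article title in column 1 and the output file stem in column 2 (which is how the awk command above reads it). It passes the folder with `-v` rather than splicing it into the awk program through shell quoting, and it downloads nothing.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical sample input: column 1 = Wikipedia title, column 2 = output file stem.
workdir=$(mktemp -d)
printf 'Albert_Einstein\talbert_einstein\nMarie_Curie\tmarie_curie\n' > "$workdir/yago.uniq2"

WIKIPEDIA_FOLDER=./yago_wikipedia

# Generate one wget command plus one sleep per input row.
# The folder is handed to awk as the variable "dir" (an illustrative name).
awk -v dir="$WIKIPEDIA_FOLDER" 'BEGIN {FS="\t"; print "#!/usr/bin/env bash"}
{
  url = "https://en.wikipedia.org/w/api.php?format=xml&action=query&prop=extracts&titles=" $1 "&redirects=true&format=json&explaintext=1&exsectionformat=plain"
  print "wget \"" url "\" -O \"" dir "/" $2 ".txt\""
  print "sleep 0.1"
}' < "$workdir/yago.uniq2" > "$workdir/run_wget.sh"

# Inspect the generated commands instead of running them.
cat "$workdir/run_wget.sh"

# Count the wget lines; there should be one per input row.
wget_count=$(grep -c '^wget ' "$workdir/run_wget.sh")
echo "generated $wget_count wget commands"
```

If the printed paths and URLs look right, the same awk command can be pointed at the real `yago.uniq2`.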

bene-ges commented 9 months ago

If it works, you can make a pull request and I will accept it.

thomaschhh commented 9 months ago

Fixed in #8