attardi / wikiextractor

A tool for extracting plain text from Wikipedia dumps
GNU Affero General Public License v3.0

Divide up the text into summary and content for NLP processing #249

Open ertosns opened 3 years ago

ertosns commented 3 years ago

I divided the text into two parts, summary and content. This can help in NLP processing, specifically for summarization transformers; for example, the output of wikiextractor can be used to train wiki-summary: https://github.com/ertosns/wiki-summary
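A minimal sketch of the splitting idea, assuming the extracted article text uses blank lines between paragraphs; `split_article` is a hypothetical helper, not part of wikiextractor's actual code, and treats the lead paragraph as the summary:

```python
def split_article(text: str) -> tuple[str, str]:
    """Treat the first non-empty paragraph as the summary
    and everything after it as the content."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    if not paragraphs:
        return "", ""
    # lead paragraph -> summary; the rest, rejoined -> content
    return paragraphs[0], "\n\n".join(paragraphs[1:])

summary, content = split_article(
    "Python is a programming language.\n\n"
    "It was created by Guido van Rossum.\n\n"
    "It is widely used."
)
```

Paired (summary, content) records like this can then be fed directly to a summarization model as (target, source) training examples.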

attardi commented 3 years ago

Isn't

summary = [i for i in page]

the same as:

summary = page
ertosns commented 3 years ago

Yes, of course it is. But in the case of numpy it would be a reference; I see that in this case page is just a plain Python list, so you are absolutely right. Perhaps I thought it was numpy! I will fix it now.
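To illustrate the distinction being discussed: for a plain Python list, a comprehension builds a new (shallow-copied) list, while bare assignment only binds another name to the same object, so later mutations show through the alias but not the copy. A small sketch:

```python
page = ["line1", "line2"]

alias = page               # assignment: same list object
copied = [i for i in page]  # comprehension: new list, same elements

page.append("line3")

# the alias sees the mutation; the copied list does not
assert alias == ["line1", "line2", "line3"]
assert copied == ["line1", "line2"]
```

So when `page` is a plain list the two forms differ only if the list is mutated afterwards; if it never is, `summary = page` is indeed equivalent and cheaper.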

ertosns commented 3 years ago

Also, the output needs to be pruned a bit, for example by adding an option to filter on certain criteria: some output is too long, or too short. I will work on that soon.
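A hypothetical sketch of that pruning idea: keep only summaries whose word count falls within configurable bounds. The `keep` helper and the default thresholds are illustrative assumptions, not options wikiextractor provides:

```python
def keep(summary: str, min_words: int = 5, max_words: int = 120) -> bool:
    """Filter out summaries that are too short or too long."""
    n = len(summary.split())
    return min_words <= n <= max_words

samples = [
    "Too short.",
    "This summary has exactly enough words to pass the filter here.",
]
# drop the 2-word sample, keep the 11-word one
kept = [s for s in samples if keep(s)]
```

Exposing `min_words`/`max_words` as command-line options would let users tune the cutoffs to their downstream model's input limits.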