Open don-han opened 8 years ago
@chewisinho What da ya think? preprocess takes a lot of time even for 300 pages, and I think it might be better not to rerun URLs that have already been processed?
I think it's fine for now. Here are some things to consider:
-l flag is pretty fast.
I might have to do a more detailed time complexity analysis. Last time I timed it, I think more of the time was spent in jusText than in NLTK, but I might be mistaken, so I will try a more systematic way of measuring times. -l definitely is faster, but that's because it doesn't go through the clean function, which does both HTML cleaning and NER.
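One way to measure this more systematically is to accumulate wall-clock time per stage inside clean. The following is only a sketch: `timed`, `strip_html`, `tag_entities`, and `report` are hypothetical names, and the two stage functions are placeholders standing in for the real jusText and NLTK calls, which are not shown in this thread.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Total seconds spent in each named stage, summed over all pages.
stage_totals = defaultdict(float)


@contextmanager
def timed(stage):
    """Add the elapsed wall-clock time of the enclosed block to stage_totals."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_totals[stage] += time.perf_counter() - start


def strip_html(raw_html):
    # Placeholder (assumption): the real step would call jusText here.
    return raw_html


def tag_entities(text):
    # Placeholder (assumption): the real step would run NLTK NER here.
    return []


def clean(raw_html):
    """Run both stages, timing each one separately."""
    with timed("html_cleaning"):
        text = strip_html(raw_html)
    with timed("ner"):
        entities = tag_entities(text)
    return text, entities


def report():
    """Return (stage, seconds) pairs, slowest stage first."""
    return sorted(stage_totals.items(), key=lambda kv: kv[1], reverse=True)
```

Calling `clean` on every page and then reading `report()` would show which stage dominates, assuming the placeholders are swapped for the real calls.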
In other words, do not preprocess a link that has already been preprocessed. (Reduce time)
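A minimal sketch of that idea, assuming the set of processed URLs can be kept in a small JSON file between runs. The names `load_processed`, `save_processed`, and `preprocess_new`, the `cache_path` argument, and the `preprocess_one` callback are all hypothetical and are not taken from the project's code.

```python
import json
from pathlib import Path


def load_processed(cache_path):
    """Return the set of URLs recorded by earlier runs (empty if no cache yet)."""
    path = Path(cache_path)
    if not path.exists():
        return set()
    return set(json.loads(path.read_text(encoding="utf-8")))


def save_processed(cache_path, processed):
    """Write the processed URLs back to disk as a sorted JSON list."""
    Path(cache_path).write_text(json.dumps(sorted(processed)), encoding="utf-8")


def preprocess_new(urls, preprocess_one, cache_path):
    """Call preprocess_one only on URLs that are not in the cache yet.

    preprocess_one is a stand-in for whatever does the real work
    (for example, fetching the page and running clean on it).
    Returns the list of URLs that were processed in this run.
    """
    processed = load_processed(cache_path)
    newly_done = []
    for url in urls:
        if url in processed:
            continue  # already handled in an earlier run (or earlier in this one)
        preprocess_one(url)
        processed.add(url)
        newly_done.append(url)
    save_processed(cache_path, processed)
    return newly_done
```

With this shape, a second run over the same 300 pages would only touch links added since the last run. A real version would probably also want a way to force reprocessing when the cleaning code itself changes.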