BIDS-projects / etl

Extraction, transformation, and loading for topic modeling

Implement a flag to only preprocess links that have not been processed #9

Open don-han opened 8 years ago

don-han commented 8 years ago

In other words, do not preprocess a link that has already been preprocessed. (This reduces runtime.)

don-han commented 8 years ago

@chewisinho What do you think? Preprocessing takes a lot of time even for 300 pages, so I think it might be better not to rerun URLs that have already been processed.

chewisinho commented 8 years ago

I think it's fine for now. Here are some things to consider:

  1. We still have more filtering to do to get rid of "javascript enabled spambots", so that means we have to run the preprocessor from scratch again anyway.
  2. The majority of the time is probably spent doing natural language processing (based on my experience). Correct me if I'm wrong, but I think running with the -l flag is pretty fast.
  3. In what use cases would the flag save a lot of time? Only when we have already processed a lot of data and then receive a new batch, because the flag only avoids repeating steps we have already done. There is an easy workaround: suppose we run the preprocessor on 1000 websites and save the results to the filtered collection, and then we receive 1000 more websites. If we start fresh with a new HTML collection (containing only the 1000 new websites) and write to the old filtered collection (so the data is still aggregated into one database), we get the same effect as the flag.
  4. Actually I just thought of a good way to implement the flag. Let me know (maybe at the meeting) if you think this is important, and I can just make the flag really quickly.
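One way the flag could work is sketched below. This is only an illustration with hypothetical names (`preprocess`, `run_preprocessor`, and a dict standing in for the filtered collection); the actual preprocessor's interfaces may differ. The idea is just to check whether a link is already in the filtered output and skip it if so:

```python
def preprocess(url):
    # Stand-in for the real pipeline (jusText cleaning, NLTK/NER, etc.).
    return {"url": url, "text": "processed:" + url}

def run_preprocessor(urls, filtered, skip_processed=True):
    """Preprocess each URL, writing results into `filtered` (a dict keyed
    by URL, standing in for the MongoDB filtered collection). Returns the
    number of URLs actually processed."""
    processed_count = 0
    for url in urls:
        if skip_processed and url in filtered:
            continue  # already preprocessed; skip to save time
        filtered[url] = preprocess(url)
        processed_count += 1
    return processed_count
```

With `skip_processed=True`, a second run over an enlarged URL list only touches the new URLs, which is the same effect as the fresh-collection workaround described in point 3.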
don-han commented 8 years ago

I might have to do a more detailed time complexity analysis, but last time I timed it I think the majority of the time was spent on jusText rather than NLTK. I might be mistaken, though, so I will try measuring the times more systematically. -l definitely is faster, but that's because it skips the clean function, which does both HTML cleaning and NER.
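A simple way to measure the stages more systematically is to wrap each one in a timing context manager and accumulate wall-clock time per stage. This is a generic sketch (the stage names are hypothetical, not the preprocessor's real function names):

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(stage, timings):
    """Accumulate the wall-clock time of the wrapped block under `stage`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = timings.get(stage, 0.0) + time.perf_counter() - start

# Usage sketch: wrap each pipeline stage, then compare totals.
timings = {}
with timed("justext", timings):
    pass  # e.g. jusText boilerplate removal would run here
with timed("nltk", timings):
    pass  # e.g. NLTK tagging / NER would run here
```

Summing the per-stage totals across a full run would settle whether jusText or NLTK dominates.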