PeterStieg / feb25_bds_classification-of-rakuten-e-commerce-products

Data Exploration & Data Visualization // Pre vs Post text cleaning #PostDefense, #NiceToHave #53

Open Wilsbert12 opened 2 months ago

Wilsbert12 commented 2 months ago

I would like to show the before/after effects of text cleaning:

One graph per category type (primary category, subcategory)

Since the need to clean HTML code etc. arises from the text analysis itself, I would suggest the following overall structure:

  1. Setup / Data Exploration
  2. Data Viz
  3. Text Clean
  4. Before / after text clean comparison
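
As a rough illustration of step 4, the sketch below summarizes text length per category before and after cleaning, which could then feed one graph per category type. The function name `length_by_category`, the `(category, text)` row shape, and the `clean` callable are assumptions made for illustration, not names from the project.

```python
from collections import defaultdict
from statistics import median


def length_by_category(rows, clean):
    """Return {category: (median_len_before, median_len_after)}.

    rows  -- iterable of (category, text) pairs (hypothetical shape)
    clean -- any callable str -> str that performs the text cleaning
    """
    before = defaultdict(list)
    after = defaultdict(list)
    for category, text in rows:
        before[category].append(len(text))
        after[category].append(len(clean(text)))
    return {c: (median(before[c]), median(after[c])) for c in before}


# Toy data; str.strip stands in for the real cleaning step.
rows = [("books", "  a  "), ("books", "bb"), ("toys", " c ")]
summary = length_by_category(rows, str.strip)
```

The same function can be called once with primary categories and once with subcategories as the first element of each row.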

Wilsbert12 commented 2 months ago

@PeterStieg @thomas-borer does this make sense?

PeterStieg commented 2 months ago

@Wilsbert12, please note that currently we are not...

  1. either cutting off lengthy descriptions at the same point via _[:charlimit]
  2. or excluding outliers
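
For reference, the two options could look roughly like this. This is a minimal sketch: `truncate`, `drop_length_outliers`, and the 1.5 × IQR threshold are assumptions for illustration, since the thread does not specify a `charlimit` value or an outlier rule.

```python
import statistics


def truncate(texts, charlimit):
    """Option 1: cut every description off at the same character position."""
    return [t[:charlimit] for t in texts]


def drop_length_outliers(texts, k=1.5):
    """Option 2: drop texts whose length exceeds Q3 + k * IQR (assumed rule)."""
    lengths = [len(t) for t in texts]
    q1, _, q3 = statistics.quantiles(lengths, n=4)
    upper = q3 + k * (q3 - q1)
    return [t for t in texts if len(t) <= upper]


# Toy data: eight short descriptions and one extremely long one.
texts = ["x" * n for n in (10, 11, 12, 12, 13, 14, 15, 16, 5000)]
kept = drop_length_outliers(texts)
cut = truncate(texts, 12)
```

Truncation keeps every row but may still contain junk in the first `charlimit` characters, whereas outlier exclusion removes whole rows.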

Wilsbert12 commented 2 months ago

@PeterStieg I know. But the outliers might be due to "junk" text (very long runs of whitespace, HTML code, repetitions etc.). I just thought that if we are cleaning up the text, we might also show what we have achieved.
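
A minimal sketch of what removing this kind of junk could look like, using only the standard library. The function name `clean_text` and the exact rules (regex-based tag stripping, whitespace collapsing, dropping immediately repeated words) are assumptions, not the project's actual cleaning code.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")                 # naive HTML tag matcher
WS_RE = re.compile(r"\s+")                      # any run of whitespace
REPEAT_RE = re.compile(r"\b(\w+)(?:\s+\1\b)+")  # "word word word" -> "word"


def clean_text(raw):
    """Strip HTML tags and entities, collapse whitespace and repeated words."""
    text = html.unescape(raw)            # e.g. "&nbsp;" becomes "\xa0"
    text = TAG_RE.sub(" ", text)         # replace tags with a space
    text = WS_RE.sub(" ", text).strip()  # collapse whitespace runs
    return REPEAT_RE.sub(r"\1", text)    # drop immediate word repetitions


raw = "<p>super   super&nbsp;livre</p>"
cleaned = clean_text(raw)
saved = len(raw) - len(cleaned)  # characters removed by cleaning
```

Comparing `len(raw)` with `len(cleaned)` per product is one simple way to quantify how much of the outlier length was junk.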