This repository applies machine translation to product categorization on data from an e-commerce website (Flipkart). It classifies each product description into the primary category of its category tree and documents the path to an optimal model pipeline.
Clone the repo
git clone https://github.com/Priyanshiguptaaa/Flipkart_Product_Categorization/-.git
cd Flipkart_Product_Categorization/
**Note: The code is implemented in Google Colaboratory, which lets us build the project without installing anything locally. Installing some libraries may take time depending on your internet connection and system properties. You can run the notebook on the Google Colab platform, or download it as a Jupyter notebook and run it locally.
You can download the E-Commerce Dataset sample from here
The following steps are performed:
**Note: An alternative approach to cleaning is to scrape the product category from the website via the product URL, using BeautifulSoup and Selenium.
Feature Name | Type | Description |
---|---|---|
Description | STR | The description of the Product (Primary Feature) |
Product_Category_Tree | STR | Used to Extract the Primary Category |
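
As a rough illustration of how the primary category might be pulled out of `Product_Category_Tree`, here is a minimal sketch. It assumes the tree is stored as a list-like string whose levels are separated by `>>`; the function name `primary_category` and the sample value are hypothetical, so check them against the actual dataset before relying on this.

```python
def primary_category(tree: str) -> str:
    """Return the top-level category of a category-tree string.

    Assumption: the tree looks like '["A >> B >> C"]', with levels
    separated by '>>' and the whole path wrapped in brackets and quotes.
    """
    # Drop the surrounding brackets, quotes, and whitespace.
    cleaned = tree.strip().strip("[]").strip().strip("\"'").strip()
    # The first level of the path is the primary category.
    return cleaned.split(">>")[0].strip()


sample = '["Clothing >> Women\'s Clothing >> Shorts"]'
print(primary_category(sample))
```

A tree with only one level is returned unchanged apart from the stripped wrapper, so the function can be applied to every row without special cases.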
Topic modeling is a type of statistical modeling for discovering the abstract “topics” that occur in a collection of documents. Latent Dirichlet Allocation (LDA) is an example of a topic model and is used to assign the text in a document to a particular topic. It builds a topics-per-document model and a words-per-topic model, both modeled as Dirichlet distributions.
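
To make the two distributions concrete, the sketch below fits LDA with a small collapsed Gibbs sampler written in plain Python. It is an illustrative toy, not the code used in the notebooks (which may well use a library implementation). The function name `lda_gibbs`, the hyperparameter values, and the tiny corpus are all assumptions.

```python
import random


def lda_gibbs(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA on tokenized documents with collapsed Gibbs sampling.

    Returns (theta, phi, vocab): topics-per-document and words-per-topic
    probability tables, plus the sorted vocabulary that indexes phi.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]            # tokens per (doc, topic)
    topic_word = [[0] * n_words for _ in range(n_topics)]  # tokens per (topic, word)
    topic_total = [0] * n_topics                           # tokens per topic
    assignments = []

    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, k = word_id[w], assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    # Smoothed estimates of the two Dirichlet-distributed tables.
    theta = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in counts]
        for counts, doc in zip(doc_topic, docs)
    ]
    phi = [
        [(c + beta) / (topic_total[k] + n_words * beta) for c in topic_word[k]]
        for k in range(n_topics)
    ]
    return theta, phi, vocab


docs = [["shirt", "cotton", "shirt"], ["phone", "battery", "phone"],
        ["cotton", "shirt"], ["battery", "phone"]]
theta, phi, vocab = lda_gibbs(docs, n_topics=2)
print(theta[0])  # topic mixture of the first document
```

Each row of `theta` is one document's mixture over topics and each row of `phi` is one topic's distribution over the vocabulary, so every row sums to 1.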
For product categorization, conventional methods rely on machine learning classification algorithms. The paper "Don’t Classify, Translate: Multi-Level E-Commerce Product Categorization Via Machine Translation" (Maggie Yundi Li, Liling Tan, Stanley Kok. 2018. https://arxiv.org/pdf/1812.05774.pdf) instead proposes a paradigm based on machine translation and reports better prediction accuracy than classification systems. This repository implements the proposed model.
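
The translation framing can be sketched as follows: the product description is the source sentence and the category path is the target sentence. This is a minimal sketch under stated assumptions rather than the notebook's actual preprocessing. The helper names, the `<sos>`/`<eos>` markers, whitespace tokenization, and the choice to treat each category level as one target token are all hypothetical. The second helper shows the shift used by teacher forcing, which the results table in this README names as part of the model.

```python
SOS, EOS = "<sos>", "<eos>"


def to_translation_pair(description: str, category_path: str):
    """Turn one product into a (source tokens, target tokens) pair."""
    # Source side: lowercased, whitespace-tokenized description.
    source = description.lower().split()
    # Target side: one token per category level, wrapped in start/end markers.
    levels = [level.strip() for level in category_path.split(">>")]
    return source, [SOS] + levels + [EOS]


def teacher_forcing_shift(target):
    """Split a target sequence into decoder inputs and expected outputs.

    With teacher forcing, the decoder is fed the ground-truth token at
    each step and trained to predict the token that follows it.
    """
    return target[:-1], target[1:]


src, tgt = to_translation_pair(
    "Alisha Solid Women's Cycling Shorts", "Clothing >> Women's Clothing"
)
dec_in, dec_out = teacher_forcing_shift(tgt)
print(src)
print(dec_in, dec_out)
```

If only the primary category is predicted, the target reduces to a single level between the two markers and the same helpers still apply.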
To get started, upload the notebooks to Google Colab, open them in playground mode, and run the cells. (You must be logged in with your Google account and provide additional authorization.) If you want to run locally, a requirements.txt file is provided:
git clone https://github.com/Priyanshiguptaaa/Flipkart_Product_Categorization/-.git
cd Flipkart_Product_Categorization/
pip install -r requirements.txt
Model Name | Accuracy |
---|---|
Seq2Seq + Attention + Teacher Forcing | 81% |
[1] "Unconstrained Product Categorization with Sequence-to-Sequence Models". Maggie Yundi Li, Liling Tan, Stanley Kok, Ewa Szymanska. 2018. https://www.comp.nus.edu.sg/~skok/papers/ecomdc18.pdf
[2] "Don’t Classify, Translate: Multi-Level E-Commerce Product Categorization Via Machine Translation". Maggie Yundi Li, Liling Tan, Stanley Kok. 2018. https://arxiv.org/pdf/1812.05774.pdf
[3] "Effective Approaches to Attention-based Neural Machine Translation". Minh-Thang Luong, Hieu Pham, Christopher D. Manning. 2015. https://arxiv.org/abs/1508.04025
[4] "Sequence to Sequence Learning with Neural Networks". Ilya Sutskever, Oriol Vinyals, Quoc V. Le. 2014. https://arxiv.org/abs/1409.3215
[5] "Neural Machine Translation by Jointly Learning to Align and Translate". Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio. 2016. https://arxiv.org/abs/1409.0473
[6] "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation". Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, Yoshua Bengio. 2014. https://arxiv.org/abs/1406.1078
[7] "A Neural Conversational Model". Oriol Vinyals, Quoc Le. 2015. https://arxiv.org/abs/1506.05869
[8] "Language Modelling for Handling Out-of-Vocabulary Words in Natural Language Processing". Shabeel Meemulla Kandi. 2018. https://www.researchgate.net/publication/335757797_Language_Modelling_for_Handling_Out-of-Vocabulary_Words_in_Natural_Language_Processing
[9] "Attention Is All You Need". Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin. 2017. https://arxiv.org/abs/1706.03762
[10] "Pay Less Attention with Lightweight and Dynamic Convolutions". Felix Wu, Angela Fan, Alexei Baevski, Yann N. Dauphin, Michael Auli. 2019. https://arxiv.org/abs/1901.10430
[11] "Machine Translation Approaches: Issues and Challenges". M. D. Okpor. 2014. https://www.ijcsi.org/papers/IJCSI-11-5-2-159-165.pdf
[12] "Augmenting Neural Machine Translation with Knowledge Graphs". Diego Moussallem, Mihael Arčan, Axel-Cyrille Ngonga Ngomo, Paul Buitelaar. 2019. https://arxiv.org/pdf/1902.08816.pdf
[13] "Knowledge Graphs Enhanced Neural Machine Translation". Yang Zhao, Jiajun Zhang, Yu Zhou, Chengqing Zong. 2020. https://www.ijcai.org/proceedings/2020/0559.pdf
[14] "Knowledge Graph Enhanced Neural Machine Translation via Multi-task Learning on Sub-entity Granularity". Yang Zhao, Lu Xiang, Junnan Zhu, Jiajun Zhang, Yu Zhou, Chengqing Zong. 2020. https://www.aclweb.org/anthology/2020.coling-main.397.pdf
[15] "Integrating Graph Contextualized Knowledge into Pre-trained Language Models". Bin He, Di Zhou, Jinghui Xiao, Xin Jiang, Qun Liu, Nicholas Jing Yuan, Tong Xu. 2019. https://arxiv.org/pdf/1912.00147.pdf
[16] "Visualizing Semantic Structures of Sequential Data by Learning Temporal Dependencies". Kyoung-Woon On, Eun-Sol Kim, Yu-Jung Heo, Byoung-Tak Zhang.
[17] "Everyone Likes Shopping! Multi-class Product Categorization for e-Commerce". Zornitsa Kozareva.
[18] "Large-scale Multi-class and Hierarchical Product Categorization for an E-commerce Giant". Ali Cevahir, Koji Murakami.
[19] "GRAPHSEQ2SEQ: Graph-sequence-2-sequence for neural machine translation"