arjunjauhari / quora-ques


Get Started on the project. #1

Open arjunjauhari opened 6 years ago

arjunjauhari commented 6 years ago

1) Download the dataset.
2) Explore the data.
3) Set up the repository structure.

naik-amey commented 6 years ago

image

naik-amey commented 6 years ago

https://www.kaggle.com/anokas/data-analysis-xgboost-starter-0-35460-lb/code

Summary:
Total number of question pairs for training: 404290
Duplicate pairs: 36.92%
Total number of questions in the training data: 537933
Number of questions that appear multiple times: 111780
Total number of question pairs for testing: 2345796

It is also worth pointing out that the actual number of scored test rows is likely to be much lower than 2.3 million. According to the data page, most of the rows in the test set are auto-generated questions added to pad out the dataset and deter hand-labeling. This means that the true number of rows that are scored could be very low.
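For reference, here is a minimal sketch of how summary numbers like these could be recomputed with only the standard library. It assumes the training CSV has `qid1`, `qid2`, and `is_duplicate` columns; the function name `summarize_pairs` and the sample rows are made up for illustration and are not taken from the linked kernel.

```python
import csv
import io
from collections import Counter


def summarize_pairs(rows):
    """Return pair count, duplicate share, and question-frequency counts.

    `rows` is any iterable of dicts with 'qid1', 'qid2', and 'is_duplicate'
    keys (assumed column names for the training file).
    """
    rows = list(rows)
    total_pairs = len(rows)
    duplicates = sum(int(r["is_duplicate"]) for r in rows)

    # Count how often each question id appears on either side of a pair.
    qid_counts = Counter()
    for r in rows:
        qid_counts[r["qid1"]] += 1
        qid_counts[r["qid2"]] += 1

    return {
        "total_pairs": total_pairs,
        "duplicate_pct": 100.0 * duplicates / total_pairs if total_pairs else 0.0,
        "unique_questions": len(qid_counts),
        "repeated_questions": sum(1 for c in qid_counts.values() if c > 1),
    }


# Tiny made-up sample standing in for the real training file.
SAMPLE_CSV = """id,qid1,qid2,question1,question2,is_duplicate
0,1,2,How do I learn Python?,What is the best way to learn Python?,1
1,3,4,What is machine learning?,How do airplanes fly?,0
2,1,5,How do I learn Python?,How can I learn Java?,0
3,6,7,Why is the sky blue?,What makes the sky blue?,1
"""

summary = summarize_pairs(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print(summary)
```

To run it on the real data, pass a `csv.DictReader` opened on the downloaded training file instead of the in-memory sample.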

For my own reference:

  1. Loading the English stop words with NLTK (https://www.geeksforgeeks.org/removing-stop-words-nltk-python/):
     from nltk.corpus import stopwords
     stops = set(stopwords.words("english"))
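A rough sketch of how a stop-word set like this might be applied to a question. To keep it runnable without downloading the NLTK corpus, it uses a small hand-written set (`STOPS`, an assumption for illustration); the NLTK set from the note above could be passed in its place. The helper name `remove_stops` is hypothetical.

```python
# Small illustrative stop-word set (an assumption, not the NLTK list).
# With NLTK installed and its stopwords corpus downloaded,
# set(stopwords.words("english")) could be passed in instead.
STOPS = {"a", "an", "the", "is", "are", "do", "i", "to", "what", "how", "of", "in"}


def remove_stops(question, stops=STOPS):
    """Lowercase, split on whitespace, trim basic punctuation, drop stop words."""
    tokens = [t.strip("?.,!") for t in question.lower().split()]
    return [t for t in tokens if t and t not in stops]


print(remove_stops("How do I learn Python?"))  # ['learn', 'python']
```

Whitespace splitting is the simplest option here; a proper tokenizer would handle contractions and punctuation more carefully.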