Bread-and-Code / Text-Summarization

Given a text document, the tool produces its summary.
MIT License

Dimensionality Reduction problem #2

Open Samsomyajit opened 4 years ago

Samsomyajit commented 4 years ago

@Samsomyajit (Somyajit Chakraborty) As Sayan da (@sayanmondal2098) reported to me earlier, Dimensionality Reduction will play an important role in building a well-modelled Text Summarizer, so I have been doing some research and I may have come across a solution. [Though further discussion on the topic seems necessary.]

Irrespective of which of our two approaches we take to the Text Summarizer - Abstractive or Extractive - Dimensionality Reduction is a problem in its own right. The papers I have been reading suggest different solutions to it, and if it is not explained and dealt with properly it could stall our project.

So, firstly, let us understand the nature of our problem -

Since the text files from which we will build our summarizer are high-dimensional (complex), visualizing the dataset becomes a serious problem, and if we cannot visualize the data we cannot decide which algorithm, or even which approach, to use for the summarizer. So what we need is a Dimensionality Reduction algorithm (process) to reduce the complex dimensionality of our dataset down to 2-D/3-D so that we can visualize it in clear plots.

What we already have and the problem:

Standard dimensionality reduction methods such as principal component analysis (PCA), locally linear embedding (LLE) (Roweis and Saul, 2000), or t-distributed stochastic neighbor embedding (t-SNE) (van der Maaten and Hinton, 2008) take as input a set of feature vectors such as bag-of-words or tf vectors. An obvious drawback is that such methods ignore the textual nature of documents and instead treat the vocabulary words v1, . . . , vn as abstract orthogonal dimensions.
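To make the "feature vectors in, low-dimensional coordinates out" picture concrete, here is a minimal scikit-learn sketch that projects tf-idf document vectors to 2-D with PCA and t-SNE; the toy corpus and every parameter value are illustrative assumptions, not project choices:

```python
# Minimal sketch: project tf-idf document vectors to 2-D with PCA and t-SNE.
# The toy corpus and all parameter values are placeholders, not project settings.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

docs = [
    "the cat sat on the mat",
    "dogs and cats are common pets",
    "stock prices fell sharply today",
    "the market rallied after the report",
]

# Documents become rows of a (sparse) tf-idf matrix: one abstract dimension per word.
X = TfidfVectorizer().fit_transform(docs)

# PCA needs a dense array; fine for a small sample, not for the full corpus.
X_pca = PCA(n_components=2).fit_transform(X.toarray())

# t-SNE with a small perplexity because the toy corpus has only 4 documents.
X_tsne = TSNE(n_components=2, perplexity=2, init="random").fit_transform(X.toarray())

print(X_pca)
print(X_tsne)
```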

What we can do:

We focus on the following type of non-Euclidean geometry, where the distance between documents x and y is defined as

d_T(x, y) = sqrt((x − y)ᵀ T (x − y)) .... (1)

Here T ∈ R^(n×n) is a symmetric positive semidefinite matrix, and we assume that documents x, y are represented as term-frequency (tf) column vectors. Since T can always be written as HᵀH for some matrix H ∈ R^(n×n), an equivalent but sometimes more intuitive interpretation of (1) is to compose the mapping x ↦ Hx with the Euclidean geometry:

d_T(x, y) = d_I(Hx, Hy) = ‖Hx − Hy‖₂ .... (2)

We can view T as encoding the semantic similarity between pairs of words and H as smoothing the tf vector by mapping observed words to related but unobserved words. Therefore, the geometry realized by (1) or (2) may be used to derive novel dimensionality reduction methods that are customized to text in general and to specific text domains in particular. The main challenge is to obtain the matrices H or T that describe the relationship amongst vocabulary words appropriately.
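As a concrete illustration of (1) and (2), here is a minimal NumPy sketch with a tiny hand-written similarity matrix T (an assumption for illustration only): semantically related words get a nonzero entry in T, and the distance through T = HᵀH equals the plain Euclidean distance after mapping x ↦ Hx.

```python
# Minimal sketch of the geometry in (1)/(2): a word-similarity matrix T = HᵀH
# turns Euclidean distance on smoothed vectors Hx into a text-aware metric.
# The tiny vocabulary and the hand-written T are illustrative assumptions.
import numpy as np

vocab = ["car", "automobile", "banana"]

# T encodes semantic similarity between vocabulary words (symmetric PSD).
T = np.array([
    [1.0, 0.9, 0.0],   # "car" is close to "automobile", unrelated to "banana"
    [0.9, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

# Factor T = HᵀH via the symmetric square root from an eigen-decomposition
# (a Cholesky factor would work just as well).
w, V = np.linalg.eigh(T)
H = (V * np.sqrt(np.clip(w, 0, None))) @ V.T

def d_T(x, y):
    """Distance (1): sqrt((x - y)ᵀ T (x - y)) on tf column vectors."""
    d = x - y
    return float(np.sqrt(d @ T @ d))

def d_euclid_after_H(x, y):
    """Equivalent form (2): ordinary Euclidean distance after the map x -> Hx."""
    return float(np.linalg.norm(H @ x - H @ y))

# Two tf vectors that use different but semantically similar words.
x = np.array([1.0, 0.0, 0.0])   # "car"
y = np.array([0.0, 1.0, 0.0])   # "automobile"

print(d_T(x, y))                # noticeably smaller than the plain Euclidean sqrt(2)
print(d_euclid_after_H(x, y))   # same value, computed via (2)
```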

[To skip all the above math: we can use unsupervised learning to solve our problem, since our data will mostly be unlabelled.]

Or, as an alternative to the above approach, we can use a clustering (another non-Euclidean) method to create cluster matrices labelled "positive sentiment words", "negative sentiment words" and "objective words", along with their intensities and reviews, and then use t-SNE or LLE.

[But this method will create its own problems, though nothing that we can't manage; a minimal sketch of the idea follows below...]
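Here is a minimal sketch of that route, assuming a toy corpus: cluster tf-idf vectors into three groups (standing in for the positive/negative/objective split, which in practice would need a sentiment lexicon to label) and then embed the same vectors with t-SNE for plotting.

```python
# Minimal sketch of the alternative route: cluster word/document vectors first,
# then embed with t-SNE for visualization. Mapping clusters to "positive" /
# "negative" / "objective" labels would need a sentiment lexicon; here the
# labels are just cluster ids. Corpus and parameters are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

docs = [
    "great wonderful excellent service",
    "terrible awful rude experience",
    "the report was published on monday",
    "amazing friendly helpful staff",
    "horrible slow disappointing delivery",
    "the meeting is scheduled for noon",
]

X = TfidfVectorizer().fit_transform(docs).toarray()

# Three clusters as a stand-in for the three intended word groups.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# 2-D embedding of the same vectors for plotting, colored by cluster label.
emb = TSNE(n_components=2, perplexity=3, init="random", random_state=0).fit_transform(X)

for doc, lab, (a, b) in zip(docs, labels, emb):
    print(f"cluster {lab}  ({a:+.2f}, {b:+.2f})  {doc}")
```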

Now, in total, there are four tools/methods we can use to do all of the above (a small WordNet sketch for option 4 follows this list):

1. Manual specification (applying unsupervised algorithms to the diffusion matrix, which then acts as a frequency counter; all of this is done on an unlabelled dataset).
2. Contextual diffusion (based on word similarities and weights).
3. Google (Web) n-grams -- a modified contextual diffusion.
4. WordNet (using hierarchical labelled word clusters, hypernyms, synonyms, ...).
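For option 4, a minimal sketch of how a WordNet-based similarity matrix T could be built with NLTK. The word list is an illustrative assumption, and a matrix built this way is symmetric but not guaranteed positive semidefinite, so it may need a correction before being used in (1).

```python
# Minimal sketch of option 4 (WordNet): build a small word-similarity matrix T
# from WordNet path similarity, which could then play the role of T in (1).
# Requires `nltk` and the WordNet corpus (nltk.download("wordnet")); the word
# list is an illustrative assumption.
import numpy as np
from nltk.corpus import wordnet as wn

words = ["car", "automobile", "truck", "banana"]

def similarity(a, b):
    """Max path similarity over the noun synsets of each word (0 if none)."""
    sa, sb = wn.synsets(a, pos=wn.NOUN), wn.synsets(b, pos=wn.NOUN)
    scores = [s1.path_similarity(s2) for s1 in sa for s2 in sb]
    scores = [s for s in scores if s is not None]
    return max(scores, default=0.0)

n = len(words)
T = np.array([[similarity(words[i], words[j]) for j in range(n)] for i in range(n)])
print(T)   # "car" and "automobile" share a synset, so their entry is 1.0
```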

What I think we can actually do:

We can first create a counter crawler using BigQuery to size our dataset. Then, after getting the size, we run (simulate) dummy versions of all the above methods on a part of that dataset, estimate each method's likely complexity, and go with the best option. The main job in this proposal is to create the dummy algorithms and to decide how large a part of the dataset to use so that the simulation run is fruitful...
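A minimal sketch of that test run, assuming a placeholder corpus and sample size (the BigQuery counting step is outside the sketch): sample part of the data, then time each candidate reduction method on the sample before committing to one.

```python
# Minimal sketch of the proposed "test run": time candidate reduction methods
# on a random sample of the corpus before committing to one. The placeholder
# corpus, sample size and method list are illustrative assumptions; the BigQuery
# counting step is not shown here.
import random
import time

from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE, LocallyLinearEmbedding

corpus = ["document %d about topic %d" % (i, i % 5) for i in range(500)]  # placeholder corpus

sample = random.sample(corpus, 100)            # "part of that dataset"
X = TfidfVectorizer().fit_transform(sample).toarray()

methods = {
    "PCA": PCA(n_components=2),
    "TruncatedSVD": TruncatedSVD(n_components=2),
    "LLE": LocallyLinearEmbedding(n_components=2, n_neighbors=10),
    "t-SNE": TSNE(n_components=2, perplexity=30, init="random"),
}

for name, model in methods.items():
    start = time.perf_counter()
    model.fit_transform(X)
    print(f"{name:12s} {time.perf_counter() - start:.3f}s on {X.shape[0]}x{X.shape[1]} sample")
```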

So what I am proposing is to use a divide-and-conquer approach: create a test run and use its result to decide the final run.

Reference Reading:

- Dimensionality Reduction for Text using Domain Knowledge → Click Here
- Spaceland Embedding of Sparse Stochastic Graphs → Click Here
- Semantic distance in WordNet → https://drive.google.com/drive/folders/1SR66I4XYKoyDj3gXYyB0YQODBdm4Q3MR
- Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy → https://drive.google.com/drive/folders/1SR66I4XYKoyDj3gXYyB0YQODBdm4Q3MR
- Mao, Yi & Dillon, Joshua & Lebanon, Guy. (2007). Sequential Document Visualization. IEEE Transactions on Visualization and Computer Graphics, 13, 1208-1215. doi:10.1109/TVCG.2007.70592

codecakes commented 4 years ago

Hi @Samsomyajit, I've dabbled with creating tf-idf from n-grams and done some work on prediction using SVM.
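For reference, a minimal sketch of that pipeline (tf-idf over word n-grams feeding a linear SVM); the toy labelled sentences are an illustrative assumption, since the issue itself does not define a supervised task:

```python
# Minimal sketch: tf-idf over word n-grams feeding an SVM classifier.
# The toy labelled data is a placeholder, not part of the project.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = [
    "the plot was gripping and the acting superb",
    "a dull, predictable and poorly edited film",
    "wonderful soundtrack and stunning visuals",
    "boring pacing and a weak script",
]
labels = [1, 0, 1, 0]   # 1 = positive, 0 = negative (toy labels)

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),   # unigrams + bigrams
    LinearSVC(),
)
clf.fit(texts, labels)
print(clf.predict(["stunning visuals but a weak script"]))
```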

What prerequisite reading material would one need to work further on this issue?

Samsomyajit commented 4 years ago

Hi @codecakes
I would like to know what kind of further work you have in mind, and whether you want research papers or articles on tf-idf n-grams or SVMs?

codecakes commented 4 years ago

Hi @Samsomyajit

> We can first create a counter crawler using BigQuery to size our dataset.

What is the counter crawler you are talking about? Is it a word counter? Do you want to create something like TfidfVectorizer?

> Then, after getting the size, we run (simulate) dummy versions of all the above methods on a part of that dataset, estimate each method's likely complexity, and go with the best option.

Can you point me to resources for these algorithms, and to the input data and expected output data?

Samsomyajit commented 4 years ago

@codecakes It is the Label Count algorithm used by web crawlers. It is similar in structure but a bit different: instead of counting web labels or keywords, we would count similar words to build a weight matrix (it was just an idea). We are not actually going forward with that approach...