dselivanov / text2vec

Fast vectorization, topic modeling, distances and GloVe word embeddings in R.
http://text2vec.org

question: adding ids to word vectors #300

Closed diogocamacho closed 5 years ago

diogocamacho commented 5 years ago

hi dmitriy, i have a question on your glove implementation in text2vec that is more of a feature question than anything. i am using glove to build a model for DNA strings of a certain length. i have 244k of these sequences, and the overarching idea is to use glove to learn which features of each sequence, if i break each one of them into n-grams of size L, are more like other n-grams. so, a single seq_id can have multiple n-grams.

when i run the algorithm the way you have it in one of your posts, i get cosine similarities, but the name of each n-gram similar to my query is just the n-gram itself. what i'd like to see would be something like id_n-gram. i tried passing the ids of the sequences in the iterator step, but that did not solve my problem. is there any way i can easily access the sequence ids that correspond to a given n-gram?
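One way to recover the sequence ids for a given n-gram is to keep a separate lookup table next to the embedding model, since the word vectors themselves are keyed by n-gram only. The block below is a minimal sketch in plain Python, not text2vec code; the function names `ngrams` and `build_ngram_index` and the toy sequences are hypothetical and exist only for illustration.

```python
from collections import defaultdict


def ngrams(seq, n):
    """Return all overlapping n-grams of a sequence, in order."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


def build_ngram_index(sequences, n):
    """Map each n-gram to the sorted list of sequence ids that contain it."""
    index = defaultdict(set)
    for seq_id, seq in sequences.items():
        for gram in ngrams(seq, n):
            index[gram].add(seq_id)
    return {gram: sorted(ids) for gram, ids in index.items()}


# Hypothetical toy input: ids and sequences are made up for illustration.
sequences = {
    "seq_1": "ACGTAC",
    "seq_2": "GTACGG",
}

index = build_ngram_index(sequences, n=3)
print(index["GTA"])  # n-gram shared by both sequences
print(index["CGG"])  # n-gram found only in seq_2
```

Because the index is built from the same sequences and the same n-gram length that feed the model, any n-gram returned by a similarity query can be mapped back to its ids without changing how the vectors are trained.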

thanks!

dselivanov commented 5 years ago

Hi. Sorry, I'm not sure I've totally understood the question.

Could you please give an example (not code, but rather a couple of simplified inputs and outputs):

* input sequence
* what you are getting now
* what you would like to get

diogocamacho commented 5 years ago

hi dmitriy,

in thinking more about it i realized it was an ill-formed question. for what i want, i think the obvious answer is to:

1) generate n-grams
2) train glove model
3) query the model with some sequence of interest, using the same vectorizer that i used in building the glove model
4) post factum, look at the occurrences of n-grams with high similarity to the query sequence's n-grams and extract relevant metadata about these.
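Steps 3 and 4 can be sketched as follows, assuming the trained n-gram vectors are already available as a plain mapping. This is plain Python rather than text2vec code: the vectors are made-up two-dimensional stand-ins for real embeddings, and `ngram_to_ids` is a hypothetical lookup from n-gram to sequence ids that would be built separately from the input data.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(query, vectors, top_k=2):
    """Rank every other n-gram by cosine similarity to the query n-gram."""
    scores = [
        (gram, cosine(vectors[query], vec))
        for gram, vec in vectors.items()
        if gram != query
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]


# Made-up vectors standing in for trained embeddings (illustration only).
vectors = {
    "ACG": [1.0, 0.0],
    "CGT": [0.9, 0.1],
    "GTA": [0.0, 1.0],
}

# Hypothetical n-gram -> sequence id lookup, built from the input data.
ngram_to_ids = {
    "ACG": ["seq_1", "seq_2"],
    "CGT": ["seq_1"],
    "GTA": ["seq_1", "seq_2"],
}

# Step 3: query the model. Step 4: attach the ids after the fact.
for gram, score in most_similar("ACG", vectors):
    print(gram, round(score, 3), ngram_to_ids[gram])
```

Keeping the id lookup outside the model means the metadata step works with whatever similarity ranking the model returns.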

i think this will address the question. but, to clarify: the whole idea i had was to use a glove model to make predictions of similarity between gene sequences or genomic sequences. so, you'd generate a model with a large set of genes where you'd establish some n-gram length, and then, for a sequence you haven't seen before, make a prediction on similarity, and, thereby, make some inference into sequence evolution or the like. i think the kind of model that glove produces would be very good for that.
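For scoring a sequence the model has not seen, one simple option is to average the vectors of its known n-grams and compare the averages by cosine similarity. The thread does not say how n-gram vectors should be combined, so the averaging here is an assumption, and the vectors and the helper name `mean_vector` are made up for illustration.

```python
import math


def ngrams(seq, n):
    """Return all overlapping n-grams of a sequence, in order."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


def mean_vector(seq, vectors, n):
    """Average the vectors of the n-grams of seq that the model knows.

    Returns None when no n-gram of the sequence is in the vocabulary.
    Averaging is an assumption of this sketch, not something the thread prescribes.
    """
    known = [vectors[g] for g in ngrams(seq, n) if g in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up vectors standing in for trained embeddings (illustration only).
vectors = {
    "ACG": [1.0, 0.0],
    "CGT": [0.0, 1.0],
    "GTA": [1.0, 1.0],
}

known_seq = mean_vector("ACGT", vectors, n=3)   # averages ACG and CGT
unseen_seq = mean_vector("CGTA", vectors, n=3)  # averages CGT and GTA
print(round(cosine(known_seq, unseen_seq), 3))
```

N-grams missing from the vocabulary are skipped, so a new sequence is scored only on the n-grams it shares with the training data.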

thanks again.

best, d.


Diogo Camacho

“For hypotheses should be employed only in explaining the properties of things, but not assumed in determining them, unless so far as they may furnish experiments." — Renee Descartes
