allenai / allennlp

An open-source NLP research library, built on PyTorch.
http://www.allennlp.org
Apache License 2.0
11.75k stars · 2.25k forks

Tutorial on how to view embeddings in HDF5 data #1735

Closed radiantone closed 6 years ago

radiantone commented 6 years ago

Hi, I am new to AllenNLP. I used the command line tool to create an HDF5 embeddings file from sentences, following the tutorial. But the tutorial stops there, and I'm not sure how to view the actual text results. I see the HDF5 file contains HDF5 datasets, but I want to see the text word embeddings.

Thank you!

schmmd commented 6 years ago

I should add this to our ELMo tutorial: https://github.com/allenai/allennlp/blob/master/tutorials/how_to/elmo.md. You might find it easier to just run ELMo programmatically, such as in IPython.

@radiantone would you find it helpful if the ELMo command supported other output formats? Which would you prefer?

radiantone commented 6 years ago

Yeah, I am trying to use it programmatically. You'll have to forgive my newness to ELMo. I see this segment in the tutorial:

embeddings = elmo(character_ids)

embeddings['elmo_representations'] is a length-two list of tensors.
Each element contains one layer of ELMo representations with shape
(2, 3, 1024):
  2    - the batch size
  3    - the sequence length of the batch
  1024 - the length of each ELMo vector

What I want to do is view the actual text embeddings found, rather than enumerate the numerical vectors. Perhaps the tutorial could add a bit more detail on practical things you can do once you have the embeddings object.

Sorry if this request sounds a bit ignorant.

schmmd commented 6 years ago

I think the following would be easiest:

from allennlp.commands.elmo import ElmoEmbedder
elmo = ElmoEmbedder()
vectors = elmo.embed_sentence(["I", "ate", "an", "apple", "for", "breakfast"])

Now you have vectors[0], vectors[1], and vectors[2], one per ELMo layer. Each layer contains one vector per word, six in total, matching the length of the input sentence.
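If it helps to see the shapes concretely, here is a sketch with a synthetic NumPy array standing in for real ELMo output (embed_sentence on a 6-token sentence returns an array of shape (3, 6, 1024)):

```python
import numpy as np

# Stand-in for the output of elmo.embed_sentence(...) on a 6-token sentence:
# 3 layers, 6 tokens, 1024 dimensions per token.
vectors = np.random.rand(3, 6, 1024)

# vectors[0], vectors[1], vectors[2] are the three ELMo layers.
for layer in range(3):
    print(layer, vectors[layer].shape)  # each layer is (6, 1024)

# The vector for the 4th word ("apple") in the top layer:
apple_vec = vectors[2][3]
print(apple_vec.shape)  # (1024,)
```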

radiantone commented 6 years ago

Yeah. I can get the vectors but I'm curious how to display the human readable words in them.

matt-gardner commented 6 years ago

They are vectors representing the words that you passed in. If you embedded the words, you already have access to them.

radiantone commented 6 years ago

Ok. I guess I'm wanting a concrete example of what one can do with the vectors. The tutorial doesn't show anything after obtaining vectors.

So I get these vectors. What next? What can I do with them? What information do they contain that I don't already have with the raw sentences? I know they are numbers and represent words, but that's not enough to understand what to use them for.

This is probably obvious to career data scientists but for the layman programmer this is not clear in the tutorial.

matt-gardner commented 6 years ago

The fundamental problem of using machine learning on text is deciding how to represent text as features. From the start of statistical NLP until just a few years ago, people wrote feature extractors by hand to represent individual pieces of text. A few years ago we discovered that simple word embeddings were really good feature extractors, and if you used those as raw input to a statistical model of language, instead of hand-written feature extractors, your model would perform much better. ELMo is the next step in feature extraction for text. Instead of getting a single vector for each word in isolation, you run a pre-trained feature extractor on an entire sentence. The resulting vectors are then used as input to some statistical model that tries to predict something about language.
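As a concrete illustration of "using the vectors as features" (with synthetic vectors standing in for real ELMo output), one of the simplest things you can do is compare word vectors by cosine similarity. Because ELMo is contextual, the same word gets different vectors in different sentences, and similarity scores quantify that difference:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Pretend these came from embedding two sentences with ELMo; real ELMo
# vectors are 1024-dimensional, so we use that size here too.
rng = np.random.default_rng(0)
bank_river = rng.standard_normal(1024)                     # "bank" near "river"
bank_money = bank_river + 2.0 * rng.standard_normal(1024)  # "bank" near "money"

print(cosine(bank_river, bank_money))  # somewhere between -1 and 1
print(cosine(bank_river, bank_river))  # identical vectors give 1.0
```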

radiantone commented 6 years ago

Thank you for that explanation. So I can take the embedding output file I created and plug it right into (for example) the demo apps that demonstrate machine comprehension, etc.?

matt-gardner commented 6 years ago

We've designed AllenNLP so that it's easy to use ELMo as a feature extractor for any existing model. You can see how to do that here. Basically, that uses ELMo as a TokenEmbedder, as described in this tutorial.

Arjunsankarlal commented 6 years ago

Hi, I was able to get the vectors using the method specified in this issue. Thanks for that! But my question is: there are three vectors generated for every sentence. What do they signify? Is the first vector given as input to the LSTM to generate the second? If that is the case, is the third vector the right choice to use in embedding applications?

schmmd commented 6 years ago

The publication has more information about the different vectors. Roughly, the lower layers capture more syntactic information and the higher layers capture more context-dependent semantics.
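In practice the layers are usually combined rather than used individually; the ELMo paper mixes them with a softmax-weighted average whose weights are learned by the downstream task. A sketch with synthetic layers (the weight values here are hypothetical, not learned):

```python
import numpy as np

# Synthetic stand-in for the three ELMo layers of a 6-token sentence.
layers = np.random.rand(3, 6, 1024)

# Mix the layers with softmax-normalized weights; a downstream model
# would learn raw_weights and gamma during training.
raw_weights = np.array([0.2, 0.5, 1.0])  # hypothetical learned values
weights = np.exp(raw_weights) / np.exp(raw_weights).sum()
gamma = 1.0                              # task-specific scaling scalar

# Contract the layer axis: (3,) . (3, 6, 1024) -> (6, 1024).
mixed = gamma * np.tensordot(weights, layers, axes=1)
print(mixed.shape)  # (6, 1024)
```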

radiantone commented 6 years ago

Good info showing up in this thread. However, the ticket is requesting that these concepts be added to a simple tutorial that shows the utility of the embeddings in some kind of practical use case that anyone can understand.

matt-gardner commented 6 years ago

Thanks for the feedback @radiantone. We're a very small team, however, and our focus is on people who are actively doing research in natural language processing. That's why you don't see tutorials explaining these things. If you'd like to help out and make our tutorials better, PRs would be very welcome!

schmmd commented 6 years ago

I am actively improving the ELMo tutorial given your feedback.

Contributions are very welcome.

radiantone commented 6 years ago

As soon as I get good at this really cool tech, I'll be happy to submit PRs. I'm just climbing the learning curve still. Thanks for a great set of tools!

schmmd commented 6 years ago

Closing, as we've updated the tutorial and have a specific example of how to read data from HDF5 files: https://github.com/allenai/allennlp/blob/master/tutorials/how_to/elmo.md#writing-contextual-representations-to-disk
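For reference, reading vectors back out of an HDF5 file looks roughly like this with h5py. Dataset keys depend on how the file was written; to keep the sketch self-contained, it first writes a small file with one dataset named "0", in the same spirit as one sentence's (3, sentence_length, 1024) output:

```python
import h5py
import numpy as np

# Write a small HDF5 file: one dataset per sentence.
with h5py.File("elmo_demo.hdf5", "w") as f:
    f.create_dataset("0", data=np.random.rand(3, 6, 1024))

# Read it back: list the dataset keys, then load one into a NumPy array.
with h5py.File("elmo_demo.hdf5", "r") as f:
    print(list(f.keys()))  # dataset names, e.g. ['0']
    sent = f["0"][...]     # read the full dataset into memory
    print(sent.shape)      # (3, 6, 1024)
```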