NTMC-Community / MatchZoo

Facilitating the design, comparison and sharing of deep text matching models.
Apache License 2.0

Possibility to wrap my own data (already transformed into indices) #738

Closed: yifannieudem closed this 5 years ago

yifannieudem commented 5 years ago

Describe the Question

Please provide a clear and concise description of what the question is.

Hi, I have my own datasets and want to use MatchZoo to run some benchmarks. My data is not in text form; it has already been transformed into indices such as [326, 148, 455, 236, 0, 0, 0] according to my own dictionary (term-to-index mapping). I also have a pre-trained embedding built against this dictionary. Is it possible to pack my indexed data (non-text form) directly into a DataPack and feed it to the model? Thanks.

Describe your attempts

You may also provide a Minimal, Complete, and Verifiable example you tried as a workaround, or StackOverflow solution that you have walked through. (e.g. cosmic radiation).

In addition, figure out your MatchZoo version by running import matchzoo; matchzoo.__version__. If this gives you an error, then you're probably using 1.0, and 1.0 is no longer supported. Then attach the corresponding label on the issue.

bwanglzu commented 5 years ago

Hi @yifannieudem (screenshot of an example DataPack attached),

see the data handling tutorial.
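
A minimal sketch of that approach, assuming your sequences are already lists of integer ids; the column names follow the data handling tutorial, and the example values below are made up:

import pandas as pd
import matchzoo as mz

# build a DataFrame with the already-indexed sequences; text_left / text_right
# hold the index lists and label holds the relevance judgement
df = pd.DataFrame({
    'text_left':  [[326, 148, 455, 236, 0, 0, 0]],
    'text_right': [[12, 7, 98, 455, 0, 0, 0]],
    'label':      [1],
})
data_pack = mz.pack(df)   # id_left / id_right are generated if not supplied
print(data_pack.frame())  # inspect the packed data as a DataFrame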

yifannieudem commented 5 years ago

Thanks a lot for your reply. I have another question about how to use my pre-trained embeddings. I have these index sequences to represent my texts, but I also have my own term_to_index mapping and the corresponding pre-trained embedding matrix, where line k holds the embedding vector of the term whose index is k in the term_to_index mapping. I read through the data handling tutorial, but there the term-to-index mapping is built automatically during preprocessing (not my version). I could reassign my term_to_index to the vocab state, but I saw that the embedding reader only accepts the word2vec or GloVe format, i.e. term emb_vector. If I want to use my embedding, which is a mapping index -> emb_vec, can I create a text file as follows (where each line is index embedding_vector)?

38 embedding_vector_for_term_38
16 embedding_vector_for_term_16
...

Thanks

bwanglzu commented 5 years ago

Hi, you mentioned it only accepts word2vec or GloVe: no, as long as your embedding file follows the same format as word2vec or GloVe, it's fine.

Look at the word2vec format; it is exactly what you described above (38 embedding_vector_for_term_38), which means you can do:

import matchzoo as mz

# each line of the file: "<term_or_index> <v1> <v2> ... <vN>" (word2vec-style)
my_embedding_file_dir = ...
my_embedding = mz.embedding.load_from_file(my_embedding_file_dir)

Besides, if you want to freeze the pre-trained embeddings, set embedding_trainable=False.

See more references.
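
For reference, a hedged sketch of wiring a pre-trained embedding into a model in MatchZoo 2.x; vocab_unit / term_index come from a fitted preprocessor, and the model and parameter names below are illustrative and may differ per model:

import matchzoo as mz

# `preprocessor` is assumed to be already fit on your data and
# `my_embedding` loaded as above
term_index = preprocessor.context['vocab_unit'].state['term_index']
embedding_matrix = my_embedding.build_matrix(term_index)

model = mz.models.KNRM()  # any model with an embedding layer
model.params['embedding_input_dim'] = embedding_matrix.shape[0]
model.params['embedding_output_dim'] = embedding_matrix.shape[1]
model.params['embedding_trainable'] = False  # freeze the pre-trained vectors
model.guess_and_fill_missing_params()
model.build()
model.compile()
model.load_embedding_matrix(embedding_matrix)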

yifannieudem commented 5 years ago

Thanks a lot. I have another question. I saw in the examples that the training data preprocessed by BasicPreprocessor not only has "id_left", "id_right", "text_left", "text_right" (which is what your screenshot shows) but also includes "length_left" and "length_right" columns. Are those two length columns necessary for using the packed data directly for training? For the model to use my data directly, should the final processed data have "id_left", "id_right", "text_left", "text_right", "length_left", "length_right", with each list in text_left containing int indices and already padded to my pre-defined max_len? Is this processing right? Thanks.

bwanglzu commented 5 years ago

Hi @yifannieudem, length_left and length_right will be appended to the data pack automatically if you use one of our pre-defined preprocessors, such as BasicPreprocessor.

If you want to do it manually, just call your_data_pack.append_text_length(..).
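
For example, a small sketch, assuming a DataPack built as above (the inplace / verbose flags are assumed from the 2.x API):

# add length_left / length_right columns to an existing DataPack
data_pack.append_text_length(inplace=True, verbose=0)
print(data_pack.frame().columns)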

uduse commented 5 years ago

I hope things are working well for you now. I’ll go ahead and close this issue, but I’m happy to continue further discussion whenever needed.