NTMC-Community / MatchZoo

Facilitating the design, comparison and sharing of deep text matching models.
Apache License 2.0

Possibility to wrap my own data (already transformed into indices) #738

Closed: yifannieudem closed this 5 years ago

yifannieudem commented 5 years ago

Describe the Question

Please provide a clear and concise description of what the question is.

Hi, I have my own datasets and want to use MatchZoo to run some benchmarks. My data is not in text form; it has already been transformed into indices such as [326, 148, 455, 236, 0, 0, 0] according to my own dictionary (term-to-index mapping). I also have a pre-trained embedding built against this dictionary. Is it possible to pack my indexed data (non-text form) directly into a DataPack and feed it to the model? Thanks.

Describe your attempts

You may also provide a Minimal, Complete, and Verifiable example you tried as a workaround, or StackOverflow solution that you have walked through. (e.g. cosmic radiation).

In addition, figure out your MatchZoo version by running import matchzoo; matchzoo.__version__. If this gives you an error, then you're probably using 1.0, and 1.0 is no longer supported. Then attach the corresponding label on the issue.

bwanglzu commented 5 years ago

Hi @yifannieudem (screenshot of an example DataPack attached),

see the data handling tutorial.
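
A minimal sketch of that approach, assuming your sequences are already lists of integer ids; the column names follow the data handling tutorial, and the example values below are made up:

import pandas as pd
import matchzoo as mz

# build a DataFrame with the already-indexed sequences; text_left / text_right
# hold the index lists and label holds the relevance judgement
df = pd.DataFrame({
    'text_left':  [[326, 148, 455, 236, 0, 0, 0]],
    'text_right': [[12, 7, 98, 455, 0, 0, 0]],
    'label':      [1],
})
data_pack = mz.pack(df)   # id_left / id_right are generated if not supplied
print(data_pack.frame())  # inspect the packed data as a DataFrame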

yifannieudem commented 5 years ago

Thanks a lot for your reply. I have another question about how to use my pre-trained embeddings. I have these index sequences to represent my texts, but I also have my own term_to_index mapping and the corresponding pre-trained embedding matrix, where line k holds the embedding vector of the term whose index is k in the term_to_index mapping. I read through the data handling tutorial, but there the term-to-index mapping is built automatically during preprocessing (not my version). I could reassign my term_to_index to the vocab state, but I saw that the embedding reader only accepts the word2vec or GloVe format, i.e. term emb_vector. If I want to use my embedding, which is a mapping index -> emb_vec, can I create a text file as follows (where each line is index embedding_vector)?

38 embedding_vector_for_term_38
16 embedding_vector_for_term_16
...

Thanks

bwanglzu commented 5 years ago

Hi, you mentioned it only accepts word2vec or GloVe: no, as long as your embedding file follows the same format as word2vec or GloVe, it's fine.

Look at the word2vec format; it is exactly what you described above (38 embedding_vector_for_term_38), which means you can do:

import matchzoo as mz

# each line of the file: "<term_or_index> <v1> <v2> ... <vN>" (word2vec-style)
my_embedding_file_dir = ...
my_embedding = mz.embedding.load_from_file(my_embedding_file_dir)

Besides, if you want to freeze the pre-trained embeddings, set embedding_trainable=False.

See more references.
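
For reference, a hedged sketch of wiring a pre-trained embedding into a model in MatchZoo 2.x; vocab_unit / term_index come from a fitted preprocessor, and the model and parameter names below are illustrative and may differ per model:

import matchzoo as mz

# `preprocessor` is assumed to be already fit on your data and
# `my_embedding` loaded as above
term_index = preprocessor.context['vocab_unit'].state['term_index']
embedding_matrix = my_embedding.build_matrix(term_index)

model = mz.models.KNRM()  # any model with an embedding layer
model.params['embedding_input_dim'] = embedding_matrix.shape[0]
model.params['embedding_output_dim'] = embedding_matrix.shape[1]
model.params['embedding_trainable'] = False  # freeze the pre-trained vectors
model.guess_and_fill_missing_params()
model.build()
model.compile()
model.load_embedding_matrix(embedding_matrix)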

yifannieudem commented 5 years ago

Thanks a lot. I have another question. I saw in the examples that the training data preprocessed by BasicPreprocessor not only has "id_left", "id_right", "text_left", "text_right" (which is what your screenshot shows) but also includes "length_left" and "length_right" columns. Are those two length columns necessary for using the packed data directly for training? For the model to use my data directly, should the final processed data have "id_left", "id_right", "text_left", "text_right", "length_left", "length_right", with each list in text_left containing int indices and already padded to my pre-defined max_len? Is this processing right? Thanks.

bwanglzu commented 5 years ago

Hi @yifannieudem, length_left and length_right will be appended to the data pack automatically if you use one of our pre-defined preprocessors, such as BasicPreprocessor.

If you want to do it manually, just call your_data_pack.append_text_length(..).
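
For example, a small sketch, assuming a DataPack built as above (the inplace / verbose flags are assumed from the 2.x API):

# add length_left / length_right columns to an existing DataPack
data_pack.append_text_length(inplace=True, verbose=0)
print(data_pack.frame().columns)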

uduse commented 5 years ago

I hope things are working well for you now. I’ll go ahead and close this issue, but I’m happy to continue further discussion whenever needed.