Using our own dataset - GitHub issues

maral1988 commented 5 years ago

Hi, I am trying to run the code on my own dataset. I saved two arrays, one for texts and one for tags, in an npz file as described in the README, but shared_dataset.py fails at:

with np.load(path) as f:
    brs = f['brs']
    ms = f['ms']
    sfs = f['sfs']

What are brs, ms and sfs? How should we save our arrays to the npz file so that it contains these 3 variables?

Tangworld commented 5 years ago

Hi. Actually, 'brs', 'ms' and 'sfs' in shared_dataset.py are all keys: 'brs' refers to texts, 'sfs' refers to tags and 'ms' refers to masks. 'ms' is not actually used, so you can fill it with arbitrary values, like [1, 0, 0, 0, 1, 1], as long as the length of 'ms' equals the length of 'brs'. You may use:

texts = np.array(["text1", "text2", "text3", "text4"])
hashtags = np.array(["h1", "h2", "h1", "h3"])
masks = np.array([1, 0, 0, 1])
np.savez('test.npz', brs=texts, sfs=hashtags, ms=masks)

Best wishes

maral1988 commented 5 years ago

Well, I saved an example as follows:

brs = np.array(["hi", "hellow", "book", "laptop", "trump", "iran", "war"])
sfs = np.array(["h1", "h2", "h1", "h3", "h3", "h3", "h1"])
ms = [0, 1, 1, 1, 0, 1, 0]
np.savez('test.npz', brs=brs, sfs=sfs, ms=ms)

Now I get this error at shared_dataset.py line 46:

brs = [[w if(w < num_words) else 0 for w in x] for x in brs]
TypeError: unorderable types: str() < int()

Tangworld commented 5 years ago

Well... You have to construct a word dictionary which maps words to ids and a tag dictionary which maps tags to ids, so that brs and sfs are arrays of ids. For example:

Word dictionary:
1: hellow
2: book
3: hi
4: trump
5: laptop
6: iran
7: war

Tag dictionary:
1: h2
2: h1
3: h3

Then:

brs = np.array([3, 1, 2, 5, 4, 6, 7])
sfs = np.array([2, 1, 2, 3, 3, 3, 2])

As the next step, you need to construct shared.txt. shared.txt contains a dictionary which maps each word that appears in both texts and tags. In shared.txt, the key is the id of a word and the value is the id of the tag that is the same as that word.
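A minimal sketch of the word-to-id and tag-to-id step described above. The helper names `build_vocab` and `encode` and the toy data are hypothetical and not part of iTag; ids start at 1 on the assumption that id 0 is reserved, since the loader line quoted in this thread replaces out-of-range words with 0. Which id a word gets is arbitrary as long as the mapping is applied consistently.

```python
def build_vocab(sequences):
    """Map each distinct token to an integer id, starting at 1.

    Id 0 is left unused on the assumption that the loader
    reserves it for out-of-range words.
    """
    vocab = {}
    for seq in sequences:
        for token in seq:
            if token not in vocab:
                vocab[token] = len(vocab) + 1
    return vocab


def encode(sequences, vocab):
    """Replace every token with its id, one list of ids per sample."""
    return [[vocab[token] for token in seq] for seq in sequences]


# Hypothetical toy data: two samples, already tokenised and lower-cased.
texts = [["i", "love", "python"], ["very", "helpful", "python", "library"]]
tags = [["python"], ["python", "library"]]

word2id = build_vocab(texts)  # separate id space for words
tag2id = build_vocab(tags)    # separate id space for tags

brs = encode(texts, word2id)
sfs = encode(tags, tag2id)
print(brs)
print(sfs)
```

Words and tags get independent id spaces here, which is why the same string can have a different id in each dictionary.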

Best wishes

maral1988 commented 5 years ago

That's ok . Thanks

maral1988 commented 5 years ago

I followed the instructions and created shared.txt. Now, after loading the npz file, brs is 1,5,3,7,2,4,6, but there is an error at line 46 again:

brs = [[w if(w < num_words) else 0 for w in x] for x in brs]
TypeError: 'numpy.int32' object is not iterable

I searched Stack Overflow for a fix, but I don't want to edit your code to correct it. Where am I going wrong now?

Tangworld commented 5 years ago

Well... Sorry, I didn't explain it clearly. brs, sfs and ms are all two-dimensional arrays, and the first dimension of each array is the total number of samples. For example, if your dataset contains 3 samples:

brs = np.array([[3, 1, 2, 5, 4, 6, 7], [3, 4, 6, 2, 7, 3, 3], [4, 2, 5, 8, 2, 1, 6]])

sfs and ms have the same layout.

maral1988 commented 5 years ago

There is an error again. Would you please give a very simple example of the data, or complete this example?

text1 = "I love this movie" #h1 #h2
text2 = "Very helpful python library" #h3 #h4 #h5
brs = [[1,2,3,4],[5,6,7,8,9]] # numbers represent word ids
sfs = [[h1,h2],[h3,h4,h5]]
ms = [[0,0],[1.0]] # not important, we can ignore this array

What would shared.txt look like for this? I now have this error in shared_dataset:

sf_lens = [len(sf) for sf in sfs]
TypeError: object of type 'filter' has no len()

Thanks

Tangworld commented 5 years ago

text1 = "I love python" #python
text2 = "Very helpful python library" #python #library

brs = np.array([[1, 2, 3], [4, 5, 3, 6]])
sfs = np.array([[1], [1, 2]])
ms = np.array([[1, 0, 1], [1, 0, 1, 1]])

Then shared.txt is {3: 1}, because 'python' appears in both texts and tags.
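One way to read the shared.txt rule is "for every token that exists in both vocabularies, map its word id to its tag id". The sketch below follows that reading; `build_shared` and both dictionaries are hypothetical names filled in from the example in this reply. Under this reading 'library' (word id 6, tag id 2) would also qualify, which the reply does not list, so treat that entry as an open question rather than a correction. How iTag parses shared.txt is not shown in the thread, so the sketch only builds the dictionary and does not write the file.

```python
def build_shared(word2id, tag2id):
    """Return {word_id: tag_id} for every token present in both vocabularies."""
    return {word2id[token]: tag2id[token] for token in tag2id if token in word2id}


# Ids taken from the example above: words numbered by first appearance,
# tags numbered independently.
word2id = {"i": 1, "love": 2, "python": 3, "very": 4, "helpful": 5, "library": 6}
tag2id = {"python": 1, "library": 2}

shared = build_shared(word2id, tag2id)
print(shared)
```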

maral1988 commented 5 years ago

Thanks. I think I have now created correct data and shared.txt:

brs = [[3, 1, 2, 4],[3,2,1,6],[6,7,1,5]]
sfs = [[1,2],[1,2],[1]]
ms = [[0,1,0,1],[0,1,1,1],[1,1,0,0]]
shared.txt: {3:1,1:2}

Now the code raises this error in shared_dataset:

sf_lens = [len(sf) for sf in sfs]
TypeError: object of type 'filter' has no len()

I think it is related to this line:

sfs = [filter(lambda x: x < num_sfs, sf) for sf in sfs]

Thanks

maral1988 commented 5 years ago

Do you have any idea about this error? Would you please email me a sample of your data so that I can run iTag successfully? Thanks for your help.

Tangworld commented 5 years ago

I used the data you provided and everything is ok. Try this:

import numpy as np
import shared_dataset as sh

def create_data():
    brs = np.array([[3, 1, 2, 4],[3,2,1,6],[6,7,1,5]])
    ms = np.array([[0,1,0,1],[0,1,1,1],[1,1,0,0]])
    sfs = np.array([[1,2],[1,2],[1]])
    np.savez('test.npz', brs=brs, ms=ms, sfs=sfs)

def main():
    (en_train, ms_train, de_train, y_train), (en_test, ms_test, de_test, y_test) = sh.load_data(path='test.npz', num_words=10000, num_sfs=1003)
    print(en_train)
    print(ms_train)
    print(de_train)
    print(y_train)
    print('--------------------------------')
    print(en_test)
    print(ms_test)
    print(de_test)
    print(y_test)

if __name__ == '__main__':
    create_data()
    main()
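On recent NumPy releases, building an array from rows of unequal length (as with `sfs` in the script above) raises a ValueError unless `dtype=object` is given, and `np.load` refuses to read object arrays unless `allow_pickle=True` is passed. The sketch below shows one way to store and reload variable-length rows under those rules. The helper `to_object_array` is hypothetical, and whether `shared_dataset.load_data` would need its `np.load(path)` call adjusted is an assumption, since only that one line of the loader is quoted in this thread.

```python
import os
import tempfile

import numpy as np


def to_object_array(rows):
    """Pack variable-length rows into a 1-D object array, one list per element."""
    arr = np.empty(len(rows), dtype=object)
    for i, row in enumerate(rows):
        arr[i] = list(row)
    return arr


brs = to_object_array([[3, 1, 2, 4], [3, 2, 1, 6], [6, 7, 1, 5]])
ms = to_object_array([[0, 1, 0, 1], [0, 1, 1, 1], [1, 1, 0, 0]])
sfs = to_object_array([[1, 2], [1, 2], [1]])  # rows differ in length

# Write to a temporary directory so the sketch leaves no file behind in the repo.
path = os.path.join(tempfile.mkdtemp(), "test.npz")
np.savez(path, brs=brs, ms=ms, sfs=sfs)

# Object arrays are pickled inside the archive, so loading must opt in.
with np.load(path, allow_pickle=True) as f:
    loaded_brs = f["brs"]
    loaded_sfs = f["sfs"]

print(len(loaded_brs), list(loaded_sfs))
```

Only enable `allow_pickle=True` for files you created yourself, because unpickling untrusted data can execute arbitrary code.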

maral1988 commented 5 years ago

I solved the error. In Python 3.5 the line

sf_lens = [len(sf) for sf in sfs]

should be changed to

sf_lens = [len(list(sf)) for sf in sfs]

because in Python 3 a filter object does not support len(), so it has to be converted to a list first: https://stackoverflow.com/questions/24291604/find-the-length-of-a-filter-object-in-python-3

stxupengyu commented 4 years ago

Yes, filter() behaves differently in Python 3. In Python 2.7 filter() returns a list, while in Python 3 it returns an iterator, which we can convert with list():

# Python 2.7
sfs = [filter(lambda x: x < num_sfs, sf) for sf in sfs]
# Python 3
sfs = [list(filter(lambda x: x < num_sfs, sf)) for sf in sfs]
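The Python 2 versus Python 3 difference discussed in this thread can be checked with a small standalone sketch. The values of `num_sfs` and `sfs` are hypothetical. Converting each filter result to a list at the point where it is created, rather than inside the later `len()` call, also avoids a second pitfall: a Python 3 filter object is a one-shot iterator, so `len(list(sf))` would leave it empty for any code that reads `sfs` afterwards.

```python
num_sfs = 3  # hypothetical cutoff: keep only tag ids below this value
sfs = [[1, 2], [1, 5], [4]]

# Python 3: filter() returns a lazy iterator, which has no len().
lazy = [filter(lambda x: x < num_sfs, sf) for sf in sfs]
try:
    len(lazy[0])
    len_works = True
except TypeError:
    len_works = False

# Materialise each result as a list so len() and repeated iteration both work.
filtered = [list(filter(lambda x: x < num_sfs, sf)) for sf in sfs]
sf_lens = [len(sf) for sf in filtered]

# An equivalent list comprehension that behaves the same on Python 2 and 3.
filtered_alt = [[x for x in sf if x < num_sfs] for sf in sfs]

print(len_works, filtered, sf_lens)
```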

SoftWiser-group / iTag

Using our own dataset #4