ISWC-Reproducibility-Track / Paper_608

Example1 - Embeddings #2

Open angelosalatino opened 4 years ago

angelosalatino commented 4 years ago

Hi guys, I followed @dgarijo's advice and I am testing it using Docker.

However, in this piece of code from Example1 - Embeddings:

embeddings={}
with open('emb.txt', 'r') as f:
    header = next(f)
    for line in f:
        node1, label, embedding=line.split()
        embeddings[node1]=embedding.split(',')

I get:

---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
<ipython-input-8-6872d2bef108> in <module>()
      1 embeddings={}
      2 with open('emb.txt', 'r') as f:
----> 3     header = next(f)
      4     for line in f:
      5         node1, label, embedding=line.split()

StopIteration: 

Do you know why it raises this error?

angelosalatino commented 4 years ago

Just to add further info.

I pulled the Docker image and ran it using docker run -it -p 8888:8888 uscisii2/kgtk:latest /bin/bash -c "jupyter notebook --ip='*' --port=8888 --no-browser"

Then I went into kgtk/examples/ and started running Example1 - Embeddings.

angelosalatino commented 4 years ago

Actually, even before that error I see a warning at this cell:

%%bash
kgtk import_conceptnet --english_only conceptnet-assertions-5.7.0.csv / \
            filter -p " ; /r/Causes,/r/UsedFor,/r/Synonym,/r/DefinedAs,/r/IsA ; " / sort -c 1,2,3 \
            | head -30000 |
            kgtk text_embedding --debug --embedding-projector-metadata-path none \
                    --embedding-projector-metadata-path none \
                    --label-properties "/r/Synonym" \
                    --isa-properties "/r/IsA" \
                    --description-properties "/r/DefinedAs" \
                    --property-value "/r/Causes" "/r/UsedFor" \
                    --has-properties "" \
                    -f kgtk_format \
                    --output-format kgtk_format \
                    --use-cache \
                    --model bert-large-nli-cls-token \
                    > emb.txt  

and I get

Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/exceptions.py", line 42, in __call__
    return_code = func(*args, **kwargs) or 0
  File "/opt/conda/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/cli/text_embedding.py", line 332, in run
    main(**kwargs)
  File "/opt/conda/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/cli/text_embedding.py", line 208, in main
    property_labels_dict=property_labels_dict)
  File "/opt/conda/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/gt/embedding_utils.py", line 405, in read_input
    raise KGTKException("Missing column: {}".format(missing_column))
kgtk.exceptions.KGTKException: Missing column: {'label'}
Missing column: {'label'}

dgarijo commented 4 years ago

Hi @angelosalatino, it took a while, but I have been able to reproduce this problem. We'll look into it and get back to you.

dgarijo commented 4 years ago

@angelosalatino, I have an explanation and a solution:

Explanation of the problem: You are getting this error because the command expects a file with a label column, but the file has a column named relation. The two are equivalent, and the command supported both in a previous version. However, KGTK has been under active development, and some of the commands have changed slightly and been made more consistent. It looks like the embeddings command now only accepts files with a label column. We are adding more and more tests to detect these things, but we may have missed some cases like this one.
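This also accounts for the StopIteration in the notebook. `next(f)` raises StopIteration only when the file has no lines at all, so emb.txt was empty, most likely because the failed text_embedding command wrote nothing to it while the `> emb.txt` redirection still created the file. A minimal sketch of a more defensive reader follows. The function name `load_embeddings` is hypothetical, and the three whitespace-separated fields per row simply mirror what the notebook cell assumes:

```python
def load_embeddings(path):
    """Read rows of the form 'node label v1,v2,...', skipping one header line.

    The layout is an assumption taken from the notebook cell, not a
    statement about the KGTK output format.
    """
    embeddings = {}
    with open(path, "r") as f:
        # next(f, None) returns None on an empty file instead of raising
        # a bare StopIteration.
        header = next(f, None)
        if header is None:
            raise ValueError(
                f"{path} is empty; check whether the command that wrote it failed"
            )
        for line in f:
            parts = line.split()
            if len(parts) != 3:
                continue  # skip blank or malformed rows
            node1, _label, embedding = parts
            # Values stay as strings, exactly as in the notebook cell.
            embeddings[node1] = embedding.split(",")
    return embeddings
```

Calling `load_embeddings('emb.txt')` on the empty file then fails with a message that points back at the embedding step rather than at the reader.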

Solution: The easy solution is to split the command into two:

kgtk import_conceptnet --english_only conceptnet-assertions-5.7.0.csv / \
            filter -p " ; /r/Causes,/r/UsedFor,/r/Synonym,/r/DefinedAs,/r/IsA ; " / sort -c 1,2,3 \
            | head -30000 > heads.kgtk

Then replace relation with label in the first row:

sed -i '1!b;s/relation/label/' heads.kgtk
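If sed is not available (for example, when working only inside the notebook), the same header fix can be sketched in Python. This is an illustration rather than part of KGTK: the helper name `rename_header_column` is hypothetical, and it assumes heads.kgtk is tab-separated. Unlike the sed command, which substitutes the first occurrence of the text "relation" in line 1, it renames only a column whose name is exactly `relation`:

```python
def rename_header_column(path, old="relation", new="label", sep="\t"):
    """Rename one column in the first line of a delimited file, in place."""
    with open(path, "r", encoding="utf-8") as f:
        lines = f.readlines()
    if not lines:
        raise ValueError(f"{path} is empty")

    # Only the header row is modified; data rows are written back untouched.
    header = lines[0].rstrip("\n").split(sep)
    if old not in header:
        raise ValueError(f"column {old!r} not found in header: {header}")
    header[header.index(old)] = new
    lines[0] = sep.join(header) + "\n"

    with open(path, "w", encoding="utf-8") as f:
        f.writelines(lines)
```

Usage would be `rename_header_column('heads.kgtk')`. It reads the whole file into memory, which should be fine for the 30000 lines kept by `head -30000`.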

And then calculate the embedding:

kgtk text-embedding -i heads.kgtk --debug --embedding-projector-metadata-path none --embedding-projector-metadata-path none --label-properties "/r/Synonym" --isa-properties "/r/IsA" --description-properties "/r/DefinedAs" --property-value "/r/Causes" "/r/UsedFor" --has-properties "" -f kgtk_format --output-format kgtk_format --use-cache --model bert-large-nli-cls-token > emb.txt

If you want to save the time of rerunning the first command plus sed, I have done it in the attached file (I had to rename it from heads.kgtk to heads.txt because GitHub didn't like the extension).

Note that running the embedding may take a long time.

In the meantime, I have opened issue https://github.com/usc-isi-i2/kgtk/issues/164. We will be fixing it in the next few days.

heads.txt https://github.com/ISWC-Reproducibility-Track/Paper_608/files/5389172/heads.txt

pminervini commented 4 years ago

Using a GPU Colab notebook for running bert-large may do the trick

angelosalatino commented 4 years ago

This solution did the trick. Thank you @dgarijo