ISWC-Reproducibility-Track / Paper_608


Example 4 and 6 #5

Open angelosalatino opened 3 years ago

angelosalatino commented 3 years ago

Hi @dgarijo, all, I am having issues running Examples 4 and 6.

At the beginning of both examples we find the definition of two environment variables:

%env MY=/Users/pedroszekely/data/wikidata-20200504
%env WD=/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20200504

These variables are used in the next cell:

!time gzcat "$WD/wikidata_edges_20200504.tsv.gz" \
   | kgtk  ifexists --filter-on "$WD/wikidata_edges_20200504.tsv.gz" --input-keys node2 --filter-keys node1 \
   | gzip > "$MY/wikidata-item-edges.tsv.gz"

So I changed MY and WD to match the location of wikidata_edges_20200504.tsv.gz in the Docker container I am running, setting them to:

%env MY=/kgtk/examples/
%env WD=/kgtk/examples/

because the complete filepath is /kgtk/examples/wikidata_edges_20200504.tsv.gz (where I mounted it).

The problem is that when I run the cell:

!time gzcat "$WD/wikidata_edges_20200504.tsv.gz" \
   | kgtk  ifexists --filter-on "$WD/wikidata_edges_20200504.tsv.gz" --input-keys node2 --filter-keys node1 \
   | gzip > "$MY/wikidata-item-edges.tsv.gz"

I get:

/bin/sh: 1: time: not found
No header line in file

Since this is a bash command, I opened another terminal with docker exec -it <container-id> bash and launched the command there (including the env variables). There the time command is available, but then I get:

bash: gzcat: command not found
No header line in file

I tried to install gzcat using apt install gzcat but then I get:

Reading package lists... Done
Building dependency tree       
Reading state information... Done
E: Unable to locate package gzcat

I am not sure how to proceed.

Thank you.

dgarijo commented 3 years ago

Hmm in Debian I see it's not called gzcat, but zcat. Can you please give it a try?
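
If the cell has to run on both macOS and Debian, one option is to sidestep the name difference by calling gzip -dc, which decompresses to stdout wherever gzip is installed. A minimal sketch; zcat_portable is a hypothetical helper name, not part of KGTK, and the sample file is made up:

```shell
# Hypothetical helper: decompress .gz files to stdout without relying on
# gzcat (macOS) or zcat (Debian) being present under that name.
zcat_portable() {
    gzip -dc "$@"
}

# Self-contained demo on a throwaway file (not the Wikidata dump).
tmpdir=$(mktemp -d)
printf 'node1\tlabel\tnode2\nQ1\tP31\tQ2\n' | gzip > "$tmpdir/sample.tsv.gz"

# Show only the header line of the decompressed stream.
zcat_portable "$tmpdir/sample.tsv.gz" | head -n 1
```

In the notebook cell, gzcat "$WD/wikidata_edges_20200504.tsv.gz" would then become gzip -dc "$WD/wikidata_edges_20200504.tsv.gz".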

dgarijo commented 3 years ago

In Notebook 6 there are no gzcat commands. What seems to be the issue there?

angelosalatino commented 3 years ago

> Hmm in Debian I see it's not called gzcat, but zcat. Can you please give it a try?

Hi Daniel, yes you are right. zcat is there. I am running it now. It is taking a while. I will see what happens and let you know.

angelosalatino commented 3 years ago

> In Notebook 6 there are no gzcat commands. What seems to be the issue there?

You are right, sorry, my bad. The problem with Example 6 is its dependencies, as in Example 4.

In this notebook we can find:

%env WD18=/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20181210
%env WD18temp=/Users/pedroszekely/Downloads
#%env TN=sample_data/tables
%env TN=/Users/pedroszekely/Downloads/tn
%env R=/Users/pedroszekely/Downloads/tn/results
%env WT=wt.10000

I changed these lines to:

%env WD18=/kgtk/examples
%env WD18temp=/kgtk/examples
#%env TN=sample_data/tables
%env TN=/kgtk/examples/sample_data/tables
%env R=/kgtk/examples/sample_data/tables/results
%env WT=wt.10000

This is because my wikidata file is in /kgtk/examples/

However, I cannot find the file wt.10000 (a sample of 10,000 from the N-Triples files available at https://github.com/bfetahu/wiki_tables_kg/). Where can I fetch this file?

dgarijo commented 3 years ago

Oh my, this link should have been made public!

I apologize. It's here: https://drive.google.com/file/d/1gXYFqyqPtjvfYvFjHl53sKMF0Y489JHz/view?usp=sharing

Thanks for pushing forward all notebooks. I personally tested most of them, but it looks like it was not enough. I am opening issues about all these things so we get them fixed (if not already fixed).

angelosalatino commented 3 years ago

Hi @dgarijo, it seems I am stuck again in Example 4. In the 12th cell I find !gzcat "$WD/wikidata-pagerank-only-sorted.tsv.gz" | head

but I cannot find the file wikidata-pagerank-only-sorted.tsv.gz. Is this file generated by the example, and I missed the step that generates it, or do I need to download it from somewhere?

Thank you in advance, Angelo

dgarijo commented 3 years ago

That's another link which I forgot to share. You can find it here: https://drive.google.com/file/d/1m4x3Wpl8armvao6RCWlNyVT_IfpdHhHJ/view?usp=sharing

Apologies.

angelosalatino commented 3 years ago

Hi @dgarijo, perfect. I am also missing wikidata_labels.tsv

Another note: in one of the following cells in Example 4 there is this command: kgtk rename_col "$WD/wikidata-pagerank-only-sorted.tsv.gz" --mode NONE --output-columns node1 label node2 id | gzip > $MY/wikidata-pagerank-only-sorted.tsv.gz When I run it, I get:

usage: kgtk [options] command [ / command]*
kgtk: error: argument command: invalid choice: 'rename-col' (choose from 'add-id', 'calc', 'cat', 'clean-data', 'compact', 'connected-components', 'expand', 'explode', 'denormalize_node2', 'export-gt', 'export-neo4j', 'export-wikidata', 'filter', 'generate-mediawiki-jsons', 'generate-wikidata-triples', 'graph-statistics', 'ifempty', 'ifexists', 'ifnotempty', 'ifnotexists', 'implode', 'import-atomic', 'import-concept-pairs', 'import-conceptnet', 'import-framenet', 'import-ntriples', 'import-visualgenome', 'import-wikidata', 'import-wordnet', 'join', 'lift', 'lower', 'md', 'normalize-nodes', 'paths', 'reachable-nodes', 'remove-columns', 'rename-columns', 'reorder-columns', 'sort', 'sort2', 'text-embedding', 'unique', 'unreify-rdf-statements', 'unreify-values', 'validate-properties', 'validate', 'zconcat')

I therefore replaced 'rename-col' with 'rename-columns', using the following command:

!kgtk rename-columns "$WD/wikidata-pagerank-only-sorted.tsv.gz" --mode NONE --output-columns node1 label node2 id | gzip > $MY/wikidata-pagerank-only-sorted.tsv.gz

Does it make sense? Thanks

dgarijo commented 3 years ago

@angelosalatino, yes, rename_columns is the current name of the command: https://kgtk.readthedocs.io/en/latest/transform/rename_columns/ ; it was changed in recent releases.

I think this is the missing file: https://drive.google.com/file/d/1lihRjbAuwGEDAmz_dvH8kKidvMvd1q7D/view?usp=sharing (it is compressed)

angelosalatino commented 3 years ago

Now I am stuck at this command: !time kgtk cat "$MY/wikidata_labels_etc.tsv" $MY/pagerank.tsv | gzip > $MY/pagerank-and-labels.tsv.gz

I cannot find pagerank.tsv

Do I need to download it too?

dgarijo commented 3 years ago

I think it may be the extracted file from $MY/wikidata-pagerank-only-sorted.tsv.gz

However, I am confirming it with our team.

I think notebooks 4 and 6 have to be revised and updated to the newest version. I opened issues https://github.com/usc-isi-i2/kgtk/issues/169 and https://github.com/usc-isi-i2/kgtk/issues/167 to reflect all these interactions we are having (thanks so much btw).

dgarijo commented 3 years ago

@angelosalatino I have confirmed it. Please try extracting the tsv from wikidata-pagerank-only-sorted.tsv.gz and using that.
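
The extraction step could look like the sketch below. It writes the decompressed copy next to the archive and keeps the .gz in place. The archive here is a one-line stand-in and MY points at a temporary directory, both for illustration only, not the notebook's actual data or settings.

```shell
# Stand-in for the real archive, for illustration only.
MY=$(mktemp -d)
printf 'node1\tproperty\tnode2\tid\n' | gzip > "$MY/wikidata-pagerank-only-sorted.tsv.gz"

# Decompress to a sibling .tsv while leaving the .gz untouched
# (a plain `gunzip file.gz` would remove the archive).
gzip -dc "$MY/wikidata-pagerank-only-sorted.tsv.gz" > "$MY/wikidata-pagerank-only-sorted.tsv"

ls "$MY"
```

Keeping the .gz around means the earlier cells that read the compressed file still work.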

angelosalatino commented 3 years ago

Hi @dgarijo, I gunzipped $MY/wikidata-pagerank-only-sorted.tsv.gz and obtained $MY/wikidata-pagerank-only-sorted.tsv

I updated the command to !time kgtk cat "$MY/wikidata_labels_etc.tsv" $MY/wikidata-pagerank-only-sorted.tsv | gzip > $MY/pagerank-and-labels.tsv.gz

But as a result I get:

In input 2 header 'node1    property    node2   id': Missing required column: label | predicate | relation | relationship
Exit requested

real    0m1.477s
user    0m1.381s
sys 0m0.100s

Not sure how to proceed here.

dgarijo commented 3 years ago

@angelosalatino, this looks similar to an error we faced before. It's expecting label, but the file still has property. Can you run a sed like the one we used in the other issue to replace property with label in the header?

sed -i '1!b;s/property/label/' wikidata-pagerank-only-sorted.tsv
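
To check the substitution before touching the real file, the header-only replacement of property with label can be tried on a small stand-in first. A sketch, assuming GNU sed (as in a Debian-based container); the sample row is made up for illustration:

```shell
# Made-up two-line file whose header matches the one in the error message.
tmp=$(mktemp)
printf 'node1\tproperty\tnode2\tid\nQ1\tproperty\t0.5\tE1\n' > "$tmp"

# '1!b' skips every line except the first, so only the header is rewritten.
sed -i '1!b;s/property/label/' "$tmp"

head -n 2 "$tmp"
```

The data row keeps its own "property" value, which is the reason for restricting the substitution to line 1.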