amritasaha1812 / CSQA_Code


Code for pre-processing wikidata json dump? #6

Closed: sanyam5 closed this issue 5 years ago

sanyam5 commented 6 years ago

Could you please also share the code that you use for pre-processing the wikidata json dump? This would be an enormous help. Thanks!

vardaan123 commented 6 years ago

The relevant code involves multiple stages of entity and relation filtering, and we have had dozens of scripts to do that. Unfortunately, we can't share it as it is not properly documented, and the original dump is not required to work with the dataset in any way. You could take a look at https://www.mediawiki.org/wiki/Wikibase/DataModel/JSON, read the original dump line by line, and extract just the property info.
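
As a rough starting point (a sketch only, not the filtering pipeline used for the dataset), something along the following lines would stream the dump and pull out the item-to-item triples; it assumes the standard dump layout of one entity object per line inside a top-level JSON array:

import json

def iter_entities(dump_path):
    # The wikidata json dump is a single JSON array with one entity object
    # per line; strip the brackets and trailing commas so each line parses on its own.
    with open(dump_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip().rstrip(",")
            if not line or line in ("[", "]"):
                continue
            yield json.loads(line)

def iter_item_triples(dump_path):
    # Yield (subject_id, property_id, object_id) for claims whose value is
    # another wikidata item; other datatypes (strings, dates, quantities, ...) are skipped.
    for ent in iter_entities(dump_path):
        subj = ent.get("id")
        for pid, statements in ent.get("claims", {}).items():
            for st in statements:
                snak = st.get("mainsnak", {})
                if snak.get("snaktype") != "value":
                    continue
                dv = snak.get("datavalue", {})
                if dv.get("type") != "wikibase-entityid":
                    continue
                val = dv["value"]
                if val.get("entity-type", "item") != "item":
                    continue
                # newer dumps carry the full "id"; very old ones only "numeric-id"
                obj = val.get("id") or "Q%s" % val["numeric-id"]
                yield subj, pid, obj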

sanyam5 commented 6 years ago

Hey @vardaan123, I can totally understand. The reason I wanted the pre-processing script is that I am having trouble understanding the pre-processed files. Specifically:

1) I could not find wikidata_short_1.json and wikidata_short_2.json (mentioned on the download page) after extracting the zip file. I believe these are the files that contain the actual KB triples? I did, however, find comp_wikidata_rev.json, which I believe contains the triples in reverse order. I could possibly recover the forward triples by reversing the triples from comp_wikidata_rev.json, but I was wondering whether this was done on purpose?

2) In your paper, you mention shortlisting 330 relations. However, on parsing comp_wikidata_rev.json I found 357 unique relation ids. What explains this difference?

3) From my understanding, wikidata_type_dict.json contains the projection of the actual triples onto entity types. But I found just 335 unique relation ids, which is fewer than the 357 of comp_wikidata_rev.json (see point 2). What is the reason for this?

4) Between items_wikidata_n.json and child_par_dict_name_2_corr.json I see that some labels have been modified: for example, "administrative territorial entity of Cyprus" became "administrative territory of Cyprus", "administrative territorial entity of the United States" became "US administrative territory", and "types of tennis match" became "tennis match". What is the reason for this?

5) Is the type of an entity an explicit attribute in the Wikidata dump, or is it derived from the hierarchy? (Sorry, I am not very familiar with the Wikidata schema.) I ask because I did not understand the following statement in your paper:

Similarly, of the 30.8K unique entity types in wikidata, we selected 642 types (considering only immediate parents of entities) which appeared in the top 90 percentile of the tuples associated with atleast one of the retained meaningful relations

By "considering only immediate parents of entities", do you mean that the number of distinct types of parent entities is 642, or do you mean there is some connection between the type of an entity and its parent? If it's the latter, what is that connection?

6) Going by wikidata_type_dict.json, it seems to have 2495 entity types instead of the 642 mentioned in the paper.

It would be great if you could help me understand the reasons for the above. I have a few more questions, but I am hoping they will be cleared up once I understand these points.

Thanks,

vardaan123 commented 6 years ago

Hey, the zip file contains only the dialogs. The wikidata jsons are shared in a separate Google Drive folder (https://drive.google.com/drive/folders/1ITcgvp4vZo1Wlb66d_SnHvVmLKIqqYbR?usp=sharing), which is linked on the website. All the required wikidata jsons are in that directory.

vardaan123 commented 6 years ago

  1. To extract the forward triples, you just need to concatenate wikidata_short_1.json and wikidata_short_2.json (see the sketch after this list).
  2. The relations/entities in some of these jsons may be a superset of what is actually used while instantiating the templates. It is kind of troublesome to update the jsons every time we discard some relations.
  3. See point 2.
  4. We reduce the verbosity of some entity names based on feedback received from a set of researchers who tried to use this dataset.
  5. The type info is not explicitly encoded in wikidata. We consider 642 entity types because they cover the 90th percentile of the tuples. There are properties like "instance_of" through which you can get an idea of the type.
  6. See above. Also, it would be much better if you could send a consolidated email after studying the code/dataset in detail. We don't like to disappoint people, but we have limited bandwidth for answering queries.
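
To spell out point 1: something like the following would merge the two files and enumerate the forward triples. This is only a sketch; it assumes (check a couple of entries to confirm) that both jsons are dictionaries of the form {subject_qid: {relation_pid: [object_qid, ...]}} and that their subject ids do not overlap.

import json

def load_forward_kb(path_1="wikidata_short_1.json", path_2="wikidata_short_2.json"):
    # Merge the two halves of the forward KB; a plain dict update suffices
    # if the subject ids in the two files do not overlap (assumed here).
    with open(path_1) as f:
        kb = json.load(f)
    with open(path_2) as f:
        kb.update(json.load(f))
    return kb

def iter_triples(kb):
    # Flatten the nested dict into (subject, relation, object) triples.
    for subj, rel_map in kb.items():
        for rel, objs in rel_map.items():
            for obj in objs:
                yield subj, rel, obj
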
sanyam5 commented 6 years ago

@vardaan123, thanks for the answers! I apologise if you felt I was taking up too much of your time.

From a reproducibility standpoint, it is very difficult to reproduce results without access to the exact data they were produced on. Could you please provide the original dataset on which the results in the paper were obtained, or point to a paper with updated results on the updated dataset?

Regarding wikidata_short_1.json and wikidata_short_2.json: I used the same link to download, but it seems Google Drive mangles these files when it builds the zip of the whole directory. I'll download each file manually.

Thanks,

vardaan123 commented 6 years ago

You could use a script like this to download Google Drive files:

import requests

def download_file_from_google_drive(id, destination):
    def get_confirm_token(response):
        for key, value in response.cookies.items():
            if key.startswith('download_warning'):
                return value

        return None

    def save_response_content(response, destination):
        CHUNK_SIZE = 32768

        with open(destination, "wb") as f:
            for chunk in response.iter_content(CHUNK_SIZE):
                if chunk: # filter out keep-alive new chunks
                    f.write(chunk)

    URL = "https://docs.google.com/uc?export=download"

    session = requests.Session()

    response = session.get(URL, params = { 'id' : id }, stream = True)
    token = get_confirm_token(response)

    if token:
        params = { 'id' : id, 'confirm' : token }
        response = session.get(URL, params = params, stream = True)

    save_response_content(response, destination)    

if __name__ == "__main__":
    import sys
    if len(sys.argv) != 3:
        print("Usage: python google_drive.py drive_file_id destination_file_path")
    else:
        # TAKE ID FROM SHAREABLE LINK
        file_id = sys.argv[1]
        # DESTINATION FILE ON YOUR DISK
        destination = sys.argv[2]
        download_file_from_google_drive(file_id, destination)
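
For what it's worth, the drive_file_id argument is the long token in a file's shareable link (the part after id= or between /d/ and /view), and destination_file_path is wherever you want the file saved locally, e.g. python google_drive.py <file_id> wikidata_short_1.json.
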
vardaan123 commented 6 years ago

We will get back to you regarding the other questions soon.

vardaan123 commented 6 years ago

@sanyam5 The paper is due to be updated on arXiv with the latest dataset figures. Please stay tuned.