[Open] v-miazhang opened 1 year ago
Hi, thanks for mentioning it.
The .gz files we used for the database are sourced from the data dumps of the OpenAlex dataset; specifically, the files that correspond to works (i.e. publications) in OpenAlex.
You can find a guide to retrieving the dump files in the documentation of OpenAlex here.
The files are located in their AWS S3 bucket and can be downloaded free of charge (browse the bucket).
Again, note that the files related to other entity types (authors, concepts etc.) do not have to be downloaded to reproduce our data set.
Only the ones in the works/ directory are relevant.
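To illustrate which files matter, here is a small sketch (the key layout is an assumption based on the snapshot structure, not the project's actual code) that keeps only the works part files and skips the other entity types:

```python
# Sketch (assumed key layout, e.g. "data/works/updated_date=2023-04-27/part_000.gz"):
# keep only .gz part files under the works/ entity directory.
def is_works_file(key: str) -> bool:
    """True for .gz part files belonging to the works entity type."""
    return "/works/" in key and key.endswith(".gz")

keys = [
    "data/works/updated_date=2023-04-27/part_000.gz",
    "data/authors/updated_date=2023-04-27/part_000.gz",
    "data/concepts/updated_date=2023-05-02/part_001.gz",
    "data/works/updated_date=2023-05-02/part_001.gz",
]
works_keys = [k for k in keys if is_works_file(k)]
print(len(works_keys))  # 2: the authors and concepts files are skipped
```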
Hope this helps!
Thanks! I also see many update dates in the bucket, e.g., 2023-04-27 and 2023-05-02. What are the differences between those directories? Do they have overlaps?
There are no overlaps between the folders with differing dates. Each date refers to the time the works (i.e. publications) in that particular folder were last updated. That means when starting from scratch, you need to consider all folders to obtain a full snapshot. More info in the OpenAlex docs.
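In other words, a full works snapshot is the union of all the dated folders. A minimal sketch (assuming a local mirror of the snapshot's updated_date=* layout, fabricated here for illustration):

```python
# Sketch (assumed local layout mirroring the S3 snapshot): each work lives in
# exactly one updated_date=* folder, so a full snapshot is the union of them all.
import tempfile
from pathlib import Path

def all_works_parts(works_root):
    """Return every .gz part file across all updated_date folders, sorted."""
    return sorted(Path(works_root).glob("updated_date=*/*.gz"))

# Demo with a fabricated mini-layout:
root = Path(tempfile.mkdtemp())
for date, n_parts in [("2023-04-27", 2), ("2023-05-02", 1)]:
    folder = root / f"updated_date={date}"
    folder.mkdir()
    for i in range(n_parts):
        (folder / f"part_{i:03d}.gz").touch()

parts = all_works_parts(root)
print(len(parts))  # 3: both dated folders contribute, with no overlap
```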
I am trying to reproduce the dataset. Following the instructions in /src, I have finished step 1. Step 2, which calls generate_openalex_db.py, has this on line 87:
```python
input_dir_openalex_works_files = r'/opt/unarXive_2022/openalex/openalex-works-2022-11-28/*'
```
Where can I get those .gz dumps? Are some steps missing between 1 and 2? Thanks!
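For context, that variable is just a glob pattern over the downloaded .gz dump files; a minimal sketch of how step 2 would pick them up (the directory is the one hard-coded in the script and will differ on your machine):

```python
# Sketch: expand the glob from line 87 to enumerate the works dump files.
# The path below is the one from generate_openalex_db.py (an assumption that
# it will not exist on your machine until the dump has been downloaded there).
import glob
import os

openalex_works_dir = "/opt/unarXive_2022/openalex/openalex-works-2022-11-28"
input_dir_openalex_works_files = os.path.join(openalex_works_dir, "*")
gz_files = sorted(glob.glob(input_dir_openalex_works_files))
print(f"found {len(gz_files)} dump files")  # 0 until the dump is in place
```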