IllDepence / unarXive

A data set based on all arXiv publications, pre-processed for NLP, including structured full-text and citation network
MIT License

How can I get OpenAlex dump files? #14

Open v-miazhang opened 1 year ago

v-miazhang commented 1 year ago

I am trying to reproduce the dataset. Following the instructions in /src, I have finished step 1. Step 2 calls generate_openalex_db.py, which sets the following in line 87: input_dir_openalex_works_files = r'/opt/unarXive_2022/openalex/openalex-works-2022-11-28/*' Where can I get those .gz dumps? Did you miss some steps between 1 and 2? Thanks!

johankit commented 1 year ago

Hi, thanks for mentioning it.

The .gz files we used for the database come from the data dumps of the OpenAlex data set, specifically the files that correspond to the works (i.e. publications) in OpenAlex.

You can find a guide to retrieving the dump files in the documentation of OpenAlex here. The files are located in their AWS S3 bucket and can be downloaded free of charge (browse the bucket). Again, note that the files related to other entity types (authors, concepts etc.) do not have to be downloaded to reproduce our data set. Only the ones in the works/ directory are relevant.
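As a rough illustration of fetching only the works files, here is a minimal Python sketch that wraps the AWS CLI. The bucket path s3://openalex/data/works and the helper names build_sync_command and download_works are assumptions for illustration and are not part of the unarXive code; please check the path against the OpenAlex documentation before running anything. It also assumes the AWS CLI is installed.

```python
import subprocess
from pathlib import Path

# Assumption: bucket name and prefix as described in the OpenAlex snapshot
# documentation; verify before use.
OPENALEX_WORKS_URI = "s3://openalex/data/works"


def build_sync_command(dest_dir, source_uri=OPENALEX_WORKS_URI):
    """Return the AWS CLI call that mirrors only the works/ prefix."""
    return [
        "aws", "s3", "sync",
        source_uri,
        str(Path(dest_dir)),
        "--no-sign-request",  # public bucket, so no AWS credentials are needed
    ]


def download_works(dest_dir):
    """Run the sync. Requires the AWS CLI on PATH; the works files are large."""
    Path(dest_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_sync_command(dest_dir), check=True)
```

Calling download_works("openalex-works") would start the download; the resulting directory is what input_dir_openalex_works_files in generate_openalex_db.py would then need to point at.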

Hope this helps!

zhangmiaosen2000 commented 1 year ago

Thanks! I also see many update dates in the bucket, e.g., 2023-04-27 and 2023-05-02. What are the differences between those directories? Do they overlap?

johankit commented 1 year ago

There are no overlaps between the folders with differing dates. Each date refers to the time the works (i.e. publications) in that particular folder were last updated. That means when starting from scratch, you'd need to consider all folders to obtain a full snapshot. More information is available in the OpenAlex docs.
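To make "consider all folders" concrete, here is a small sketch of reading every part file across all dated folders. It assumes the folders are named like updated_date=2023-04-27 and that each .gz file holds one JSON object per line; both are assumptions about the dump layout, and the function names list_works_files and iter_works are hypothetical. It does not show how generate_openalex_db.py itself reads the files.

```python
import gzip
import json
from pathlib import Path


def list_works_files(works_dir):
    """Return every .gz part file from every dated sub-folder, sorted by path."""
    # Assumption: sub-folders are named like updated_date=2023-04-27
    return sorted(Path(works_dir).glob("updated_date=*/*.gz"))


def iter_works(works_dir):
    """Yield one dict per work, across all dated folders."""
    for path in list_works_files(works_dir):
        # Assumption: each file is gzip-compressed JSON Lines
        with gzip.open(path, "rt", encoding="utf-8") as handle:
            for line in handle:
                if line.strip():
                    yield json.loads(line)
```

Since the folders do not overlap, iterating over all of them this way should visit each work once without any extra deduplication.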