I think this brick could be completed in 2 stages:
01_get_open_access_pdfs.py
deps: none
outs: brick/open_alex_open_access_pdfs.parquet
Script that pulls all open-access PDF URLs from OpenAlex. It needs to be rerunnable incrementally: instead of redoing all the work on every run, it should look for new records only, e.g. by querying OpenAlex with a publication-date filter based on the greatest publication date already present in the output parquet.
02_download_pdfs.py
deps: brick/open_alex_open_access_pdfs.parquet
outs:
brick/open_access_pdfs.pdf/*
brick/open_access_pdfs.parquet
Script that reads the URLs from stage 01 and downloads each PDF into the open_access_pdfs directory. It should also record metadata in open_access_pdfs.parquet (e.g. linking each download URL to the path of its downloaded PDF). PDFs should be saved under a filename derived from a content hash of the file, so identical PDFs deduplicate and filenames are stable across runs. In the future, this stage may also depend on other stages that find open-access PDFs by other methods.
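A sketch of the download stage under the same assumptions (the column names `pdf_url` / `pdf_path` and the error handling are illustrative choices, not fixed by the plan):

```python
import hashlib
from pathlib import Path

import pandas as pd
import requests

PDF_DIR = Path("brick/open_access_pdfs.pdf")
META_PATH = Path("brick/open_access_pdfs.parquet")


def content_hash_name(data: bytes) -> str:
    # Content-addressed filename: identical PDF bytes always map to
    # the same file, so re-downloads dedupe for free.
    return hashlib.sha256(data).hexdigest() + ".pdf"


def download_all(urls_df: pd.DataFrame) -> pd.DataFrame:
    PDF_DIR.mkdir(parents=True, exist_ok=True)
    rows = []
    for url in urls_df["pdf_url"]:
        try:
            resp = requests.get(url, timeout=60)
            resp.raise_for_status()
        except requests.RequestException:
            continue  # skip dead links; a later rerun can retry them
        path = PDF_DIR / content_hash_name(resp.content)
        path.write_bytes(resp.content)
        # Metadata row linking the source URL to the stored file.
        rows.append({"pdf_url": url, "pdf_path": str(path)})
    meta = pd.DataFrame(rows)
    meta.to_parquet(META_PATH)
    return meta
```

Hashing the full response body means the file is named only after a successful download, which keeps partial downloads out of the directory.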