biobricks-ai / OpenStemDocs


brick directions #4

Open tomlue opened 2 weeks ago

tomlue commented 2 weeks ago

I think this brick could be completed in 2 stages:

01_get_open_access_pdfs.py
deps: none
outs: brick/open_alex_open_access_pdfs.parquet
Script that pulls all open access PDF URLs from OpenAlex. It needs to be incremental, so that it can be rerun to pick up new works without redoing all of the earlier work. You could query OpenAlex with a publication date filter that starts from the greatest publication date found in the already downloaded PDFs.
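A minimal sketch of the incremental query, not a finished stage 01. It assumes the previous run's publication dates are available as ISO strings (`YYYY-MM-DD`) and that the OpenAlex works endpoint is queried with the `is_oa` and `from_publication_date` filters and cursor paging. The helper names `latest_publication_date` and `build_works_url` are made up for illustration.

```python
from datetime import date
from typing import Iterable, Optional
from urllib.parse import urlencode

OPENALEX_WORKS_URL = "https://api.openalex.org/works"


def latest_publication_date(dates: Iterable[Optional[str]]) -> Optional[str]:
    """Return the greatest ISO date already stored, or None if there is none."""
    parsed = [date.fromisoformat(d) for d in dates if d]
    return max(parsed).isoformat() if parsed else None


def build_works_url(since: Optional[str], cursor: str = "*") -> str:
    """Build a works query for open access records published on or after `since`."""
    filters = ["is_oa:true"]
    if since:
        # Assumption: the filter is inclusive, so works from the `since` date
        # itself come back again and must be deduplicated downstream.
        filters.append(f"from_publication_date:{since}")
    params = {"filter": ",".join(filters), "per-page": 200, "cursor": cursor}
    return f"{OPENALEX_WORKS_URL}?{urlencode(params)}"
```

On a first run there are no stored dates, so `latest_publication_date` returns `None` and the URL carries only the open access filter. Later runs pass the stored maximum date and fetch only newer works.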

02_download_pdfs.py
deps: brick/open_alex_open_access_pdfs.parquet
outs: brick/open_access_pdfs.pdf/*, brick/open_access_pdfs.parquet
Script that takes the URLs from stage 01 and downloads all the PDFs to the open_access_pdfs directory. It should also store metadata in open_access_pdfs.parquet (for example, linking each download URL to the path of the downloaded PDF). Each PDF should be saved under a filename based on a content hash of the file. In the future, this stage may depend on other stages that use other methods of finding open access PDFs.
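The content-hash naming could look roughly like the sketch below. It assumes SHA-256 as the hash (the issue does not name one) and takes the PDF bytes as already downloaded. The function name `save_pdf` and the record keys `url`, `sha256`, and `path` are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Dict


def save_pdf(content: bytes, url: str, out_dir: Path) -> Dict[str, str]:
    """Write PDF bytes to <out_dir>/<sha256>.pdf and return a metadata record."""
    digest = hashlib.sha256(content).hexdigest()
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{digest}.pdf"
    # Identical content always maps to the same filename, so a rerun or a
    # second URL serving the same file does not write a duplicate copy.
    if not path.exists():
        path.write_bytes(content)
    return {"url": url, "sha256": digest, "path": str(path)}
```

The returned records link each download URL to its file path. They could be collected into a table and written to open_access_pdfs.parquet, for instance with pandas.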

mahinth1 commented 1 week ago

Stage 1: 01_get_openaccess.py; check_downloaded_url.py (remove duplicates)

Stage 2: 02_download.py

mahinth1 commented 1 week ago

The script to remove duplicates should be remove_duplicates.py.
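The thread does not show the actual remove_duplicates.py. As a rough sketch of the idea only, the function below assumes the candidate URLs and the already downloaded URLs are plain strings compared by exact match.

```python
from typing import Iterable, List, Set


def remove_duplicates(
    candidate_urls: Iterable[str], downloaded_urls: Iterable[str]
) -> List[str]:
    """Return candidate URLs not yet downloaded, in order, without repeats."""
    seen: Set[str] = set(downloaded_urls)
    fresh: List[str] = []
    for url in candidate_urls:
        if url not in seen:
            seen.add(url)  # also drops repeats within the candidate list
            fresh.append(url)
    return fresh
```

Exact string matching treats URLs that differ only in a trailing slash or query parameter as distinct, so the content hash used when saving the PDFs remains the final guard against duplicate files.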

mahinth1 commented 2 days ago

PDFs are being downloaded.