Open FrancescoCasalegno opened 2 years ago
For half of the baseline (562 files):
For updates_files: all files between pubmed22n1115.xml.gz and pubmed22n1204.xml.gz (2021-12-13 to 2022-02-22 = 71 days):
Analysis between pubmed22n1124.xml (published on 2021-12-19) and pubmed22n1147.xml (published on 2022-01-11)
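The 71-day span quoted for the update files, and the number of files in that range, can be double-checked with a short sketch. The helper name `update_file_range` is hypothetical, and the sketch assumes the update files are numbered consecutively with no gaps, which this thread does not confirm.

```python
from datetime import date


def update_file_range(first: int, last: int) -> list[str]:
    """Return update file names from `first` to `last`, inclusive.

    Assumption: file numbers are consecutive with no gaps.
    """
    return [f"pubmed22n{i:04d}.xml.gz" for i in range(first, last + 1)]


# Days between the first and last publication dates quoted in the thread.
span_days = (date(2022, 2, 22) - date(2021, 12, 13)).days

# File names covered by the quoted range (both endpoints included).
files = update_file_range(1115, 1204)

print(span_days)   # 71
print(len(files))  # 90
```

Under that assumption the range covers 90 files over 71 days, so slightly more than one update file per day.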
Context
Once we have finished building the ETL pipeline that downloads, filters, and parses papers from the various sources (see #562), we need to run it for the first time to check that everything works and to collect statistics about the results.
Actions
Define the filter_config file by talking to scientists. https://github.com/BlueBrain/Search/blob/e2704e2413efddde016864612822d1f787e85dd4/src/bluesearch/entrypoint/database/topic_filter.py#L58-L63
Run the pipeline (with --from_date equal to the last month).
For each source (arxiv, biorxiv, pmc, ...) we want to know: