BlueBrain / Search

Blue Brain text mining toolbox for semantic search and structured information extraction
https://blue-brain-search.readthedocs.io
GNU Lesser General Public License v3.0

Run ETL pipeline and collect stats on downloads #572

Open FrancescoCasalegno opened 2 years ago

FrancescoCasalegno commented 2 years ago

Context

Once the ETL pipeline to download, filter, and parse papers from the various sources is complete (see #562), we need to run it for the first time to verify that everything works and to collect statistics about the results.

Actions

EmilieDel commented 2 years ago

Pubmed Analysis

Baseline files

For half of the baseline (562 files):

Updates Files downloaded

Global numbers

For updates_files: all files between pubmed22n1115.xml.gz and pubmed22n1204.xml.gz (2021-12-13 to 2022-02-22, i.e. 71 days):
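As a quick sanity check on the range above, the following minimal sketch enumerates the update file names and recomputes the day span. It assumes the range is inclusive on both ends and that the file numbers are consecutive; the helper `update_file_names` is hypothetical and not part of the pipeline.

```python
from datetime import date


def update_file_names(first: int, last: int, prefix: str = "pubmed22n") -> list[str]:
    """Build the update file names for an inclusive range of file numbers.

    Hypothetical helper: assumes numbering is consecutive with no gaps.
    """
    return [f"{prefix}{number:04d}.xml.gz" for number in range(first, last + 1)]


# File numbers and dates are taken from the comment above.
names = update_file_names(1115, 1204)
span_days = (date(2022, 2, 22) - date(2021, 12, 13)).days

print(f"{len(names)} files, from {names[0]} to {names[-1]}")
print(f"{span_days} days between the first and last publication dates")
```

Under these assumptions the range covers 90 files over 71 days, so a little more than one update file per day on average.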

What are the changes?

Analysis between pubmed22n1124.xml (published on 2021-12-19) and pubmed22n1147.xml (published on 2022-01-11)