BlueBrain / Search

Blue Brain text mining toolbox for semantic search and structured information extraction
https://blue-brain-search.readthedocs.io
GNU Lesser General Public License v3.0

Implement utils for downloading large amounts of papers #296

Open FrancescoCasalegno opened 3 years ago

FrancescoCasalegno commented 3 years ago

The goal of this ticket is to create capabilities to download large numbers of neuroscientific papers. Ideally these papers should be in a machine-readable format like text, JSON, HTML, or XML. More complex formats like PDF should be avoided for the moment, as they would require considerably more complex processing.

We may leverage any of the existing public APIs; see e.g. the UC Berkeley Library's page for a list of some of those APIs.

FrancescoCasalegno commented 3 years ago

Two things to be kept in mind:

  1. If possible, we want to select only papers related to a specific topic (e.g. "neuroscience"), so that we don't ingest many GB of material we don't really need into the database.
  2. If possible, we must try to download full texts and not just title + abstract + metadata.
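
A minimal sketch of how both points could be approached, assuming the NCBI E-utilities API against the PubMed Central (`pmc`) database. This thread does not name a specific API, so that choice is an assumption, as are the helper names (`build_search_url`, `parse_search_response`, `build_fetch_url`, `download_full_texts`) and the query string used for topic filtering. The search step restricts results to a topic and to open-access articles; the fetch step requests XML records, which contain the full text where the publisher permits it, rather than only metadata.

```python
import json
import time
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL of the NCBI E-utilities (assumed API choice, not from the ticket).
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_search_url(topic, retstart=0, retmax=100):
    """Build an esearch URL for open-access PMC articles matching a topic."""
    # The exact query syntax is an illustrative assumption.
    term = f'{topic} AND "open access"[filter]'
    params = {
        "db": "pmc",
        "term": term,
        "retstart": retstart,
        "retmax": retmax,
        "retmode": "json",
    }
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


def parse_search_response(payload):
    """Return (total hit count, list of PMC IDs) from an esearch JSON payload."""
    result = json.loads(payload)["esearchresult"]
    return int(result["count"]), list(result["idlist"])


def build_fetch_url(ids):
    """Build an efetch URL that returns XML records for the given PMC IDs."""
    params = {"db": "pmc", "id": ",".join(ids), "retmode": "xml"}
    return f"{EUTILS}/efetch.fcgi?{urlencode(params)}"


def download_full_texts(topic, max_papers=200, batch_size=100):
    """Yield one XML string per batch of articles (performs network requests)."""
    retstart = 0
    while retstart < max_papers:
        retmax = min(batch_size, max_papers - retstart)
        with urlopen(build_search_url(topic, retstart, retmax)) as resp:
            total, ids = parse_search_response(resp.read())
        if not ids:
            break
        with urlopen(build_fetch_url(ids)) as resp:
            yield resp.read().decode("utf-8")
        retstart += len(ids)
        if retstart >= total:
            break
        # NCBI asks clients without an API key to stay under 3 requests/second.
        time.sleep(0.7)
```

Calling `download_full_texts("neuroscience", max_papers=200)` would yield XML batches that can then be parsed and stored. Other sources would need their own URL builders, but the same search-then-fetch split should carry over.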