PaperScraper extracts text and metadata from scientific journal articles for use in NLP systems. In the simplest case, query by the URL of a journal article and receive a structured JSON object containing the article text and metadata. More robustly, query by a relevant attribute of the article (e.g. DOI or PubMed ID) and have an article URL found and extracted from automatically.
Retrieve structured journal articles in three lines:
```python
from paperscraper import PaperScraper

scraper = PaperScraper()
print(scraper.extract_from_url("https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3418173/"))
```
```json
{
  "title": "Gentamicin-loaded nanoparticles show improved antimicrobial effects towards Pseudomonas aeruginosa infection",
  "abstract": "...",
  "body": "...",
  "authors": {
    "a1": {"first_name": "Sharif", "last_name": "Abdelghany"},
    "a2": {"first_name": "Derek", "last_name": "Quinn"},
    "a3": {"first_name": "Rebecca", "last_name": "Ingram"},
    "a4": {"first_name": "Brendan", "last_name": "Gilmore"},
    "a5": {"first_name": "Ryan", "last_name": "Donnelly"},
    "a6": {"first_name": "Clifford", "last_name": "Taggart"},
    "a7": {"first_name": "Christopher", "last_name": "Scott"}
  },
  "doi": "10.2147/IJN.S34341",
  "keywords": [
    "anti-microbial",
    "gentamicin",
    "PLGA nanoparticles",
    "Pseudomonas aeruginosa"
  ],
  "pdf_url": "https://www.ncbi.nlm.nih.gov//pmc/articles/PMC3418173/pdf/ijn-7-4053.pdf"
}
```
Or use a domain-specific aggregator such as PubMed and let PaperScraper find the article link for you:
```python
from paperscraper import PaperScraper

scraper = PaperScraper()
print(scraper.extract_from_pmid("22915848"))
```
| Journal | Scraper |
|---|---|
| Science Direct | :heavy_check_mark: |
| PubMed Central (PMC) | :heavy_multiplication_x: |
| Springer | :heavy_check_mark: |
| American Chemical Society (ACS) | :heavy_multiplication_x: |
| Royal Society of Chemistry (RSC) | :heavy_check_mark: |
To contribute an additional scraper to PaperScraper, do the following (detailed instructions can be found in the section 'Example Contribution Development Set-up'):
Follow these formatting standards when developing a scraper:
The OrderedDict containing the paper body should be structured as follows:
```python
{
    "body": {
        "Name of section": {
            "Name of nested section": {
                "p1": "The raw text of first paragraph"
            },
            "p2": "Raw text of second paragraph"
        },
        "p3": "Raw text of third paragraph",
        "p4": "Raw text of fourth paragraph"
    }
}
```
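As a sketch of how a scraper might assemble this structure (the section names and paragraph text here are placeholders, not part of the PaperScraper API):

```python
from collections import OrderedDict

# Build the nested body structure described above.
# Section names come from the article; "p1", "p2", ... number
# paragraphs in document order.
result = OrderedDict()
result["body"] = OrderedDict()

section = OrderedDict()
nested = OrderedDict()
nested["p1"] = "The raw text of first paragraph"
section["Name of nested section"] = nested
section["p2"] = "Raw text of second paragraph"

result["body"]["Name of section"] = section
result["body"]["p3"] = "Raw text of third paragraph"
result["body"]["p4"] = "Raw text of fourth paragraph"
```

Using an `OrderedDict` preserves the order in which sections and paragraphs appear in the original article, which a plain pre-3.7 `dict` would not guarantee.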
We recommend using an IDE such as PyCharm to facilitate the contribution process. It is available for free if you are affiliated with a university. This contribution walk-through assumes that you are using PyCharm Professional Edition.
Run the command `python setup.py install` to install PaperScraper and its dependencies into your virtual environment. Ensure that you have an internet connection before running tests, as some tests require it.
To execute all tests, run the command `python setup.py test` from the top-level directory.
To execute a single test, run the command `nosetests -s <test_file_path>`. The `-s` flag allows print statements to print to the console. Please remove all print statements before submitting a pull request.
Check out the Nose testing documentation here.
If you are experiencing errors running tests, make sure Nose is running with Python 3.5 or greater. If it is not, Nose is likely not installed in your virtual environment; run `pip install nose -I` to install it correctly.
When writing tests, cover scraping from a few different correct and incorrect URLs, and check that key sections such as 'authors' and 'body' contain valid output. Please follow the naming convention for your test files, and refer to test_sciencedirect.py as a template for your own tests.
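As a sketch of the kinds of assertions such a test might make, here is a small helper checking the shape of an extraction result. The helper name and sample data are illustrative only and not part of PaperScraper; in a real test file (e.g. a hypothetical test_myjournal.py modeled on test_sciencedirect.py), the result would come from a real scraper call instead of a hard-coded dict.

```python
def assert_valid_extraction(result):
    """Check that key sections of a scraped article are present and non-empty."""
    for key in ("title", "authors", "body"):
        assert key in result, "missing key: %s" % key
        assert result[key], "empty value for key: %s" % key

# Stand-in for real scraper output (normally from scraper.extract_from_url(...)).
sample = {
    "title": "Example article",
    "authors": {"a1": {"first_name": "Jane", "last_name": "Doe"}},
    "body": {"Introduction": {"p1": "..."}},
}
assert_valid_extraction(sample)
```

A test built this way fails loudly when a scraper returns an empty 'authors' or 'body' section, which is the most common symptom of a site layout change.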
This package is licensed under the GNU General Public License.