bokulich-lab / q2-fondue

Functions for reproducibly Obtaining and Normalizing Data re-Used from Elsewhere
BSD 3-Clause "New" or "Revised" License

FIX: Correctly scraping accessions with whitespace characters #133

Closed adamovanja closed 2 years ago

adamovanja commented 2 years ago

This PR fixes cases where accession IDs were not scraped correctly from publications because whitespace characters appeared in the middle of the accession ID.
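To illustrate the kind of problem described above, here is a minimal sketch of whitespace-tolerant matching. It is not the code changed in this PR: the pattern `STUDY_ID_WITH_GAPS` and the helper `find_study_ids` are hypothetical names, and the sketch assumes study IDs made of an `SRP`/`ERP`/`DRP` prefix followed by at least six digits.

```python
import re

# Hypothetical pattern (an assumption, not q2-fondue's actual regex):
# an SRP/ERP/DRP prefix followed by six or more digits, where whitespace
# may appear between any two characters of the ID.
STUDY_ID_WITH_GAPS = re.compile(r"\b[SED]\s*R\s*P(?:\s*\d){6,}")


def find_study_ids(text):
    """Return unique study IDs in order of appearance, whitespace removed."""
    found = []
    for match in STUDY_ID_WITH_GAPS.finditer(text):
        # Drop any whitespace that split the ID in the source text.
        accession = re.sub(r"\s+", "", match.group())
        if accession not in found:
            found.append(accession)
    return found


if __name__ == "__main__":
    sample = "Reads are available under SRP13 2205 and ERP\n001911."
    print(find_study_ids(sample))
```

One caveat of this sketch: because the digit group is greedy, a number that directly follows an ID and is separated only by whitespace would be merged into it, so a real implementation would likely need stricter limits on ID length or on where whitespace is tolerated.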

Testing

Scrape a collection that contains one of these two publications: 10.3389/fmicb.2018.02755, 10.1186/s40168-015-0089-2

The former implementation would have returned:

With the changes, the following output should be obtained in study_ids.qza:

ID          DOI
SRP132205   ['10.3389/fmicb.2018.02755']
ERP001911   ['10.1186/s40168-015-0089-2']

codecov[bot] commented 2 years ago

Codecov Report

Merging #133 (fe8ff5b) into main (89a59dd) will increase coverage by 0.01%. The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #133      +/-   ##
==========================================
+ Coverage   95.62%   95.63%   +0.01%     
==========================================
  Files          15       15              
  Lines        1165     1168       +3     
  Branches      216      216              
==========================================
+ Hits         1114     1117       +3     
  Misses         26       26              
  Partials       25       25              
Impacted Files          Coverage Δ
q2_fondue/scraper.py    98.70% <100.00%> (+0.02%) :arrow_up:
