bokulich-lab / q2-fondue

Functions for reproducibly Obtaining and Normalizing Data re-Used from Elsewhere
BSD 3-Clause "New" or "Revised" License

FIX: metadata should only retain the runs that were requested #147

Closed misialq closed 1 year ago

misialq commented 1 year ago

This PR fixes the issue described on the Q2 forum: the metadata fetched for the requested run IDs contained erroneous entries for SRA samples without associated runs. When we request metadata from NCBI, we receive metadata not only for the samples that contain our runs but also for other samples that are part of the same experiment package. Those entries are missing most of the required information and can (and should) be dropped from the final results DataFrame.
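As a rough illustration of the filtering idea (not the actual q2-fondue code), the sketch below keeps only the metadata records whose run ID was explicitly requested. The function name `retain_requested_runs`, the record keys (`run_id`, `sample_id`, `library_layout`), and the placeholder IDs are all hypothetical; plain dictionaries stand in for DataFrame rows so the example needs nothing beyond the standard library.

```python
from typing import Dict, Iterable, List, Optional

# One metadata record per row; a missing value is represented by None.
Record = Dict[str, Optional[str]]


def retain_requested_runs(
    records: List[Record], requested_ids: Iterable[str]
) -> List[Record]:
    """Keep only the records whose run ID is among the requested IDs.

    Hypothetical helper: records for samples without an associated run
    carry no run ID (None) and are therefore dropped as well.
    """
    requested = set(requested_ids)
    return [rec for rec in records if rec.get("run_id") in requested]


# Placeholder data: the second record mimics a sample that came along with
# the same experiment package but has no run attached to it.
records = [
    {"run_id": "RUN_A", "sample_id": "SAMPLE_1", "library_layout": "PAIRED"},
    {"run_id": None, "sample_id": "SAMPLE_2", "library_layout": None},
    {"run_id": "RUN_B", "sample_id": "SAMPLE_3", "library_layout": "SINGLE"},
]

filtered = retain_requested_runs(records, ["RUN_A"])
print([rec["run_id"] for rec in filtered])
```

With a pandas DataFrame indexed by run ID, the same idea could be expressed with boolean indexing, for example `df[df.index.isin(requested_ids)]`; whether the PR does it this way is an assumption.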

codecov[bot] commented 1 year ago

Codecov Report

Merging #147 (fc7f4fd) into main (3ede75c) will increase coverage by 0.04%. The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #147      +/-   ##
==========================================
+ Coverage   98.57%   98.61%   +0.04%     
==========================================
  Files          29       29              
  Lines        2943     2959      +16     
==========================================
+ Hits         2901     2918      +17     
+ Misses         42       41       -1     
| Impacted Files | Coverage Δ |
|---|---|
| q2_fondue/entrezpy_clients/_efetch.py | 96.06% <100.00%> (+0.45%) :arrow_up: |
| q2_fondue/tests/test_efetch.py | 99.37% <100.00%> (+<0.01%) :arrow_up: |
| q2_fondue/tests/test_metadata.py | 99.69% <100.00%> (+0.01%) :arrow_up: |


misialq commented 1 year ago

Hey @adamovanja, I added the test you mentioned. Could you also test this with some small, unrelated set of IDs? Preferably sample IDs or some type other than runs, just to make sure everything works (I already tried with project IDs). Thanks!