Closed: adamovanja closed this issue 2 years ago.
The cause of this error is that we are mixing up `invalid_ids`, i.e. IDs that are not valid, with `missing_ids`, i.e. IDs that could not be fetched with Efetch.

`invalid_ids`: https://github.com/bokulich-lab/q2-fondue/blob/fa38d26580c651640f160c76a3933a4188460150/q2_fondue/metadata.py#L76

vs.

`missing_ids`: https://github.com/bokulich-lab/q2-fondue/blob/fa38d26580c651640f160c76a3933a4188460150/q2_fondue/metadata.py#L85-L87
After some investigation I found that when some run IDs are not fetched with Efetch, these `missing_ids` are simply absent from the response received from Efetch, and no error message is attached explaining why these IDs were not fetched.
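A minimal sketch of the distinction (variable contents are hypothetical) - `invalid_ids` come with a reason attached, while `missing_ids` can only be derived by set difference:

```python
# Hypothetical illustration of the two categories of failed IDs.
requested_ids = {"SRR000001", "SRR000002", "SRR000003"}

# invalid_ids: Efetch explicitly flags these, so a reason is available.
invalid_ids = {"SRR000003": "ID is invalid."}

# missing_ids: simply absent from the Efetch response - no error message
# is attached, so they can only be found by set difference.
fetched_ids = {"SRR000001"}  # IDs present in the received metadata
missing_ids = requested_ids - fetched_ids - set(invalid_ids)
print(missing_ids)  # {'SRR000002'}
```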
Hence, I see two options to go from here:
1) Drop the error message column in the `failed_runs` output artifact and delete the respective `SRAFailedIDs` type - essentially making `failed_runs` a `NCBIAccessionIDs` type.
2) Come up with a self-made error message (something like "No response from Efetch. Try again.") and keep the respective `SRAFailedIDs` type for the `failed_runs` output artifact (see the sketch after this list).
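For option 2, a rough sketch assuming `failed_runs` is assembled as a pandas DataFrame with an error-message column (column and variable names here are hypothetical):

```python
import pandas as pd

# Option 2 (sketch): attach a self-made error message to every run ID
# that was silently dropped from the Efetch response.
missing_ids = ["SRR000002", "SRR000005"]
failed_runs = pd.DataFrame(
    {"Error message": "No response from Efetch. Try again."},
    index=pd.Index(missing_ids, name="ID"),
)
print(failed_runs)
```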
@misialq: Any preferences from your side?
I don't think we should go for option 1: that would be throwing out information that can be useful. If I encounter an error while fetching something, I would still like to be able to look at what happened, as in some cases there may be something I could do about it.
In the case of option 2: do you mean that the response here is empty? https://github.com/bokulich-lab/q2-fondue/blob/fa38d26580c651640f160c76a3933a4188460150/q2_fondue/entrezpy_clients/_efetch.py#L529 There must be at least a `status` returned and, I'd expect, a `reason` too - although I think those would rather be contained in the `raw_response`: https://github.com/bokulich-lab/q2-fondue/blob/fa38d26580c651640f160c76a3933a4188460150/q2_fondue/entrezpy_clients/_efetch.py#L542-L543 In that case we should be able to get them out and use them as the respective error messages.
The response `response.getvalue()` contains the data obtained for the non-missing IDs and nothing for the `missing_ids`. In the respective `raw_response` the status is "200" and the reason "OK", as data was fetched - just not for all requested run IDs.

=> Fetching metadata for only some of the requested run IDs does not return an error for the missing IDs.

Hence, the way I see it, we only have the two options mentioned above.
Ohhhhhh, I remember now. 💡
So the reason why this is happening in the first place is that we are requesting way too many IDs and NCBI is just returning a subset (who knows why). Then we just repeat 20 times and hope we get them all. But if we don't, then we end up with some IDs still missing. This is the issue we already discussed and the solution to it is to change how we deal with those retries - I think it was this one: https://github.com/bokulich-lab/q2-fondue/issues/77.
Just to confirm, when you try it yourself, how many run IDs do you expect to fetch for your example dataset?
So, with the introduction of a `retmax` parameter we could reliably tell whether data for all run IDs of a batch was fetched; if not, we would make Efetch return an error, which could then be appended to the `invalid_runs` error messages of this batch?
For the above example dataset, there is a total of 10'000 run IDs that need to be fetched.
💡 ah, and I assume that these 10'000 run IDs are also set by the `retmax` parameter here: https://github.com/bokulich-lab/q2-fondue/blob/fa38d26580c651640f160c76a3933a4188460150/q2_fondue/entrezpy_clients/_pipelines.py#L61-L65
Something like that, but not quite. We should split all the IDs into small batches of size ~150-200 and loop over those (and set the `retmax` param to the batch size value). This way NCBI should always return the same number of IDs (equal to `retmax` in this case) and no IDs should be missing (unless there was an actual error, of course).

In other words, rather than looping 20 times and checking what is left (as we are doing now), we should first calculate those batches and loop over them instead (plus set the `retmax`). We can still fire a warning or error in case some IDs are missing, but technically it should not happen.

The `retmax` doesn't on its own guarantee that nothing will be missing - we need to combine it with small batches (as NCBI returns some random numbers if you request too many).
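A rough sketch of the batching scheme described above (`efetch_batch` is a hypothetical stand-in for the actual Efetch request):

```python
import warnings

BATCH_SIZE = 200  # ~150-200, as suggested above


def fetch_in_batches(run_ids, efetch_batch):
    # efetch_batch stands in for the real Efetch call; it should return
    # a {run_id: metadata} mapping for a single batch.
    metadata = {}
    for i in range(0, len(run_ids), BATCH_SIZE):
        batch = run_ids[i:i + BATCH_SIZE]
        # retmax equals the batch size, so NCBI should return exactly
        # len(batch) records for every request.
        response = efetch_batch(ids=batch, retmax=len(batch))
        metadata.update(response)
        missing = set(batch) - set(response)
        if missing:  # technically should not happen with small batches
            warnings.warn(f"IDs missing from Efetch response: {missing}")
    return metadata


# Usage with a fake fetcher that returns everything it is asked for:
fake = lambda ids, retmax: {i: {} for i in ids}
assert len(fetch_in_batches([f"SRR{n:06d}" for n in range(500)], fake)) == 500
```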
> 💡 ah, and I assume that these 10'000 run IDs are also set by the `retmax` parameter here

Yeah... But that doesn't work... 10'000 is way too many.
What do you mean by "way too many"? So the dataset has 51 valid study IDs. These could theoretically be linked to >=10'000 run IDs, no?
Yes, I mean that you can also set a `retmax` of a million, but in reality what NCBI returns is a few hundred at best... That's why I wrote above that 150-200 should be a good number, with which we can be certain they will always return all the requested data and there will be no missing IDs.
So, I added the `retmax` comment for fetching run IDs from other IDs to issue #77.

I suggest closing this PR by making sure missing IDs (not invalid IDs) are returned with a custom error message for now, until we address the above issue. Do you agree?
Hmmm, actually, I'm not sure whether it makes sense to have a temporary fix as a workaround for that issue - let's just fix the other issue instead and all is solved, no? (We know how to fix it and it's actually not very difficult.) When we solve #77 there should be no more missing IDs... Moreover, I'm now realizing that we should not return missing IDs instead of the invalid ones - if anything, they should be appended (we still want to show that there were some invalid ones, right?).
Yes, agreed. Sounds good 👍🏼
When fetching metadata for a list of study IDs with the action `get-metadata`, a list of failed run IDs is printed to stdout but the returned Q2 artifact `--o-failed-runs` remains empty.

Steps to reproduce:
1. Download study_ids.tsv.zip, containing a .tsv file with study IDs and associated DOI names.
2. Import it as an `NCBIAccessionIDs` artifact.
3. Run `get-metadata` with this `NCBIAccessionIDs` artifact.

Expected behaviour: The command returns `failed_metadata.qza` containing the run IDs for which the fetching of metadata failed.

Actual behaviour: The command runs without returning an error, printing the failed run IDs to stdout but not saving them to the `failed_metadata.qza` artifact.

Suggestion for resolving: Make sure `--o-failed-runs` is filled when some metadata can't be fetched.

Attachment: study_ids.tsv.zip