Closed: adamovanja closed this issue 2 years ago.
The cause of this error is that we are mixing up `invalid_ids`, i.e. IDs that are not valid, with `missing_ids`, i.e. IDs that could not be fetched with Efetch.

`invalid_ids`: https://github.com/bokulich-lab/q2-fondue/blob/fa38d26580c651640f160c76a3933a4188460150/q2_fondue/metadata.py#L76

vs.

`missing_ids`: https://github.com/bokulich-lab/q2-fondue/blob/fa38d26580c651640f160c76a3933a4188460150/q2_fondue/metadata.py#L85-L87
After some investigation I found that when some run IDs are not fetched with Efetch, these `missing_ids` are simply absent from the response received from Efetch, and no error message is attached explaining why these IDs were not fetched.
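A minimal sketch of the distinction (variable contents are hypothetical) - `invalid_ids` come with a reason attached, while `missing_ids` can only be derived by set difference:

```python
# Hypothetical illustration of the two categories of failed IDs.
requested_ids = {"SRR000001", "SRR000002", "SRR000003"}

# invalid_ids: Efetch explicitly flags these, so a reason is available.
invalid_ids = {"SRR000003": "ID is invalid."}

# missing_ids: simply absent from the Efetch response - no error message
# is attached, so they can only be found by set difference.
fetched_ids = {"SRR000001"}  # IDs present in the received metadata
missing_ids = requested_ids - fetched_ids - set(invalid_ids)
print(missing_ids)  # {'SRR000002'}
```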
Hence, I see two options to go from here:
1) Drop the error message column in the `failed_runs` output artifact and delete the respective `SRAFailedIDs` type - essentially making `failed_runs` a `NCBIAccessionIDs` type.
2) Come up with a self-made error message (something like "No response from Efetch. Try again.") and keep the respective `SRAFailedIDs` type for the `failed_runs` output artifact (see the sketch after this list).
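For option 2, a rough sketch assuming `failed_runs` is assembled as a pandas DataFrame with an error-message column (column and variable names here are hypothetical):

```python
import pandas as pd

# Option 2 (sketch): attach a self-made error message to every run ID
# that was silently dropped from the Efetch response.
missing_ids = ["SRR000002", "SRR000005"]
failed_runs = pd.DataFrame(
    {"Error message": "No response from Efetch. Try again."},
    index=pd.Index(missing_ids, name="ID"),
)
print(failed_runs)
```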
@misialq: Any preferences from your side?
I don't think we should go for option 1: that would be throwing out information that can be useful. If I encounter an error while fetching something, I would still like to be able to look at what happened, as in some cases there may be something I could do about it.
In the case of option 2: do you mean that the response here is empty? https://github.com/bokulich-lab/q2-fondue/blob/fa38d26580c651640f160c76a3933a4188460150/q2_fondue/entrezpy_clients/_efetch.py#L529 There must be at least a `status` returned and, I'd expect, a `reason` too - although I think those would rather be contained in the `raw_response`: https://github.com/bokulich-lab/q2-fondue/blob/fa38d26580c651640f160c76a3933a4188460150/q2_fondue/entrezpy_clients/_efetch.py#L542-L543 In that case we should be able to get them out and use them as the respective error messages.
The response `response.getvalue()` contains the data obtained for the non-missing IDs and nothing for the `missing_ids`. In the respective `raw_response` the status is "200" and the reason "OK", as data was fetched - just not for all requested run IDs.

=> Fetching metadata for only some of the requested run IDs does not return an error for the missing IDs.

Hence, the way I see it, we only have the two options mentioned above.
Ohhhhhh, I remember now. 💡
So the reason why this is happening in the first place is that we are requesting way too many IDs and NCBI is just returning a subset (who knows why). Then we just repeat 20 times and hope we get them all. But if we don't, then we end up with some IDs still missing. This is the issue we already discussed and the solution to it is to change how we deal with those retries - I think it was this one: https://github.com/bokulich-lab/q2-fondue/issues/77.
Just to confirm, when you try it yourself, how many run IDs do you expect to fetch for your example dataset?
So, with the introduction of a `retmax` parameter we could reliably tell whether data for all run IDs of a batch was fetched; if not, we would make Efetch return an error, which could then be appended to the `invalid_runs` error messages of this batch?
For the above example dataset, there is a total of 10'000 run IDs that need to be fetched.
💡 ah, and I assume that these 10'000 run IDs are also set by the `retmax` parameter here: https://github.com/bokulich-lab/q2-fondue/blob/fa38d26580c651640f160c76a3933a4188460150/q2_fondue/entrezpy_clients/_pipelines.py#L61-L65
Something like that, but not quite. We should split all the IDs into small batches of size ~150-200 and loop over those (and set the `retmax` param to the batch size value). This way NCBI should always return the same number of IDs (equal to `retmax` in this case) and no IDs should be missing (unless there was an actual error, of course).

In other words, rather than looping 20 times and checking what is left (as we are doing now), we should first calculate those batches and loop over them instead (plus set the `retmax`). We can still fire a warning or error in case some IDs are missing, but technically it should not happen.

The `retmax` doesn't on its own guarantee that nothing will be missing - we need to combine it with small batches (as NCBI returns some random numbers if you request too many).
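A rough sketch of the batching scheme described above (`efetch_batch` is a hypothetical stand-in for the actual Efetch request):

```python
import warnings

BATCH_SIZE = 200  # ~150-200, as suggested above


def fetch_in_batches(run_ids, efetch_batch):
    # efetch_batch stands in for the real Efetch call; it should return
    # a {run_id: metadata} mapping for a single batch.
    metadata = {}
    for i in range(0, len(run_ids), BATCH_SIZE):
        batch = run_ids[i:i + BATCH_SIZE]
        # retmax equals the batch size, so NCBI should return exactly
        # len(batch) records for every request.
        response = efetch_batch(ids=batch, retmax=len(batch))
        metadata.update(response)
        missing = set(batch) - set(response)
        if missing:  # technically should not happen with small batches
            warnings.warn(f"IDs missing from Efetch response: {missing}")
    return metadata


# Usage with a fake fetcher that returns everything it is asked for:
fake = lambda ids, retmax: {i: {} for i in ids}
assert len(fetch_in_batches([f"SRR{n:06d}" for n in range(500)], fake)) == 500
```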
> 💡 ah, and I assume that these 10'000 run IDs are also set by the `retmax` parameter here

Yeah... But that doesn't work... 10'000 is way too many.
What do you mean by "way too many"? So the dataset has 51 valid study IDs. These could theoretically be linked to >=10'000 run IDs, no?
Yes, I mean that you can also set a `retmax` of a million, but in reality what NCBI returns is a few hundred at best... That's why I wrote above that 150-200 should be a good number, with which we can be certain they will always return all the requested data and there will be no missing IDs.
So, I added the `retmax` comment for fetching run IDs from other IDs to issue #77.

I suggest closing this PR by making sure missing IDs (not invalid IDs) are returned with a custom error message for now, until we address the above issue. Do you agree?
Hmmm, actually, I'm not sure whether it makes sense to have a temporary fix as a workaround for that issue - let's just fix the other issue instead and all is solved, no? (We know how to fix it and it's actually not very difficult.) When we solve #77 there should be no more missing IDs... Moreover, I'm now realizing that we should not return missing IDs instead of the invalid ones - if anything, they should be appended (we still want to show that there were some invalid ones, right?).
Yes, agreed. Sounds good 👍🏼
When fetching metadata for a list of study IDs with the action `get-metadata`, a list of failed run IDs is printed to stdout but the returned Q2 artifact `--o-failed-runs` remains empty.

Steps to reproduce:
1. Download study_ids.tsv.zip, containing a .tsv file with study IDs and associated DOI names.
2. Import it as an `NCBIAccessionIDs` artifact.
3. Run `get-metadata` with this `NCBIAccessionIDs` artifact.

Expected behaviour: The command returns `failed_metadata.qza` containing the run IDs for which the fetching of metadata failed.

Actual behaviour: The command runs without returning an error, printing the failed run IDs to stdout but not saving them to the `failed_metadata.qza` artifact.

Suggestion for resolving: Make sure `--o-failed-runs` is filled when some metadata can't be fetched.

Attachment: study_ids.tsv.zip