EBI-Metagenomics / MGnifyR

R package for searching, downloading and analysis of EBI MGnify metagenomics data
https://ebi-metagenomics.github.io/MGnifyR/
Artistic License 2.0

doQuery() error with assemblies #27

Closed Cal-Thoma closed 4 months ago

Cal-Thoma commented 5 months ago

Hello, with the new version of this package I get an error when running doQuery() when my qtype is "assemblies".

    results = doQuery(client, 'assemblies', max.hits = 500, as.df=FALSE,
                                   accession = list(accession = accessions))

Error in match.arg(type, several.ok = FALSE) :
  'arg' should be one of "studies", "samples", "runs", "analyses", "biomes"
Calls: get_sample_ids -> doQuery -> doQuery -> .local -> match.arg
Execution halted

Any help would be greatly appreciated.

Thank you, Cal Thoma UMN

TuomasBorman commented 5 months ago

Hi!

Thanks for posting this issue and sorry for the delay. The removal was not intentional.

We have been updating the package for easier maintenance and additional features. Even though the function names have changed, all the old functionality should still work. Any feedback is welcome!
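
For reference, a call along these lines should work once "assemblies" is accepted as a type again (this mirrors your original query; the accession ID and max.hits value are placeholders, not a tested example):

library(MGnifyR)

mg <- MgnifyClient()

# Placeholder assembly accession, only for illustration
accessions <- "ERZ3455819"

results <- doQuery(mg, "assemblies", accession = accessions, max.hits = 500, as.df = FALSE)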

BR, Tuomas Borman

Cal-Thoma commented 5 months ago

Hello, thanks for the fix. I am finding another issue with doQuery(). For an accession vector of length 1, everything works as expected. However, longer character vectors of accessions do not work: doQuery() doesn't throw an error, but just ignores the accessions altogether and returns a random set of metadata matching my qtype. Thank you for your work thus far.

Best, Calvin Thoma

TuomasBorman commented 5 months ago

Hello!

Thank you so much again! I fixed the issue https://github.com/EBI-Metagenomics/MGnifyR/pull/31

The problem was that the results were not actually fetched one by one based on accession IDs. Instead, the code tried to fetch all results in one go, which is not possible; the accession IDs were turned into parameters named "accession1", "accession2", ..., "accessionN", which did not match the "accession" parameter in the API.
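
To illustrate the fix (a rough sketch of the idea only, not the actual package internals; the endpoint and accession IDs are just examples), the results are now fetched with one request per accession ID:

library(httr)

# Example accession IDs and endpoint, for illustration only
accessions <- c("MGYS00005292", "MGYS00005293")
url <- "https://www.ebi.ac.uk/metagenomics/api/v1/studies"

# One request per accession ID, so the "accession" parameter stays intact
res_list <- lapply(accessions, function(acc) {
  GET(url, query = list(format = "json", accession = acc))
})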

Sorry for the inconvenience. Please let me know if you find more issues; this information is very valuable.

-Tuomas

Cal-Thoma commented 5 months ago

Hi again, thanks for the fast response! Multiple accessions seem to be working. Another issue I am finding now is a timeout. Have you found the performance of doQuery() to be slower than in the previous version?

Error: url_fetch_memory(url, handle = handle) : Timeout was reached: [www.ebi.ac.uk] Operation timed out after 60008 milliseconds with 0 bytes received

-Cal

TuomasBorman commented 5 months ago

I reopened the issue so that I remember to investigate this (it might take a couple of days).

The bottleneck of the function is the data fetching, which has not changed significantly. The only difference is the timeout parameter, which specifies the waiting time in seconds:

doQuery(..., timeout = 60)

Your error comes from this command:

library(httr)
res <- GET("https://www.ebi.ac.uk/metagenomics/api/v1/studies",
           query = list(format = "json", accession = "MGYS00005292"), timeout(60))

Because your error shows a timeout after 0 bytes received, it implies that the server is not responding. However, there might be something in the MGnifyR package causing this issue, so I will investigate and come back to you as soon as possible.

-Tuomas

TuomasBorman commented 5 months ago

Hello @Cal-Thoma !

I wasn't able to find performance differences. However, I think I found something that might explain this.

You can use doQuery() (and the previous function) in multiple ways, and the best way depends on your goal. There are multiple parameters you can use to build the query, see: https://ebi-metagenomics.github.io/MGnifyR/reference/doQuery.html

If you are fetching samples (type = "samples"), the accession parameter refers to sample IDs. You can also fetch samples based on a study ID by specifying the study_accession parameter.

study_accession is passed directly to the API, so you can only specify one study ID. However, by specifying it you can fetch all the samples in that study.

You can also fetch the same samples by giving their sample IDs to the accession parameter, but then the samples have to be fetched one by one, which might take some time. Also, the server might start to respond slowly if there are lots of queries (though I'm not sure how they allocate resources to users).

mg <- MgnifyClient()

# Fetch samples based on sample IDs
sample_info <- doQuery(mg, "samples", accession = c("SRS3095329", "SRS3095328"))
sample_info

# Fetch samples based on study ID
sample_info2 <- doQuery(mg, "samples", study_accession = "MGYS00004521")
sample_info2

Does this answer your question?

-Tuomas

Cal-Thoma commented 5 months ago

Hi @TuomasBorman, I took a break from working on this and am coming back to it this week.

I have not experienced the same "0 bytes received" error; however, the performance of doQuery() is still noticeably different from the previous version of this package, and the fix you described above will not work for me. Maybe I should explain what I am using this package for, because I think it is fairly different.

I am trying to map metabolic networks in wastewater treatment plants across the globe. I have a database from MGnify with wastewater proteins, and after a BLAST search I have a list of those proteins and the MGnify assembly accessions they correspond to. This could be a list of 500 accessions, depending on how common the protein of interest is.

After searching for assembly metadata I have a data frame with sample accessions, but not study accessions. I then use the sample accessions to search for sample metadata, which I can use to map the samples where proteins of interest (or metabolic networks of interest) occur.
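
Roughly, the lookup looks like this (the accession values are just examples, and the column I pull the sample accessions from is a guess at whatever doQuery() returns):

library(MGnifyR)

mg <- MgnifyClient()

# Assembly accessions coming out of the BLAST step (example values)
assembly_ids <- c("ERZ3455819", "ERZ3455853")

# Step 1: assembly metadata, which includes the related sample accessions
assembly_meta <- doQuery(mg, "assemblies", accession = assembly_ids)

# Step 2: sample metadata for those samples
# (the column name here is a guess; I use whichever column holds the sample accession)
sample_ids <- unique(assembly_meta$sample_accession)
sample_meta <- doQuery(mg, "samples", accession = sample_ids)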

I write this because I think I am pushing doQuery() a lot harder than your examples do. I think the difference between this package and the previous version is your fix for fetching multiple accessions. Did the previous version need to query MGnify multiple times? I don't know if @beadyallen is still working on this package, but maybe he can confirm.

For now I think I will try to go back to using the previous version of this package. If the speed of the package is increased I will be happy to switch back!

Thank you, -Cal

TuomasBorman commented 5 months ago

Hmmmm, can you send me your exact query code (here or via email)? There is "no reason" why this updated version should be slower than the previous one; the core should still be mostly the same, and modifications that slow down the query are not intended.

Can you also give me the version numbers of the old, faster package and the new, slower one? I will try to recreate the issue so that it is easier to fix.

-Tuomas

TuomasBorman commented 5 months ago

I was able to reproduce the issue: the new version took about 6 times longer. I will come back to you when I have fixed the problem.

-Tuomas

TuomasBorman commented 4 months ago

Hello!

The performance of doQuery() is now significantly improved. doQuery() was built on the older mgnify_query() function, which looped through accession IDs one by one; now the function retrieves all accession IDs at once. I noticed that this is possible based on your code, thanks for the help.


library(MGnifyR)

id <- c("ERZ3455819", "ERZ3455853", "ERZ3455899", "ERZ3455898")
type <- "assemblies"

mg <- MgnifyClient()

res <- doQuery(mg, type, accession = id)

-Tuomas