cafferychen777 / ggpicrust2

Make Picrust2 Output Analysis and Visualization Easier
https://cafferychen777.github.io/ggpicrust2/
MIT License

pathway_annotation() long runtime, connecting to KEGG database #19

Open erikpark opened 1 year ago

erikpark commented 1 year ago

Question: Is there a general expected runtime for the KEGG database connection step of the pathway_annotation() command?

This is my first time using the package, and I am passing (what I think is) a relatively small number of features, 228, to the annotation step - yet it's been running at the "We are connecting to the KEGG database to get the latest results, please wait patiently." step for ~6 hours.

If this runtime is expected, would it be possible to download the annotations ourselves and pass them to the annotation() command locally?

Thanks for any help you can provide!

cafferychen777 commented 1 year ago

The expected runtime for the KEGG database connection step of the pathway_annotation() command can vary depending on the number of features and the network conditions. However, it's not uncommon for this step to take several hours, especially if the query involves a large number of pathways or genes.

In your case, if you have passed 228 features and it has been running for about 6 hours, it's possible that the KEGG database API limits the rate of retrieval, which can slow down the process.

Regarding your second question, it's technically possible to download the annotations and pass them to the annotation() command locally, but it requires some manual work and may not be straightforward. Moreover, KEGG database downloading requires purchasing a license, which can be expensive.

If you're running this analysis in a lab setting and have access to resources, it may be worth considering purchasing a license to speed up the process. Alternatively, you can try reducing the number of features or exploring other options to optimize the query.

I hope this helps! Let me know if you have any further questions.

erikpark commented 1 year ago

Thanks, this is helpful in narrowing down the issue!

I have left the process running for ~16 hours now, and it's still going. The R session has received about 113.4 MB during this step so far, so it's definitely doing something, but it's going at a rate of ~1 KB per second or so. To check that my machine (a Mac) wasn't the bottleneck, I also started running the same code on a subset of features (5 randomly selected "significant" ones) on a separate Windows PC at my office, which is on a different internet connection, and it is running at the same speed.

So I think you are correct that it is definitely the KEGG database itself which is limiting the traffic. It does also seem possible (based on my run on only 5 features) that the script is downloading the entire KEGG database each time - is that true?

In any case, do you know off the top of your head what a benchmark for this step should be in ideal conditions? I haven't heard any colleagues mention run-times approaching 24 hours, even when they ran larger datasets through this process.

Thanks again!

cafferychen777 commented 1 year ago

Hi,

Thank you for your email. I'm glad to hear that the suggestion I provided was helpful in narrowing down the issue.

To optimize the process, I suggest splitting the daa_results_df data frame into smaller chunks and pausing between downloads (for example with Sys.sleep()) so that the requests are less likely to be rate-limited. You can also try using parallel processing to speed up the download process.
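
As a rough sketch of the chunk-and-pause idea, assuming the annotation call can be run on any subset of rows: the helper below (annotate_in_chunks, a hypothetical name that is not part of ggpicrust2) splits a data frame into fixed-size chunks, applies a user-supplied function to each chunk, and waits between chunks. A stand-in annotator is used so the example runs offline; the chunk size and pause length are arbitrary and not tuned to KEGG's actual limits.

```r
# Hypothetical helper (not part of ggpicrust2): run annotate_fn on
# fixed-size chunks of df, sleeping pause_sec seconds between chunks.
annotate_in_chunks <- function(df, annotate_fn, chunk_size = 10, pause_sec = 1) {
  stopifnot(is.data.frame(df), chunk_size >= 1)
  if (nrow(df) == 0) return(df)

  # Rows 1..chunk_size go to chunk 1, the next chunk_size rows to chunk 2, ...
  chunk_id <- ceiling(seq_len(nrow(df)) / chunk_size)
  chunks <- split(df, chunk_id)

  results <- vector("list", length(chunks))
  for (i in seq_along(chunks)) {
    results[[i]] <- annotate_fn(chunks[[i]])
    # Pause between chunks, but not after the last one
    if (i < length(chunks)) Sys.sleep(pause_sec)
  }

  out <- do.call(rbind, results)
  rownames(out) <- NULL
  out
}

# Offline demonstration with a stand-in annotator (no network access needed)
demo <- data.frame(feature = sprintf("K%05d", 1:25))
fake_annotate <- function(chunk) {
  chunk$pathway_name <- paste("name for", chunk$feature)
  chunk
}
annotated <- annotate_in_chunks(demo, fake_annotate, chunk_size = 10, pause_sec = 0)
```

With the real package, annotate_fn would presumably wrap the annotation call, for example function(chunk) pathway_annotation(pathway = "KO", daa_results_df = chunk, ko_to_kegg = TRUE). Whether this actually avoids the slowdown reported in this thread has not been verified here.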

Regarding the benchmark for this step, it is difficult to give an exact number as it depends on various factors such as internet speed, server load, and computer specifications. However, I would say that the download should not take more than a few hours. If you are experiencing run-times approaching 24 hours, there may be other underlying issues that need to be addressed.

I hope this helps, and please let me know if you have any further questions or concerns.

Best regards, Chen Yang

bingli2019 commented 9 months ago

When I run the package with the example data, an error occurs while connecting to the KEGG database:

Starting pathway annotation...
DAA results data frame is not null. Proceeding...
KO to KEGG is set to TRUE. Proceeding with KEGG pathway annotations...
We are connecting to the KEGG database to get the latest results, please wait patiently.

Processing pathways in chunks...

| | 0%
Error in curl::curl_fetch_memory(url, handle = handle) : Failure when receiving data from the peer

cafferychen777 commented 9 months ago

Dear @bingli2019,

Thank you for using my R package and reporting this issue. The error you are seeing when connecting to the KEGG database is likely due to your local network conditions.

If you are in China, I would recommend using a VPN to access the KEGG databases outside of the Chinese internet firewalls. Many VPN services have free trials you could use just for accessing the databases.

If you are not in China, try switching to a different internet connection or network to rule out any issues with your current one. For example, use your phone's cellular data or connect to a different WiFi network.

The KEGG database connectivity can sometimes be sensitive to certain networks, likely due to bandwidth restrictions or connectivity issues. A VPN or alternate network often resolves this.
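
For transient failures such as the curl error reported above, one option is to retry the call a few times with an increasing wait between attempts. This is only a sketch: with_retry is a hypothetical helper, not part of ggpicrust2, the retry count and wait times are arbitrary, and retrying cannot help when the network blocks KEGG entirely.

```r
# Hypothetical helper (not part of ggpicrust2): call fn() up to max_tries
# times, doubling the wait after each failure (base_wait, 2*base_wait, ...).
with_retry <- function(fn, max_tries = 5, base_wait = 2) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(fn(), error = function(e) e)
    if (!inherits(result, "error")) return(result)

    # Out of attempts: re-signal the last error to the caller
    if (attempt == max_tries) stop(result)

    wait <- base_wait * 2^(attempt - 1)
    message(sprintf("Attempt %d failed (%s); retrying in %g s",
                    attempt, conditionMessage(result), wait))
    Sys.sleep(wait)
  }
}

# Offline demonstration: a function that fails twice, then succeeds
calls <- 0
flaky <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("Failure when receiving data from the peer")
  "ok"
}
res <- with_retry(flaky, max_tries = 5, base_wait = 0)
```

In practice fn would wrap whichever call talks to KEGG, so that a single dropped connection does not abort the whole run.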

Please let me know if the issue persists after trying the above suggestions. I'm happy to help troubleshoot further. Thank you again for using my package!

Best regards, Chen YANG

bingli2019 commented 9 months ago

Thank you for your suggestion. I tried using a VPN for this step, but it failed again. I will try once more. Is there a local copy of KEGG, or somewhere to download it, so that I don't have to connect to the online database?