KechrisLab / multiMiR

Development repository for the multiMiR database's R API

Handle massive queries #35

Open JoseCorCab opened 4 years ago

JoseCorCab commented 4 years ago

I am trying to perform massive analyses of miRNA targets and I have found your tool very useful. However, I am working on a computation cluster where only the login node has internet access, and jobs on that node have a limited execution time. Because your tool downloads the queries and formats all the data into the multiMiR layout for each database within the same job, I have a lot of trouble completing big queries or covering all the databases. I have tried splitting big queries into many small chunks, but that attempt was not successful. Can you recommend a solution for my problem?

I suggest splitting the get_multimir function into different modes: i) query all databases and save the raw data to a temporary folder; ii) pass that temporary folder to get_multimir so the function formats the data into the multiMiR layout instead of downloading it. That way I could download the data without the CPU workload and then process it all on my cluster without interruption issues. It would also be a great feature if a user could make a generic query, download it, and then run more specific queries on the downloaded data while tuning the parameters, without stressing the database APIs queried by your package, so the user could find their ideal settings faster.

smahaffey commented 4 years ago

Thank you @JoseCorCab. We will talk about creating a cached search that you can save, load, and filter later. That's a good idea: for large searches you could do more with the results without rerunning them. In the meantime, I think you might be able to accomplish this by running the searches repeatedly and joining the tables into a larger table that you write to a file. You could do this ahead of time, off the cluster, and write the table to a text file or .RData file that you read when you start running on the cluster. This would effectively accomplish what you are requesting, if I'm understanding correctly. If you have trouble with any of the queries timing out, please provide an example and I will look into it. The data transfer can sometimes take extra time, but any of the individual queries should be relatively quick. If they aren't, then I need to look at that problem as well.
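A minimal sketch of that workflow, assuming `get_multimir()` returns an S4 object with a `@data` slot (as in recent multiMiR versions); the miRNA IDs and file names are only illustrative:

```r
## Run the searches off-cluster, combine the tables, and save for later use.
library(multiMiR)

mirnas <- c("hsa-miR-18a-3p", "hsa-miR-22-5p", "hsa-miR-124-3p")  # example IDs

## Query each miRNA separately so a single failure doesn't lose everything
results <- lapply(mirnas, function(m) {
  res <- get_multimir(org = "hsa", mirna = m, table = "validated")
  res@data
})

## Join into one table and write it to disk
all_hits <- do.call(rbind, results)
saveRDS(all_hits, "multimir_validated_targets.rds")

## Later, on the cluster (no internet access needed):
## all_hits <- readRDS("multimir_validated_targets.rds")
## subset(all_hits, target_symbol == "KRAS")
```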

For a custom query, you should already be able to submit one. Please look at the documentation for search_multimir(query); it should allow you to create your own queries and submit them. I understand the desire to create a local copy of those results that you can query further. I don't think it's technically challenging to implement this, so I will bring it up as well. Thank you.
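A rough example of such a direct query; the table and column names below follow the pattern in the package vignette and are assumptions that should be checked against the current schema (e.g. with `multimir_dbSchema()` or `multimir_dbTables()`):

```r
## Direct SQL query against the multiMiR database via search_multimir()
library(multiMiR)

qry <- "SELECT m.mature_mirna_acc, m.mature_mirna_id,
               t.target_symbol, i.experiment
        FROM mirna AS m
        INNER JOIN mirecords AS i ON (m.mature_mirna_uid = i.mature_mirna_uid)
        INNER JOIN target    AS t ON (i.target_uid = t.target_uid)
        WHERE m.mature_mirna_id = 'hsa-miR-18a-3p'"

hits <- search_multimir(query = qry)
head(hits)

## The returned data frame can be saved locally and filtered repeatedly
## without re-querying the web service:
## saveRDS(hits, "mirecords_miR18a.rds")
```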

smahaffey commented 4 years ago

One other idea, based on another request: maybe we can offer the option to store the result with a unique ID on the server so you could just retrieve the results later. However, I think the transfer of a large query should usually be the slower part of the query.

JoseCorCab commented 4 years ago

Hello, thanks for the suggestion. My problem arises because an R loop is treated as a single job on the cluster, and the login node has a time limit for each job. I will try making small individual queries (each inside an R script) within a bash loop; that way each query is a separate job and shouldn't be cut off.
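A possible sketch of that setup, with one short R script invoked once per chunk from a bash loop or the cluster scheduler; the file names, chunk count, and the `@data` slot access are assumptions to adapt:

```r
## query_chunk.R -- run once per chunk so each query is its own short job.
## Invoked from bash, e.g.:
##   for i in $(seq 1 10); do
##     Rscript query_chunk.R $i     # or submit each iteration as a cluster job
##   done
library(multiMiR)

args     <- commandArgs(trailingOnly = TRUE)
chunk_id <- as.integer(args[1])
n_chunks <- 10

mirnas <- readLines("all_mirnas.txt")   # one miRNA ID per line (illustrative file)
chunks <- split(mirnas, cut(seq_along(mirnas), n_chunks, labels = FALSE))

res <- get_multimir(org = "hsa", mirna = chunks[[chunk_id]], table = "validated")

## Save this chunk's results; the .rds files can be combined afterwards with
## do.call(rbind, lapply(list.files(pattern = "^chunk_.*\\.rds$"), readRDS))
saveRDS(res@data, sprintf("chunk_%02d.rds", chunk_id))
```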

> One other idea, based on another request: maybe we can offer the option to store the result with a unique ID on the server so you could just retrieve the results later. However, I think the transfer of a large query should usually be the slower part of the query.

I think that this is not a good option because it would overload your server, which could penalize other users.

Thanks for everything!!