compomics / searchgui

Highly adaptable common interface for proteomics search and de novo engines
http://compomics.github.io/projects/searchgui.html

Andromeda is slow in 2.0.4 on Windows, low CPU and RAM utilization #59

Closed lparsons42 closed 8 years ago

lparsons42 commented 8 years ago

I've noticed that when I run searches in SearchGUI with multiple engines, the Andromeda part of the search seems to be the slowest, and not for any obvious reason. If I watch it in Task Manager, "andromedacmd" is almost always at 0% CPU and using only a few MB of memory. On this system I have searches set to use 16 threads and Java set to use up to 40GB of RAM.

Is there anything I can do to speed up Andromeda? Is there a reason it is using so little CPU and RAM? I do have a javaw.exe and a java.exe that are each using 25% CPU and ~5-6GB of RAM, which I presume relate to SearchGUI in general.

hbarsnes commented 8 years ago

The Andromeda support in SearchGUI is in a beta state and we are still getting used to it ourselves. Maybe Marc can provide you with more information (as he was the one who implemented our Andromeda support), but as far as I understand, we know very little about the inner workings of Andromeda and how to optimize its use.

I'd recommend contacting the Andromeda developers, see http://141.61.102.17/maxquant_doku/doku.php?id=maxquant:andromeda. For example, there is a Google Group where you can ask these sorts of questions.

And we'd of course be very happy to make any adjustments to our use of Andromeda in SearchGUI if you can find ways of speeding it up.

hbarsnes commented 8 years ago

One thing I forgot to mention, you could have a go at optimizing the advanced Andromeda parameters. These are available in SearchGUI either via the settings icon after the Andromeda line in the GUI, or via the command line options (http://compomics.github.io/compomics-utilities/wiki/identificationparameterscli.html#andromeda-advanced-parameters).

There is no guarantee that all combinations of the advanced settings will work though.

Finally, note that the actual Andromeda command line used is always printed to the SearchGUI log file (located in the SearchGUI resources folder), which should make it easier to communicate any issues/questions to the Andromeda Google Group.

mvaudel commented 8 years ago

Hi,

There is not much to add I am afraid. Andromeda was only recently made available outside MaxQuant and there seems to be room for improvement in terms of performance. The fact that it uses very little CPU for long periods seems to be due to indexing steps which cannot be run in parallel or are limited by IO. We are working on it together with the developers and, as Harald mentioned, provide this implementation in beta only. Also, note that it is not guaranteed that Andromeda will give you a substantial number of additional hits, in which case it may not be worth the wait. If you could test this, we would be happy to get some feedback.

Best regards,

Marc

lparsons42 commented 8 years ago

Thank you for the reply Marc. That actually makes a lot of sense compared to what I have seen before with MaxQuant on its own. I actually included the Andromeda search just so I could compare the Andromeda runs to the other engines, and compare the Andromeda runs to MaxQuant runs of the same data, to see how it behaves in my hands. In previous cases I do recall that building the search db for Andromeda was always quite slow.

It does appear that right now it rebuilds the db for each sample. Since one db is used for the whole run, would it be possible to build it once and then reuse it for subsequent searches? That would save a lot of time right there.

thank you, Lee
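A minimal sketch of the index-reuse idea described above, i.e. build the Andromeda index once per database/parameter combination and hand the same folder to every subsequent mgf file. All names here are hypothetical; this is not SearchGUI's actual code.

```java
import java.io.File;
import java.util.Objects;

// Sketch only: cache the slow index-building step so it runs once per
// (database, parameters) combination instead of once per mgf file.
public class IndexCacheSketch {

    static File getOrBuildIndex(File fastaFile, String parametersKey, File cacheRoot) {
        // one cache folder per database/parameters combination
        String key = fastaFile.getName() + "_"
                + Integer.toHexString(Objects.hash(fastaFile.getAbsolutePath(), parametersKey));
        File indexFolder = new File(cacheRoot, key);
        if (!indexFolder.exists()) {
            indexFolder.mkdirs();
            buildIndex(fastaFile, indexFolder); // the slow step, now run only once
        }
        return indexFolder; // reused for every subsequent mgf file in the run
    }

    static void buildIndex(File fastaFile, File indexFolder) {
        // placeholder for the actual (external) Andromeda index generation
        System.out.println("Building index for " + fastaFile + " in " + indexFolder);
    }
}
```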

mvaudel commented 8 years ago

Hi Lee,

We would be very interested in the comparison of Andromeda inside and outside MaxQuant. Obviously we cannot reach the same performance because the spectra are not recalibrated, but it would be a good indicator of the quality of our implementation.

We tried to make it so that the index is built only once, but this did not really work out, and since the indexes are quite big we now delete them upon completion. I will discuss with Jürgen how we can improve this and get back to you.

Best regards,

Marc

lparsons42 commented 8 years ago

Out of curiosity - and perhaps this should be a separate question - why does SearchGUI, when presented with X files to run through Y engines, run all Y engines on each file first rather than all X files through each engine? This seems particularly inefficient with the Andromeda search, where the database construction is so incredibly slow.

hbarsnes commented 8 years ago

Not sure if we ever considered the difference between the two options. However, doing it one mgf file at a time has some advantages in terms of the handling of the mgf files and the organization of the output. I don't think there will be big differences in terms of performance though.

Regarding the Andromeda database construction (and similar search engine preprocessing like the Tide indexing), this is only done once per search and is thus independent of the number of mgf files searched.

lparsons42 commented 8 years ago

The performance - or perhaps rather the lack thereof - of Andromeda suggests that the search database is being rebuilt with each run of Andromeda. Each time Andromeda starts there is a substantial lag period where CPU utilization is exceptionally low (essentially 0% on my system) and RAM is also barely being used (I was seeing 8MB or less on a system with over 60GB available for processes). This activity matches well with the search database indexing/creation phase in MaxQuant. This phase has been taking several hours to build a database that other search algorithms build in far less time.

mvaudel commented 8 years ago

Hi Lee,

As Harald stressed, our implementation of Andromeda is still in beta and it will improve as we get to grips with the specificities of the algorithm. Iterating over the files or the algorithms first will have only a minor impact on the performance, I am afraid, since the performance depends on the number of times you run the tools and not on their order. I have talked with Jürgen on this topic and we are looking together at how to get SearchGUI to reuse the indexes as is done in MaxQuant. We will come back to you once a solution has been found :)

Best regards,

Marc

lparsons42 commented 8 years ago

I apologize if I came across as harsh in my previous comment. I understand that the implementation is beta and I did see that in the previous comment. I look forward to hearing more about what you find; if there is anything I can do to help, please let me know.

Prior to the newer versions of SearchGUI I had never used Andromeda outside of MaxQuant. I can tell you from my experience with MaxQuant that the database construction is often the longest part of the search (time-wise); it also seems from my experience that MaxQuant has an unfortunate habit of always deleting its database after the run, leaving the user to reconstruct it with each subsequent search. I don't know if that habit carries over to Andromeda on its own, as I have not tried to run Andromeda outside of MaxQuant. That was the reason I asked whether it would be more useful - at least in the case of Andromeda - to have all the searches of the same engine done in succession, as it could save database construction time.

Thank you! Lee

mvaudel commented 8 years ago

Hi again,

Sorry, I just noticed that I never came back to you on this. After discussing the issue with Jürgen (thanks for the help!), I have corrected our code, and in the latest versions the indexes are now kept between mgf files. It is now much faster on my side; can you confirm that it works as intended for you as well?

Best regards,

Marc

lparsons42 commented 8 years ago

Marc

Sorry for my delay in getting back to you on this. I think Andromeda is running quicker here than before, but still not as quick in SearchGUI as in MaxQuant (not sure if that is a reasonable goal or not). I have a job running right now in SearchGUI 2.1.4, and one thing I noticed is that the Andromeda search - which I believe shows up as "MsLibTask" in Task Manager - is still running as a single thread (I'm pretty sure MaxQuant will fork this part into multiple threads). The search I am running currently is using ~2GB of RAM (I have allocated 40GB to SearchGUI, although of course some searches won't be able to use that much RAM at any given moment).

The spectrum files that I have run so far in this set have taken anywhere from ~1.5 to over 13(!) hours each.

I don't know if there is any interest in this - and it may be counter to the goal of SearchGUI anyway - but has anyone looked into just importing the msms.txt file from MaxQuant into PeptideShaker? I don't think it is currently capable of doing that, but the file is generally nicely formatted CSV and there could be some benefit to being able to do this (particularly for labs such as ours who have done a large number of MaxQuant jobs over the years and still have the results around).

thank you Lee
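For reference, msms.txt is a delimited text table with a header row (tab-separated in the versions I have seen), so reading it is straightforward. Below is a minimal sketch; the column names are assumptions and can vary between MaxQuant versions, and this is not an existing PeptideShaker feature.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Sketch only: read msms.txt as a tab-separated table and print a few columns.
// The column names ("Raw file", "Scan number", "Sequence", "Score") are typical
// but should be verified against the header of your MaxQuant version.
public class MsmsTxtReaderSketch {

    public static void main(String[] args) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(Paths.get(args[0]))) {
            List<String> header = Arrays.asList(reader.readLine().split("\t"));
            int rawFileCol = header.indexOf("Raw file");
            int scanCol = header.indexOf("Scan number");
            int seqCol = header.indexOf("Sequence");
            int scoreCol = header.indexOf("Score");
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split("\t", -1); // keep empty trailing fields
                System.out.printf("%s scan %s: %s (score %s)%n",
                        fields[rawFileCol], fields[scanCol], fields[seqCol], fields[scoreCol]);
            }
        }
    }
}
```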

lparsons42 commented 8 years ago

I looked at how my current run is going - which is running only Andromeda through SearchGUI. It seems that my average run time per spectrum file will be somewhere around 2 hours. That isn't too bad, except I sent in some 38 files, which gives an expected total run time of around 3 days. I ran the same files through MaxQuant and it was done in an evening on the same system.

I'm not sure at this point where the speed advantage is coming from for MaxQuant. I did do the RAW->MGF conversion beforehand, so that should not be handicapping SearchGUI, as it can start with the MGF files instead of waiting for msconvert to do the conversion before starting (which could possibly add another day's worth of run time to the process when 38 files are involved).

mvaudel commented 8 years ago

Hi Lee,

The Andromeda we use is exactly the same as in MaxQuant and we run it the same way. So the difference I can see is that MaxQuant does some preprocessing, which might make the search run faster. When running in MaxQuant, do you see only one MsLibTask using all the CPU, or multiple instances sharing the CPU? I will ask Jürgen how he does it and to what extent we can reproduce it. Still, your search times are much longer than what we have on our side: searching a standard Q Exactive run against a human database with standard search settings takes 6 minutes, not 2 hours. Would it be possible for you to give it a try using the files available here and send me the HTML report:

http://vedlegg.uib.no/?id=53b07a01c00ee7bca4f980001138b87c

Best regards,

Marc

lparsons42 commented 8 years ago

Marc

As soon as I have a system available I will run those. I can tell you that on our system I am currently seeing only one MsLibTask instance, even though the system has 8 physical cores available. Our data came from an Orbitrap Velos, which I presume produces data generally comparable to your Q Exactive - at least from the perspective of the data density seen by Andromeda.

thank you Lee

mvaudel commented 8 years ago

Hi again,

I was wondering, are you using the same search parameters for both tests, i.e. the same database and the same modifications (fixed and variable)? Having more variable modifications could explain the difference :)

Best regards,

Marc

lparsons42 commented 8 years ago

Marc

The database I'm using for the Andromeda search is a different one. The parameters are similar, though. Perhaps more significantly, however, the search is dramatically quicker in MaxQuant for the same spectra, same db, and same parameters than it is in SearchGUI with Andromeda as the only search engine. The search I have running now on Andromeda, which I started on Tuesday, looks like it will run for ~5-6 days through SearchGUI; it would only take overnight in MaxQuant.

thank you Lee

mvaudel commented 8 years ago

Hi Lee,

That is very strange; I don't see what is going wrong there. It would be nice if you could try the files I sent you. Would it also be possible to send me the .apar file generated by MaxQuant (it should be in the Andromeda folder of MaxQuant) and the .par file used in SearchGUI?

Thank you,

Marc

lparsons42 commented 8 years ago

Marc

I have started a search using the files in the archive you linked to, in SearchGUI 2.1.4. At this point it has been running for more than 20 minutes; the last thing on the display is "Start Search". I see 4 tasks called "MsLibTask" running, one of which is using 1.5-2 GB of RAM, with none of the others using more than 500MB. CPU utilization is still rather low; the 4 tasks combined are at less than 20%. I have 40GB (forty) allocated to SearchGUI out of 60GB on the system. The search is still running at this point; I will let you know what it reports when it finishes.

thank you Lee

lparsons42 commented 8 years ago

Final time for that search was listed as 23 minutes 31 seconds.

mvaudel commented 8 years ago

Hi Lee,

I am very confused; this search takes 2 minutes on my 3-year-old laptop. There is definitely something slowing down the process. Any idea what could be different between the SearchGUI folder and the MaxQuant running folder? My best guess is that there is an IO limitation somewhere. Do you have all the files on the same disk? Can you make sure that they are not in synced folders, like personal folders mapped to a network drive or Dropbox?

Best regards,

Marc
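One quick way to test the IO hypothesis is to measure raw write throughput to the folder the search engines work in. The sketch below is a generic diagnostic, not part of SearchGUI; point it at whichever temp/working folder you want to check.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Sketch only: write 256 MB of zeros to a test file in the given folder and
// report the throughput. A very low MB/s strongly suggests a network drive or
// synced folder is slowing the search engines down.
public class TempFolderSpeedCheck {

    public static void main(String[] args) throws IOException {
        Path folder = Paths.get(args.length > 0 ? args[0] : System.getProperty("java.io.tmpdir"));
        Path testFile = Files.createTempFile(folder, "io_check_", ".tmp");
        ByteBuffer block = ByteBuffer.allocate(1 << 20); // 1 MB blocks
        int blocks = 256;
        long start = System.nanoTime();
        try (FileChannel channel = FileChannel.open(testFile, StandardOpenOption.WRITE)) {
            for (int i = 0; i < blocks; i++) {
                block.clear();
                while (block.hasRemaining()) {
                    channel.write(block);
                }
            }
            channel.force(true); // make sure the data actually reaches the disk
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.printf("Wrote %d MB to %s in %.1f s (%.1f MB/s)%n",
                blocks, folder, seconds, blocks / seconds);
        Files.delete(testFile);
    }
}
```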

lparsons42 commented 8 years ago

I confirmed that the files - at least the mgf files, the parameter file, and the database - are all on a drive that is local to the system. Does the number of threads (in this case 4, even though SearchGUI is set to 8) match what you see on your system?

It does seem that there is something configured incorrectly on our end at this point, but I haven't yet figured out what.

mvaudel commented 8 years ago

Hi again,

It is also important that you verify that the SearchGUI folder is local and not in a synced folder. This is the folder actually used by the search engines.

I am not sure what you mean by "the number of threads (in this case 4, even though SearchGUI is set to 8)". I don't think the MsLibTask processes are single threaded. In any case you should not worry too much about the threads: using 8 threads instead of 4 will not even halve the search time - and here you should be able to divide it by 10.

Best regards,

Marc

lparsons42 commented 8 years ago

The file system is a little odd on the system that I am running this search on, so I just moved every file and path that I could find relating to SearchGUI to a drive that is definitely local to the system. I suspect one thing that may have made a difference here is that SearchGUI wanted to use the home directory on this system - which is on a network volume - as the default temp space. Previously I had the SearchGUI folder on a local volume, but by default it wanted to use the home directory for temp space.

Now I can run the search in 12 minutes 58 seconds. Almost a 50% improvement over the previous number, though still more than twice the time you report from your laptop.

Throughout this search I saw mostly only one instance of MsLibTask running on the system (as reported by task manager).

mvaudel commented 8 years ago

Hi Lee,

Looks like we are going in the right direction :) Using folders synced to a network or file-sharing system can dramatically reduce the speed of the tools because the sync tries to back up every change. In the "Edit" -> "Resources Settings" menu you can set the paths of all temporary folders. Can you make a temporary folder on that local drive and set all paths to this folder (right-click on the table and select "set default path")? This will make sure that your user folder is not used. Note that the same applies to our other tools, especially PeptideShaker, where the database indexes are in the user folder by default.

Hope that helps,

Best regards,

Marc

lparsons42 commented 8 years ago

Marc

As another follow-up, I reran a search today but made one more adjustment that further improved the speed of the Andromeda searches (after the first file, that is*). The adjustment I made this time was to move the Andromeda temp folder from the "default" location (as stated in the parameters) to a location that is definitely local to the system. Now some searches complete in 10 minutes or less.

This leads me to a question: would it be sensible for future versions of SearchGUI to use subdirectories of the SearchGUI folder for temp space? It seems that SearchGUI - especially in the case of Andromeda and Comet - wants to use space within the user's home directory for temp space, while the other engines automatically set up under the directory where SearchGUI is installed.

* This is one issue I can't seem to get to the bottom of: the first search of a set is always really, really slow. I know part of it is building the search database, but even with your files it is really slow. A set I am running currently took ~1 hour 20 minutes for file 1, while one of the later files was done in seven minutes.

thank you Lee
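As an illustration of the suggestion above, a temp folder could be resolved relative to the folder the application jar lives in, falling back to the user home only when that location is not writable. This is only a sketch with hypothetical names, not how SearchGUI actually handles its paths.

```java
import java.io.File;
import java.net.URISyntaxException;

// Sketch only: prefer a temp folder next to the installed jar and fall back
// to the user home only if that location cannot be created.
public class TempFolderSketch {

    static File resolveTempFolder() throws URISyntaxException {
        File installFolder = new File(TempFolderSketch.class.getProtectionDomain()
                .getCodeSource().getLocation().toURI()).getParentFile();
        File candidate = new File(installFolder, "resources" + File.separator + "temp");
        if (candidate.isDirectory() || candidate.mkdirs()) {
            return candidate; // local to the installation, avoids synced home folders
        }
        return new File(System.getProperty("user.home"), ".searchgui_temp"); // fallback
    }

    public static void main(String[] args) throws URISyntaxException {
        System.out.println("Temp folder: " + resolveTempFolder());
    }
}
```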

mvaudel commented 8 years ago

Hi Lee,

Do I understand correctly that we managed to reduce your search time from several hours/days to less than ten minutes for the subsequent files? This is great news!

In general we try to keep the search engines away from the user folder, but it may be that our code defaults to it at some point; I will check this in detail for the next version. It seems that this is still the case for your first search, and I will make sure that it gets corrected.

Best regards and many thanks for all your feedback and testing,

Marc

lparsons42 commented 8 years ago

Indeed, things are working a lot better now. I hope that my comments have been helpful in some way, as I do think the work from your group is very valuable to the proteomics community. We are all aware of the challenges of processing mass spec data - including those that come from each manufacturer having their own closed data format - so I see your project as a great step towards evening out some of the problems that come from this. I really appreciate the addition of the Andromeda algorithm in particular, as its tight coupling to MaxQuant was occasionally an obstacle for our group.

Thank you! Lee

mvaudel commented 8 years ago

Dear Lee,

Thank you for your kind words. If you agree, I will close this issue now; if the problem is not solved, just ask and I will reopen it. And should you have trouble with our tools, don't hesitate to open a new thread!

Best regards,

Marc