compomics / compomics-utilities

Open source Java library for computational proteomics
http://compomics.github.io/projects/compomics-utilities.html

Indexing by PeptideMapper #35

Closed andrewjmc closed 4 years ago

andrewjmc commented 4 years ago

Hello,

It seems that PeptideMapper reindexes a fasta file every time it is searched. Is this correct? And if so, is it intended?

Thanks,

Andrew

dominik-kopczynski commented 4 years ago

Hi Andrew, I guess you are using a PeptideShaker version 1.6.xx? For these versions that is right: the index is always computed on the fly. But projects and databases keep growing, and this on-the-fly computation is no longer suitable. In the next major release, PeptideShaker 2.0.0, the index will be stored on disk and loaded when PS recognizes that you are using the same fasta file. We are trying to release PS 2.0 as soon as possible and hope to manage it in Q2 2020.

Cheers, Dominik

andrewjmc commented 4 years ago

Great, thanks! Any chance this might also include multi-threading for PeptideMapper (#34)?

This would make MetaNovo (https://github.com/uct-cbio/proteomics-pipelines/) more efficient.

Best wishes,

Andrew

dominik-kopczynski commented 4 years ago

Thank you for your suggestion. Actually, the mapping function is being called by PeptideShaker in multiple threads, but you are right, not in the standalone version. Will do it now, sounds like a good quick-to-realize improvement. Cheers

andrewjmc commented 4 years ago

Thanks Dominik. Slightly aside from this, I also saw that the core usage issue was indeed due to GC. I forced single-threaded GC and found that the process was very slow. Checking the logs, >90% of processor time was devoted to GC. I used an online tool and saw that the stack usage increases linearly (from a high baseline due to tags and database, I assume) during the search phase, and I presume this is due to matches being stored in RAM before being written in one go at the end. If I'm right, to improve the memory efficiency of PeptideMapper, it would be great if tags could be read in chunks and output could be written to disk as it is produced.
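
The "read tags in chunks, write output as you go" idea can be sketched roughly as follows. This is only an illustration of bounded-memory batching, not how PeptideMapper is implemented: the class name `BatchedTagProcessor`, the `batchSize` parameter and the `mapper` function (a stand-in for the real tag-to-protein mapping) are all assumptions made for the example.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical helper, not part of compomics-utilities.
final class BatchedTagProcessor {

    /**
     * Reads one tag per line, maps the tags in batches of batchSize and writes
     * each batch's results before reading the next batch, so at most batchSize
     * tags are held in memory. Returns the number of tags processed.
     */
    static long process(Reader tags, Writer results, int batchSize,
                        Function<String, String> mapper) throws IOException {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be at least 1");
        }
        BufferedReader in = new BufferedReader(tags);
        BufferedWriter out = new BufferedWriter(results);
        List<String> batch = new ArrayList<>(batchSize);
        long total = 0;
        String line;
        while ((line = in.readLine()) != null) {
            batch.add(line);
            if (batch.size() == batchSize) {
                total += writeBatch(batch, out, mapper);
            }
        }
        total += writeBatch(batch, out, mapper); // last, possibly partial batch
        out.flush();
        return total;
    }

    // Maps and writes every tag in the batch, then empties the batch.
    private static int writeBatch(List<String> batch, BufferedWriter out,
                                  Function<String, String> mapper) throws IOException {
        for (String tag : batch) {
            out.write(mapper.apply(tag));
            out.write('\n');
        }
        int written = batch.size();
        batch.clear();
        return written;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder mapper: echoes each tag; a real mapper would query the index.
        long n = process(new InputStreamReader(System.in),
                new OutputStreamWriter(System.out), 1000, tag -> tag);
        System.err.println("Processed " + n + " tags");
    }
}
```

With this shape, memory use depends on the batch size and the index rather than on the total number of tags, at the cost of more frequent writes.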

In my use case within MetaNovo, the issue is trying to run one instance of PeptideMapper per core (48) while not exceeding RAM (100 GB). I have had to creatively chunk up the fasta files (10,000 sequences) and tag files (2 million tags) to keep the memory usage within limits, with the small cost of repeated indexing of the FASTAs across 84 million tags.
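
For readers wanting to reproduce this kind of chunking, here is a minimal sketch of splitting a FASTA file into files of at most N sequences. It is not the MetaNovo code; the class name `FastaChunker`, the `chunk_<n>.fasta` naming and the command-line arguments are assumptions made for the example.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of compomics-utilities or MetaNovo.
final class FastaChunker {

    /** Splits a FASTA file into chunk files holding at most maxSequences entries each. */
    static List<Path> split(Path fasta, Path outDir, int maxSequences) throws IOException {
        if (maxSequences < 1) {
            throw new IllegalArgumentException("maxSequences must be at least 1");
        }
        Files.createDirectories(outDir);
        List<Path> chunks = new ArrayList<>();
        BufferedWriter out = null;
        int inChunk = 0;
        try (BufferedReader in = Files.newBufferedReader(fasta, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.startsWith(">")) { // a header line starts a new sequence
                    if (out == null || inChunk == maxSequences) {
                        if (out != null) {
                            out.close();
                        }
                        Path chunk = outDir.resolve("chunk_" + chunks.size() + ".fasta");
                        chunks.add(chunk);
                        out = Files.newBufferedWriter(chunk, StandardCharsets.UTF_8);
                        inChunk = 0;
                    }
                    inChunk++;
                }
                if (out != null) { // lines before the first header are skipped
                    out.write(line);
                    out.newLine();
                }
            }
        } finally {
            if (out != null) {
                out.close();
            }
        }
        return chunks;
    }

    public static void main(String[] args) throws IOException {
        // Assumed usage: java FastaChunker INPUT.fasta OUTPUT_DIR MAX_SEQUENCES
        List<Path> chunks = split(Paths.get(args[0]), Paths.get(args[1]),
                Integer.parseInt(args[2]));
        System.out.println("Wrote " + chunks.size() + " chunk files");
    }
}
```

Sequences are never split across chunk files, so each chunk can be indexed and searched on its own.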

Also, I'd love some advice on how to get different versions of Compomics Utilities working as a .jar.

I downloaded the latest version precompiled (4.8.3), but get this error:

Exception in thread "main" java.lang.NoClassDefFoundError: uk/ac/ebi/pride/tools/braf/BufferedRandomAccessFile
        at com.compomics.util.experiment.identification.protein_sequences.SequenceFactory.loadFastaFile(SequenceFactory.java:573)
        at com.compomics.util.experiment.identification.protein_inference.executable.PeptideMapping.main(PeptideMapping.java:56)
Caused by: java.lang.ClassNotFoundException: uk.ac.ebi.pride.tools.braf.BufferedRandomAccessFile
        at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)

This is using OpenJDK 1.8.0_66

I have also tried downloading an older version of new_backend when you improved the handling of X residues. I compiled this with mvn (mvn package) - no prior experience or knowledge here. However, it did not seem to include the com.compomics.util.experiment.identification.protein_inference.executable.PeptideMapping function, perhaps because it's just the backend.

I have played with introducing the modified FMIndex.java into the latest version of the main branch, but this fails with errors about -source numbers and diamonds and things about which I have no idea :-)
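
The exact compiler output is not quoted here, so the following is an assumption: errors mentioning "-source" and "diamond" usually mean that javac was run with a source level below 7, because the diamond operator `<>` was introduced in Java 7. The hypothetical class below shows the two equivalent forms; only the second needs source level 7 or higher. In a Maven build the level is normally set through the maven-compiler-plugin source/target configuration (or the `maven.compiler.source` and `maven.compiler.target` properties).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, unrelated to the actual FMIndex.java code.
final class DiamondExample {

    // Compiles at any source level that supports generics (5 and later).
    static List<String> explicitTypeArguments() {
        List<String> tags = new ArrayList<String>();
        tags.add("TAG");
        return tags;
    }

    // Uses the diamond operator, which requires source level 7 or higher.
    static List<String> diamondOperator() {
        List<String> tags = new ArrayList<>();
        tags.add("TAG");
        return tags;
    }

    public static void main(String[] args) {
        // Both forms build the same list; only the syntax differs.
        System.out.println(explicitTypeArguments().equals(diamondOperator()));
    }
}
```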

dominik-kopczynski commented 4 years ago

Thank you, great idea, Andrew. And I see that you are encouraging us to think big, not in MB orders of magnitude but in GB++ dimensions ;-) Good attitude. BTW: in new_backend, PeptideMapper now maps peptides and sequence tags in parallel. I will test later how much faster it runs on our 72-core beast.

andrewjmc commented 4 years ago

Superb! Thanks. How does a java/maven novice get this new_backend and run PeptideMapping? And is there a command line option to choose the number of threads?

dominik-kopczynski commented 4 years ago

Are you on Win / Linux / Mac? Which git tool are you using?

andrewjmc commented 4 years ago

I run stuff on a Linux-based HPC from a Windows machine. I have thus far avoided using any git tool, but there's nothing preventing me from starting now!

dominik-kopczynski commented 4 years ago

So on your Linux machine, please go to any directory and try the following commands:

    git clone https://github.com/compomics/compomics-utilities
    cd compomics-utilities
    git checkout new_backend
    mvn install
    cd target/utilities-5.X.X-foo/
    java -cp utilities-5.X.X-foo.jar com.compomics.cli.peptide_mapper.PeptideMapperCLI -p PATH/TO/FASTA_FILE.fasta PATH/TO/PEPTIDE_LIST.csv OUTPUT.csv

Of course, you have to adjust the last command and the 5.X.X to match your build. Feel free to use auto-completion by pressing the TAB key twice.

andrewjmc commented 4 years ago

Thanks, that was easier than I thought it might be. I now realise I had missed the different path to PeptideMapper (was using com.compomics.util.experiment.identification.protein_inference.executable.PeptideMapping).

This worked fine and coped with tags that had otherwise caused failures. Hooray!

If it's not too much to ask, could we have a command-line parameter to choose the number of threads? This would give me the maximum flexibility to use RAM and processors efficiently. Also, within an HPC environment there may often be more processors on the vnode than I have requested and am allowed to use. Thanks!

dominik-kopczynski commented 4 years ago

Sure, no problem, will code it today :-)

andrewjmc commented 4 years ago

Great, thanks!

dominik-kopczynski commented 4 years ago

Hey, the coding is done and the code is pushed to the new_backend branch. Peptide and sequence tag mapping now happens in parallel, and you can choose the number of cores with an additional -c parameter at the end of the command.

Example:

    java -cp utilities-5.X.X-foo.jar com.compomics.cli.peptide_mapper.PeptideMapperCLI -p PATH/TO/FASTA_FILE.fasta PATH/TO/PEPTIDE_LIST.csv OUTPUT.csv -c 4

By default, all available cores will be used. On my computer, there is a performance increase. However, please do not expect an n-fold speed-up when using n cores, since all cores share the same index in memory. I will close the thread now, but if you run into any issues please don't hesitate to reopen it :-) Cheers
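
As a rough illustration of the pattern described here (a capped number of worker threads all reading one shared index), the sketch below uses a fixed thread pool. It is not the PeptideMapperCLI implementation: the class name `ParallelMappingSketch` is hypothetical, and a plain list of protein strings with naive substring search stands in for the real index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch, not the compomics-utilities code.
final class ParallelMappingSketch {

    /** Returns, for each peptide in input order, the indices of the proteins containing it. */
    static List<List<Integer>> map(List<String> proteins, List<String> peptides, int threads)
            throws InterruptedException, ExecutionException {
        // The pool size plays the role of the -c parameter.
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<Integer>>> futures = new ArrayList<>();
            for (String peptide : peptides) {
                // Every task reads the same shared, read-only protein list.
                futures.add(pool.submit(() -> {
                    List<Integer> hits = new ArrayList<>();
                    for (int i = 0; i < proteins.size(); i++) {
                        if (proteins.get(i).contains(peptide)) {
                            hits.add(i);
                        }
                    }
                    return hits;
                }));
            }
            List<List<Integer>> results = new ArrayList<>();
            for (Future<List<Integer>> future : futures) {
                results.add(future.get()); // collected in submission order
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // With no argument, fall back to all available cores.
        int threads = args.length > 0
                ? Integer.parseInt(args[0])
                : Runtime.getRuntime().availableProcessors();
        List<String> proteins = Arrays.asList("MKTAYIAKQR", "GGTAYGGKQR");
        List<String> peptides = Arrays.asList("TAY", "MKT", "KQR");
        System.out.println(map(proteins, peptides, threads));
    }
}
```

Because the shared structure is only read, no locking is needed, but all threads still compete for the same memory bandwidth and caches, which is one reason the speed-up tends to be less than n-fold.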

andrewjmc commented 4 years ago

Perfect, thanks