cansyl / ECPred

GNU General Public License v3.0

command line tool upgrades? #1

Closed nextgenusfs closed 5 years ago

nextgenusfs commented 5 years ago

Hello, thanks for making your tool for assigning EC numbers available. I've tried the command line version and while it does seem to work, there are a number of issues that make it not very practical to use.

1. There is no (obvious) way to save the results to a file; they seem to get dumped in the ECPred directory.
2. The 20 sequence limit is not scalable to genome searches. For example, I've tried to split a larger fasta file into chunks of 20 sequences and then launch those jobs in parallel; however, it seems that the program then only outputs the last result. Related to point 1 above, the program should have the ability to save to a desired file name/location.
3. I can't seem to call the ECPred.jar file from any directory other than the install directory, which makes trying to run this tool on a fasta file quite difficult.
4. There is no help menu on the command line, so there is no way for the user to know what options are available.

Again, I don't mean to sound like I'm complaining -- but I'd like to be able to test your tool at assigning EC numbers to annotated genomes -- but in its current format I'm not sure it is flexible enough to be useful for that function.

Thanks, Jon

alperendalkiran commented 5 years ago

Dear Prof. Palmer,

Thank you for your interest in ECPred and for your kind comments. I have made a couple of changes to make ECPred more practical to use. Many of these changes overlap with your comments.

Since the Gmail antivirus scanner doesn't allow sending an email with a jar file attached, I have put the jar file in my Dropbox so that you can download it. Here is the direct download link for ECPred, https://www.dropbox.com/s/5xr7w9iwp8c3ok9/ECPred.jar?dl=1

1. You can now save your results in any directory. I've added two extra arguments: the first defines a temporary directory and the second gives the output location. SVM models, Pepstats results, etc. are stored in the temp directory. In the original version, all files in this directory were deleted automatically; in this version, you need to delete them manually. When typing the output path, you can also specify the output file name and extension.
2. I have removed the 20-protein restriction, so you can now test an unlimited number of proteins. However, 100 proteins per job should be enough, since predicting one protein usually takes around 1-2 minutes, depending on your system performance.
3. You can now run ECPred.jar from any directory.
4. I have added an example usage of ECPred to the jar file, which is also given below.

Example usage:

java -jar ECPred.jar inputFileName.fasta tempDir outputFile

Here, only the outputFile is optional. If you don't specify the output file name, the results will be printed to standard output.

If you have further comments and questions, we will be glad to have them.

Sincerely, Alperen Dalkıran

In reply to Jon Palmer notifications@github.com, who wrote on Sat, 3 Nov 2018 at 22:45.

nextgenusfs commented 5 years ago

Fantastic! Thanks for the quick turn-around time. I will give it a try today and let you know how it is working.

nextgenusfs commented 5 years ago

After a quick test, it seems that I'm getting errors when trying to execute from a different directory. Here is the command and the output:

$ java -jar /path/to/ECPred/ECPred.jar chunk_1.fasta chunk1 chunk_1.pred
Main classes of input proteins are being predicted ...
Exception in thread "main" java.nio.file.NoSuchFileException: lib/EC/1.-.-.-/spmap/profile.txt
        at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
        at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)
        at java.nio.file.Files.newByteChannel(Files.java:361)
        at java.nio.file.Files.newByteChannel(Files.java:407)
        at java.nio.file.spi.FileSystemProvider.newInputStream(FileSystemProvider.java:384)
        at java.nio.file.Files.newInputStream(Files.java:152)
        at java.nio.file.Files.newBufferedReader(Files.java:2784)
        at java.nio.file.Files.readAllLines(Files.java:3202)
        at java.nio.file.Files.readAllLines(Files.java:3242)
        at seq2vectPSSMtest.calculateVectors(seq2vectPSSMtest.java:114)
        at predictBatchSPMAP.main(predictBatchSPMAP.java:44)
        at runEC.predictions(runEC.java:26)
        at ECPpred.main(ECPpred.java:111)

Looks like it isn't finding the lib directory? Note that it does run if launched from the ECPred directory.

It is creating the temp directory, i.e.:

$ ls
chunk1  chunk_1.fasta

alperendalkiran commented 5 years ago

Yes, it can't find the lib directory. I have now added one more argument for the lib directory, so it should work. Here is the sample usage:

java -jar ECPred.jar inputFile libraryDir tempDir outputFile

For example,

java -jar /path/to/ECPred/ECPred.jar Desktop/ chunk_1.fasta chunk1 chunk_1.pred

Another example,

java -jar /path/to/ECPred/ECPred.jar /path/to/lib/ chunk_1.fasta chunk1 chunk_1.pred

Here, the output file is still optional, but you need to specify the lib directory. In the first example above, my lib directory is located under Desktop. Important note: you need to put a / (file separator) after the lib directory. Otherwise, the program can't figure out the path.

Here is the direct link for this version of ECPred, https://www.dropbox.com/s/5xr7w9iwp8c3ok9/ECPred.jar?dl=1

Could you please tell me if this version runs without error?

Thank you, Alperen Dalkıran

In reply to Jon Palmer notifications@github.com, who wrote on Tue, 6 Nov 2018 at 18:23.

nextgenusfs commented 5 years ago

Thanks @alperendalkiran, the updated jar file is working as you describe -- thanks again for the quick turn-around time. You could probably avoid needing the lib option on the command line: assuming that users keep ECPred.jar in the same location as the lib/ directory, you could determine the location of ECPred.jar, i.e. either from the command line or with the Java equivalent of which. The same could be true of the temp directory, i.e. you could store it in the Unix default /tmp dir and delete it after running the analysis. You may also want to take advantage of GitHub releases for releasing the jar file, then perhaps have an install script that downloads the necessary "data"? Of course, just a suggestion.
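Until something like that exists in the jar itself, the same idea can be approximated from the outside with a small wrapper. The following is only a rough sketch, not part of ECPred: `default_lib_dir` and `run_ecpred` are hypothetical names, it assumes that lib/ sits next to ECPred.jar, that the argument order is the one given earlier in this thread (inputFile libraryDir tempDir outputFile), and that ECPred accepts a temp directory that already exists.

```python
import os
import subprocess
import tempfile
from pathlib import Path


def default_lib_dir(jar_path):
    """Return the lib/ directory next to the jar, with a trailing separator."""
    jar = Path(jar_path).resolve()
    # Assumption: lib/ sits in the same directory as ECPred.jar.
    return str(jar.parent / "lib") + os.sep


def run_ecpred(jar_path, fasta, output, runner=subprocess.run):
    """Run ECPred once with a throwaway temp directory (hypothetical wrapper)."""
    lib_dir = default_lib_dir(jar_path)
    # TemporaryDirectory deletes the directory and its contents on exit.
    with tempfile.TemporaryDirectory(prefix="ecpred_") as temp_dir:
        cmd = ["java", "-jar", str(jar_path), str(fasta), lib_dir, temp_dir, str(output)]
        return runner(cmd, check=True)
```

The `runner` parameter is there only so the command can be inspected or logged without actually launching Java.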

My use case here is to run this on a fungal genome, i.e. ~10,000 proteins. Speed is certainly an issue, as you mention the searches take a bit of time. My approach will be to split the input file into chunks and then launch ECPred in parallel, i.e. 32 processes. Any improvements in run time would certainly make your tool more functional for a wide array of uses. Again, thanks for making your code available.
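A minimal sketch of that split-and-run approach is given here, under a few assumptions: `split_fasta` and `run_chunks` are hypothetical helper names, the chunk size of 100 follows the per-job figure suggested earlier in the thread, and `make_command` is a caller-supplied function that returns the full ECPred command for one chunk. Each chunk should get its own temp directory and output file (as with `chunk1` and `chunk_1.pred` above) so that parallel jobs do not overwrite each other.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def split_fasta(fasta_path, out_dir, chunk_size=100):
    """Write chunk_1.fasta, chunk_2.fasta, ... with at most chunk_size records each."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunks, handle, count = [], None, 0
    with open(fasta_path) as src:
        for line in src:
            if line.startswith(">"):
                # Start a new chunk file every chunk_size records.
                if count % chunk_size == 0:
                    if handle:
                        handle.close()
                    path = out_dir / f"chunk_{len(chunks) + 1}.fasta"
                    handle = open(path, "w")
                    chunks.append(path)
                count += 1
            if handle:
                handle.write(line)
    if handle:
        handle.close()
    return chunks


def run_chunks(chunks, make_command, workers=32):
    """Run one external process per chunk, at most `workers` at a time."""
    def run_one(chunk):
        return subprocess.run(make_command(chunk), check=True)

    # Threads are enough here: each one just waits on its child process.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_one, chunks))
```

Afterwards the per-chunk output files can simply be concatenated, assuming each one is self-contained.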

nextgenusfs commented 5 years ago

Actually, another question for you: is there a way to use your lib data to do a preliminary blast search to filter which proteins I would want to run ECPred on? I'm thinking that could significantly reduce runtime, if I could run a prefilter to only run ECPred on proteins that have loose homology to any of the enzyme classes.

alperendalkiran commented 5 years ago

Thank you for your suggestions. We will definitely consider them.

I've updated the ECPred.jar file according to your request. Now you can specify which method you want to run. I've added a method argument, which is mandatory. If you run Blast only, the run time is reduced significantly. Here is the usage:

java -jar ECPred.jar method inputFile libraryDir tempDir outputFile

The method argument can be one of the following: blast, spmap, pepstats, weighted.

outputFile is optional. If you don't specify the output file name, the results will be printed to standard output.

For example,

java -jar /path/to/ECPred/ECPred.jar blast chunk_1.fasta /path/to/lib/ chunk1 chunk_1.pred

or

java -jar /path/to/ECPred/ECPred.jar weighted chunk_1.fasta /path/to/lib/ chunk1 chunk_1.pred
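For scripted runs, the usage above could be wrapped in a small helper that checks the method name and adds the trailing file separator the lib directory needs. This is only an illustrative sketch: `build_ecpred_command` is a hypothetical name, and the argument order is taken from the usage line in this comment.

```python
import os

# Method names listed in the usage text above.
METHODS = ("blast", "spmap", "pepstats", "weighted")


def build_ecpred_command(jar, method, fasta, lib_dir, temp_dir, output=None):
    """Return the argument list for one ECPred run (hypothetical helper)."""
    if method not in METHODS:
        raise ValueError(f"method must be one of {METHODS}, got {method!r}")
    lib = str(lib_dir)
    # The lib directory must end with a file separator, as noted earlier in the thread.
    if not lib.endswith(os.sep):
        lib += os.sep
    cmd = ["java", "-jar", str(jar), method, str(fasta), lib, str(temp_dir)]
    if output is not None:
        # Without an output file, results are printed to standard output.
        cmd.append(str(output))
    return cmd
```

The returned list can be passed directly to `subprocess.run`.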

Here is the direct link for this version of ECPred, https://www.dropbox.com/s/5xr7w9iwp8c3ok9/ECPred.jar?dl=1

Sincerely, Alperen Dalkıran

In reply to Jon Palmer notifications@github.com, who wrote on Tue, 6 Nov 2018 at 19:58.

nextgenusfs commented 5 years ago

Thanks, so what is the default search methodology in previous versions?

alperendalkiran commented 5 years ago

The default method, called weighted, is a combination of three independent methods: blast, which is based on homology; spmap, which is based on subsequences; and pepstats, which is based on physicochemical features.

You may check our paper for more details about the methods, https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2368-y

In reply to Jon Palmer notifications@github.com, who wrote on Tue, 6 Nov 2018 at 21:22.

nextgenusfs commented 5 years ago

Perfect, thanks. So I can do a quick pass using blast and then take those hits to the weighted, that might speed up the whole genome search. Is there a chance that I would miss hits using the blast first that would be picked up by the other methods? If so, I can certainly let it run weighted on everything -- I'd rather it be correct than faster and not as accurate.
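The second pass would need a FASTA file containing only the blast hits. A rough sketch of that filtering step follows; `filter_fasta` is a hypothetical helper, and how the hit IDs are extracted from the blast-mode output depends on ECPred's output format, which is not described in this thread, so `keep_ids` is simply supplied by the caller.

```python
def filter_fasta(fasta_path, keep_ids, out_path):
    """Copy only records whose ID (first word after '>') is in keep_ids."""
    keep_ids = set(keep_ids)
    kept, writing = 0, False
    with open(fasta_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith(">"):
                fields = line[1:].split()
                # Decide per record whether its header and sequence lines are copied.
                writing = bool(fields) and fields[0] in keep_ids
                if writing:
                    kept += 1
            if writing:
                dst.write(line)
    return kept
```

The return value is the number of records written, which is a quick check that the ID list matched the headers as expected.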

alperendalkiran commented 5 years ago

Actually, Blast-kNN's performance alone is quite good; however, it is possible that blast misses a few hits that the other methods catch. If you wish to obtain the absolute best predictive performance (in terms of both recall and precision), we suggest using the weighted version. Also, you may try running the blast and weighted versions on the same small set to compare them with each other.
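Such a comparison could be tallied along the following lines. This is only a sketch under stated assumptions: `compare_predictions` is a hypothetical helper, and the two inputs are dictionaries mapping protein ID to predicted EC number that the reader builds from the two output files, since the output format is not described in this thread.

```python
def compare_predictions(blast_pred, weighted_pred):
    """Compare two {protein_id: ec_number} dicts from the same input set."""
    shared = blast_pred.keys() & weighted_pred.keys()
    return {
        # Same protein, same EC number in both runs.
        "agree": sorted(p for p in shared if blast_pred[p] == weighted_pred[p]),
        # Same protein, different EC number.
        "differ": sorted(p for p in shared if blast_pred[p] != weighted_pred[p]),
        # Hits that only the weighted run reported (what a blast prefilter would miss).
        "only_weighted": sorted(weighted_pred.keys() - blast_pred.keys()),
        "only_blast": sorted(blast_pred.keys() - weighted_pred.keys()),
    }
```

The size of `only_weighted` on a small test set gives a feel for how much a blast-first prefilter might lose.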

In reply to Jon Palmer notifications@github.com, who wrote on Tue, 6 Nov 2018 at 21:58.