carushi / ParasoR

Parallel solution for local RNA secondary structure analysis
https://github.com/carushi/ParasoR/wiki
GNU General Public License v2.0

Why does ParasoR read parameter files (e.g., for the Turner model) from the repository? #3

Closed heartsh closed 7 years ago

heartsh commented 7 years ago

Hi, the developer.

I'm using the software to compute base-pairing probabilities of ncRNAs. The copy installed in my own user directory (e.g., under $HOME/prgrms) reads the above files from the downloaded repository (e.g., under $HOME/dwnlds/parasor). I think they should be installed into the same directory as the software. (For example, if I cleaned up the repository, ParasoR would no longer run properly and would report missing files.) Or is there a configuration setting for this? I read the output of $ ./configure --help in the repository but could not find one.

And another question: can the software use multithreading? (For example, does it compute in parallel when the "--pre" option is given?) If not, I'd like you to implement it, since I think ParasoR is one of the RNA secondary structure tools that can scale to genome data.

carushi commented 7 years ago

Hi Heartsh,

Thank you for your interest in ParasoR! Unfortunately, I have fixed the location of the "energy_param" folder at the same level as the "src" directory in ParasoR to keep its directory structure easy to understand. Another reason is that ParasoR requires some directories, such as "outer" and "prob", to store temporary files during parallel computation, although they are unnecessary for non-parallel Rfold computation.

However, if you would like to load "energy_param" from a different location, there are three ways to change where the files are read from, as shown below.

1. Compile in $HOME/prgrms
If you compile ParasoR in $HOME/prgrms, ParasoR automatically loads the parameters from $HOME/prgrms/energy_param/.

2. Edit param.hh
This option should be simple. In param.hh, the paths of the energy parameter files are set at L95-101. You can replace these lines with paths of your choice.


file = HOMEDIR+string("energy_param/rna_turner2004.par")

# Please note that the default set of Turner energy parameters is an old one.

3. Set --energy option every time
Finally, the --energy option can be used to select any energy parameter file. This might be suitable if ParasoR is usually called from shell scripts.


ParasoR --energy $HOME/temp/energy_param/rna_turner2004_new.par
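For the third option, a small shell wrapper can avoid repeating the path on every call. This is only a sketch: `PARASOR_BIN` and `PARASOR_ENERGY` are names made up here (they are not ParasoR settings), and the default path is simply the one from the example above. Only the `--energy` option itself comes from this thread.

```shell
#!/bin/sh
# Hypothetical wrapper: always pass --energy so that ParasoR does not depend
# on an energy_param directory located next to src.
PARASOR_BIN="${PARASOR_BIN:-ParasoR}"
PARASOR_ENERGY="${PARASOR_ENERGY:-$HOME/temp/energy_param/rna_turner2004_new.par}"

parasor() {
    # Fail early with a clear message if the parameter file is missing.
    if [ ! -r "$PARASOR_ENERGY" ]; then
        echo "energy parameter file not readable: $PARASOR_ENERGY" >&2
        return 1
    fi
    # Forward all remaining arguments unchanged.
    "$PARASOR_BIN" --energy "$PARASOR_ENERGY" "$@"
}
```

With this function sourced in a script, a call such as `parasor -i 0 -k 2 ...` would run ParasoR with the `--energy` option placed in front of the given arguments.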

If these options do not suit your workflow, please let me know!


As for parallel computation, ParasoR uses the SPMD (single program, multiple data) technique. Although it is not based on multithreading, multiple jobs can be run on a single long sequence. Here is an example of parallel computation using two processes.

ParasoR -i 0 -k 2 ... # Run in cluster 1
ParasoR -i 1 -k 2 ... # Run in cluster 2 

ParasoR --connect ... # Run after the above two jobs have finished.

ParasoR --stemdb -i 0 -k 2 --stdout ... # Run in cluster 1
ParasoR --stemdb -i 1 -k 2 --stdout ... # Run in cluster 2

(I am sorry for the confusion, but the "--pre" option disables parallel computation!) For more information, please check the ParasoR wiki. Because parallel computation requires this fixed directory structure, I recommend keeping all of the directories under $HOME/prgrms.

I would greatly appreciate any feedback. Sincerely,
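On a single multi-core machine, the first step of the two-process example above could be launched from one script. This is a minimal sketch, assuming every chunk takes the same remaining arguments (the `...` above, which are not spelled out in this thread). `run_chunks` and `PARASOR_BIN` are hypothetical names; only `-i`, `-k`, and `--connect` come from the example.

```shell
#!/bin/sh
PARASOR_BIN="${PARASOR_BIN:-ParasoR}"

run_chunks() {
    # usage: run_chunks K [remaining ParasoR arguments...]
    k="$1"
    shift
    i=0
    pids=""
    # Start one background process per chunk index 0..K-1.
    while [ "$i" -lt "$k" ]; do
        "$PARASOR_BIN" -i "$i" -k "$k" "$@" &
        pids="$pids $!"
        i=$((i + 1))
    done
    # Wait for every chunk and report failure if any of them failed.
    status=0
    for pid in $pids; do
        wait "$pid" || status=1
    done
    return "$status"
}

# Example (arguments after K are passed through unchanged):
#   run_chunks 2 ... && ParasoR --connect ...
```

Because each chunk is an independent process, the same commands could equally be submitted to separate nodes instead of being run in the background on one machine.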

Carushi

heartsh commented 7 years ago

@carushi OK, I understand how to use ParasoR somewhat better now.

As for a directory for temporary files, you could use "/tmp", the standard location for temporary files and directories on Linux filesystems. (For example, I often create directories such as "/tmp/ParasoR_{timestamp}" to hold temporary files in my multithreaded programs.)

As a workaround for the parameter files, I will modify the relevant lines of "param.hh" as you showed.

I realized that the software calls one process (not thread) a "cluster". (When I saw the word "cluster" in the software, I assumed more complicated parallel computation, such as that done on a cluster of machines with, for example, Hadoop or Spark.) Now I think you intend the software to be used like SGE array jobs. (But for someone who doesn't know about parallel computation or SGE array jobs, the explanations in the software may give the wrong impression.) Anyway, I'm going to use the parallel computation.
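A sketch of the "/tmp" idea in shell, using `mktemp -d` rather than a raw timestamp so that two runs started in the same second cannot collide. `make_workdir` is a hypothetical helper, and how ParasoR would be told to use such a directory is not covered in this thread.

```shell
#!/bin/sh
make_workdir() {
    # Create a unique directory under $TMPDIR (or /tmp) whose name starts
    # with "ParasoR_", and print its path.
    mktemp -d "${TMPDIR:-/tmp}/ParasoR_XXXXXX"
}

# Typical use in a script: create the directory, remove it on exit.
#   workdir="$(make_workdir)" || exit 1
#   trap 'rm -rf "$workdir"' EXIT
```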
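If ParasoR is used like an SGE array job, the mapping from task IDs to chunks could look like the following. This is a sketch under assumptions: the job is submitted with a task range of `1-K`, SGE task IDs start at 1 while the `-i` index in the example above starts at 0, and `array_args` is a hypothetical helper, not part of ParasoR.

```shell
#!/bin/sh
#$ -S /bin/sh
#$ -cwd
#$ -t 1-2

array_args() {
    # $1: 1-based SGE task ID, $2: total number of tasks (K).
    # Prints the 0-based chunk options used in the example above.
    printf '%s\n' "-i $(( $1 - 1 )) -k $2"
}

# In the job body, with "..." standing for the remaining arguments:
#   ParasoR $(array_args "$SGE_TASK_ID" "$SGE_TASK_LAST") ...
```

The `--connect` step would then be a separate job that starts only after all array tasks have finished.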

Thank you for explaining the software.

carushi commented 7 years ago

Hi heartsh,

My assumption is that ParasoR is run repeatedly on the computed "outer" files (sorry, they are not literally "temporary") for further analyses, such as computing multiple types of expected values and simulating single point mutations. (Also, for some reason I have a problem keeping files in the tmp directory in my supercomputing environment...) But because the other files are needed only for a short time, I will think about it. Thank you for your suggestion!

In addition, I apologize for causing the confusion. I'll add more detail to the wiki about which techniques ParasoR uses for "parallel" computation. ParasoR does not require MPI or OpenMP, nor does it need to deal with mutex problems in multithreading, because the dynamic programming is completely divided into multiple tasks at the algorithm level. Thus, I hope you can use it effectively on large datasets.

Thank you again for your suggestion, and please let me know if you have any problems. Best,

carushi