brian-cleary / LatentStrainAnalysis

Partitioning and analysis methods for large, complex sequence datasets
MIT License

Analyzing large datasets without LSF #9

Open yxxue opened 8 years ago

yxxue commented 8 years ago

Hi, thanks for sharing the code. We have read your paper; excellent work. We hope to use your methods to analyze our metagenomic datasets, but we are facing some challenges. We only have 5 metagenomic samples, but each of them is quite big (Illumina HiSeq, ~40 GB). I installed all the packages and ran the test data successfully. At first I tried to run the analysis following the demo scripts; it seems to work and is still running, but it's really slow: the first step alone, 'create_hash', took 3 days. I would like to use a parallel method like LSF, but our cluster doesn't support it, so we just run the program directly. Could you advise how to run our large dataset faster and more efficiently without LSF? (I think our cluster has enough CPUs, memory and storage for high-performance computing.)

brian-cleary commented 8 years ago

Hi,

Does your cluster support any sort of distributed computing, perhaps with Grid Engine or some alternative? If so, you should be able to just change the job submission scripts to fit your environment.
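For example, if your scheduler is Grid Engine, the change mostly comes down to swapping the array-job submission line and the task-index variable. Very rough sketch only; the script name, log directory, and task count below are placeholders, not the repo's exact interface:

    # LSF-style array submission (the task script reads its index from $LSB_JOBINDEX)
    bsub -J "hash[1-50]" -o Logs/hash.%I.out ./hash_task.sh

    # rough Grid Engine equivalent (the task script reads $SGE_TASK_ID instead)
    qsub -t 1-50 -o Logs/ ./hash_task.sh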

On the other hand, if you're only running on single instances, then I would stick with the same code as used in the test data (it is specifically designed to run on one machine). You'll need to change a few things to account for the difference in size from the test data: (1) adjust the params for hash size, cluster threshold, etc., according to the docs for running large data; (2) increase the number of cores used to however many you have available on your machine.
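For (2), one generic way to use more cores without any scheduler is to push the per-task commands through xargs (or GNU parallel). This is only a sketch: the script name, flags, and task count are placeholders, so take the real per-task command from the corresponding .sh file:

    NUM_TASKS=50    # however many array tasks the corresponding job script defines
    NUM_CORES=32    # match the cores available on your machine
    # run the (placeholder) per-task command for every task index, NUM_CORES at a time
    seq 1 "$NUM_TASKS" | xargs -P "$NUM_CORES" -I{} \
        python pipeline_step.py -t {} -n "$NUM_TASKS"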

I hope this helps. Please let me know if you have any more questions!


yxxue commented 8 years ago

Hi, thanks for your suggestions. Sorry, we don't have any alternatives, as not many people use the cluster. I have already finished HashCounting.sh and KmerSVDClustering.sh; after create_hash, the remaining steps ran fast, and now I'm running ReadPartitioning.sh. I found that write_partition_parts.py needs a huge amount of storage: it has been running for 3 days and the tmp folder is already 1 TB. I'm not sure how much storage it will need; how can I estimate it? If it keeps running I may have to kill the process, because we only have 500 GB of storage left. Or are there ways to reduce the size of the tmp folder?
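Right now I'm only estimating by watching the folder grow, roughly like the snippet below (just a crude extrapolation from the growth rate, nothing specific to write_partition_parts.py; the tmp/ path is whatever the partitioning step writes to):

    # log the size of the tmp folder once an hour, in GB
    while true; do
        echo "$(date +%s) $(du -s --block-size=1G tmp/ | cut -f1)" >> tmp_growth.log
        sleep 3600
    done
    # growth rate (GB per hour) x expected remaining runtime = rough extra space needed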