benedictpaten / marginAlign

UCSC Nanopore
MIT License

RuntimeError: Got failed jobs #40

Closed flokraft85 closed 5 years ago

flokraft85 commented 5 years ago

Hi, after starting marginAlign and letting it run for several hours, I got this error message, but I could not work out what it means.

mStart --logInfo --stats --maxThreads=28 --minimap2
Logging set at level: INFO
Logging to file: None
Logging set at level: INFO
Logging to file: None
The job tree appears to already exist, so we'll reload it
Written the config file
Setting up the thread pool with 28 threads given the max threads 28 and the max cpus 9223372036854775807
Using the single machine batch system
Reloaded the jobtree
Written the environment for the jobs to the environment file
Got parameters,rescue jobs frequency: 5400.0 max job duration: 9.22337203685e+18
Checked batch system has no running jobs and no updated jobs
Found 1 jobs to start and 0 parent jobs with children to run
Starting the main loop
The job seems to have left a log file, indicating failure: /home/flo/marginAlign/jobTree/jobs/job
Reporting file: /home/flo/marginAlign/jobTree/jobs/log.txt
log.txt: ---JOBTREE SLAVE OUTPUT LOG---
log.txt: Parsed arguments and set up logging
log.txt: Traceback (most recent call last):
log.txt: File "/home/flo/marginAlign/submodules/jobTree/src/jobTreeSlave.py", line 271, in main
log.txt: defaultMemory=defaultMemory, defaultCpu=defaultCpu, depth=depth)
log.txt: File "/home/flo/marginAlign/submodules/jobTree/scriptTree/stack.py", line 153, in execute
log.txt: self.target.run()
log.txt: File "/home/flo/marginAlign/submodules/jobTree/scriptTree/target.py", line 197, in run
log.txt: func(*((self,) + tuple(self.args)), **self.kwargs)
log.txt: File "/home/flo/marginAlign/src/margin/marginAlignLib.py", line 355, in realignSamFile3TargetFn
log.txt: sum(map(lambda (type, length) : length if type in (0,1,4) else 0, aR.cigar))
log.txt: AssertionError
log.txt: Exiting the slave because of a failed job on host ONT-Workstation
log.txt: Due to failure we are reducing the remaining retry count of job /home/flo/marginAlign/jobTree/jobs/job to 0
log.txt: We have set the default memory of the failed job to 2147483648 bytes
Job: /home/flo/marginAlign/jobTree/jobs/job is completely failed
Only failed jobs and their dependents (1 total) are remaining, so exiting.
Finished the main loop
Waiting for stats collator process to finish
Stats finished collating in 0.486446142197 seconds
Traceback (most recent call last):
File "./src/margin/marginAlign.py", line 109, in main()
File "/home/flo/marginAlign/src/margin/marginAlign.py", line 105, in main
raise RuntimeError("Got failed jobs")
RuntimeError: Got failed jobs
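
For readers following the traceback: the failing line in marginAlignLib.py sums the lengths of CIGAR operations 0, 1 and 4 (M, I and S in the BAM numeric encoding), which are operations that consume read bases. The log shows only that part of the assertion, so what the sum is compared against is not visible here. The sketch below assumes, purely for illustration, that it is compared with a read length; the function names are hypothetical and are not marginAlign's own code.

```python
# Operation codes taken from the lambda in the traceback (BAM numeric encoding):
# 0 = M (match/mismatch), 1 = I (insertion), 4 = S (soft clip).
QUERY_CONSUMING_OPS = (0, 1, 4)


def query_length_from_cigar(cigar):
    """Sum the lengths of the CIGAR operations that consume read bases."""
    return sum(length for op, length in cigar if op in QUERY_CONSUMING_OPS)


def cigar_matches_read(read_length, cigar):
    """Hypothetical check (an assumption): does the CIGAR cover the whole read?"""
    return read_length == query_length_from_cigar(cigar)


# 5S 90M 3I 2D 2S: the deletion (op 2) does not consume read bases.
soft_clipped = [(4, 5), (0, 90), (1, 3), (2, 2), (4, 2)]
print(query_length_from_cigar(soft_clipped))   # 100
print(cigar_matches_read(100, soft_clipped))   # True

# Same alignment with the leading clip recorded as a hard clip (op 5):
# the sum drops to 95, so a comparison against a 100-base read fails.
hard_clipped = [(5, 5), (0, 90), (1, 3), (2, 2), (4, 2)]
print(cigar_matches_read(100, hard_clipped))   # False
```

The hard-clip case only shows one way a check of this shape can fail; nothing in the log confirms that it is the cause here.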

Can anyone see what the problem is and how I can solve it? Thanks in advance!

Best, Florian

mitenjain commented 5 years ago

Hello,

Sorry for the delay. Are you deleting the jobTree folder between runs? If you don't, the new run will pick up the previous jobTree, try to continue from where it left off (i.e. where it crashed), and crash again. As a quick test, you could try running marginAlign with the --minimap2 --noChain --noRealign options (this runs it as a stock minimap2 alignment).
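
Clearing the leftover jobTree folder before a fresh run can be scripted. This is a minimal sketch; the helper name clear_job_tree is hypothetical, and the folder location has to be whatever was used for the crashed run (the log above shows /home/flo/marginAlign/jobTree).

```python
import shutil
from pathlib import Path


def clear_job_tree(job_tree_dir):
    """Delete a leftover jobTree directory.

    Returns True if a directory was removed, False if there was nothing to remove.
    """
    path = Path(job_tree_dir)
    if path.is_dir():
        # Removes the directory and everything inside it, so double-check the path.
        shutil.rmtree(path)
        return True
    return False
```

Calling clear_job_tree("/home/flo/marginAlign/jobTree") before restarting marginAlign would make the next run start from scratch instead of reloading the failed job.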

Could you send me a working example dataset so I can replicate the issue at our end?

Thank you.

herrroaa commented 5 years ago

Hi,

I had the exact same error after running my script for 12 hours. Did you manage to fix it?

Thanks

mitenjain commented 5 years ago

Hi both,

Is this happening when using chaining and realigning? If the test using --minimap2 --noChain --noRealign finishes, then this would be a chaining issue (combined with the need for a trained R9.4 model) for realigning. I will put up a new realignment model in a couple of days.

Another quick test: what is your reference file? If you created it yourself, you may want to check whether it has a newline character at the end.
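
The trailing-newline check can be done without loading a large reference into memory. A minimal sketch follows; the function name ends_with_newline and the example file name are hypothetical.

```python
import os


def ends_with_newline(path):
    """Return True if the file is non-empty and its last byte is a newline."""
    with open(path, "rb") as handle:
        handle.seek(0, os.SEEK_END)
        if handle.tell() == 0:
            # An empty file has no final newline.
            return False
        # Step back one byte from the end and read only that byte.
        handle.seek(-1, os.SEEK_END)
        return handle.read(1) == b"\n"
```

For example, ends_with_newline("reference.fa") returns False for a FASTA file whose last sequence line was written without a line ending.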

If you can send me a small example dataset that reproduces this error at our end, it will help.

Thank you.

flokraft85 commented 5 years ago

Hi Miten, sorry for the delayed response; I was out of the lab for a few days. I use the hg19 reference file from UCSC. Yes, it did happen when using realignment. I will test the --noChain --noRealign options today. After deleting the old jobTree folder, it processed the alignment until the realignment job, and then I got this error:

Got message from job at time: 1532681815.85 : Going to realign sam file: /home/flo/marginAlign/jobTree/jobs/gTD1/tmp_n9dWqSD6Tx/temp.sam to create output sam file: /media/flo/SSD/SMN1/BC01_margin.sam with match gamma 0.5 and gap gamma 0.0 and model /home/flo/marginAlign/src/margin/mappers/last_hmm_20.txt
Reissued any over long jobs
Rescued any (long) missing jobs
Reissued any over long jobs
Rescued any (long) missing jobs
Reissued any over long jobs
Rescued any (long) missing jobs
Reissued any over long jobs
Rescued any (long) missing jobs
Reissued any over long jobs
Rescued any (long) missing jobs
Reissued any over long jobs
Rescued any (long) missing jobs
Reissued any over long jobs
Rescued any (long) missing jobs
Reissued any over long jobs
Rescued any (long) missing jobs
Reissued any over long jobs
Rescued any (long) missing jobs
Reissued any over long jobs
Rescued any (long) missing jobs
Reissued any over long jobs
Rescued any (long) missing jobs
Reissued any over long jobs
Rescued any (long) missing jobs
Reissued any over long jobs
Rescued any (long) missing jobs
Reissued any over long jobs
Rescued any (long) missing jobs
Reissued any over long jobs
Rescued any (long) missing jobs
...

And when I tried a second time, I got the error from the first post. I put the file I used in this alignment here: https://gigamove.rz.rwth-aachen.de/d/id/TzLWLMDEcb5F4c

Thanks in advance for any help!

mitenjain commented 5 years ago

Hi Florian,

Did the --noChain --noRealign options work?

-Miten