PacificBiosciences / FALCON

FALCON: experimental PacBio diploid assembler -- Out-of-date -- Please use a binary release: https://github.com/PacificBiosciences/FALCON_unzip/wiki/Binaries

/bin/bash: bad interpreter: Text file busy #269

Closed yfu closed 8 years ago

yfu commented 8 years ago

I have run fc_run.py several times with the same config file and the same input files, but every time it returns the same error: "/bin/bash: bad interpreter: Text file busy". It also seems to be non-deterministic: different job numbers fail on each run, all with the same error message.

I have attached part of the log below:

['/data/fuy2/FALCON-integrate/fc_env/bin/fc_run.py', 'fc_run.cfg', 'logging.ini']
2016-01-05 22:35:03,352 - fc_run - INFO - fc_run started with configuration fc_run.cfg
...
2016-01-05 22:35:07,704 - fc_run - WARNING - Call 'bash /data/fuy2/cl/results/2016-01-05-falcon/ecoli-test/0-rawreads/job_0000/rj_0000.sh 1> /data/fuy2/cl/results/2016-01-05-falcon/ecoli-test/0-rawreads/job_0000/rj_0000.sh.log 2>&1' returned 32256.
2016-01-05 22:35:07,704 - fc_run - ERROR - Contents of '/data/fuy2/cl/results/2016-01-05-falcon/ecoli-test/0-rawreads/job_0000/rj_0000.sh.log':
cd /data/fuy2/cl/results/2016-01-05-falcon/ecoli-test/0-rawreads/job_0000
+ cd /data/fuy2/cl/results/2016-01-05-falcon/ecoli-test/0-rawreads/job_0000
trap 'touch /data/fuy2/cl/results/2016-01-05-falcon/ecoli-test/0-rawreads/job_0000/job_0000_done.exit' EXIT
+ trap 'touch /data/fuy2/cl/results/2016-01-05-falcon/ecoli-test/0-rawreads/job_0000/job_0000_done.exit' EXIT
ls -il rj_0000.sub.sh
+ ls -il rj_0000.sub.sh
772313525 -rw-r--r-- 1 fuy2 UNIXLDAPUsers 3294 Jan  5 17:35 rj_0000.sub.sh
hostname
+ hostname
ummsres17
chmod +x rj_0000.sub.sh
+ chmod +x rj_0000.sub.sh
touch rj_0000.sub.sh
+ touch rj_0000.sub.sh
ls -il rj_0000.sub.sh
+ ls -il rj_0000.sub.sh
772313525 -rwxr-xr-x 1 fuy2 UNIXLDAPUsers 3294 Jan  5 17:35 rj_0000.sub.sh
time ./rj_0000.sub.sh
+ ./rj_0000.sub.sh
/data/fuy2/cl/results/2016-01-05-falcon/ecoli-test/0-rawreads/job_0000/rj_0000.sh: ./rj_0000.sub.sh: /bin/bash: bad interpreter: Text file busy
touch /data/fuy2/cl/results/2016-01-05-falcon/ecoli-test/0-rawreads/job_0000/job_0000_done.exit
+ touch /data/fuy2/cl/results/2016-01-05-falcon/ecoli-test/0-rawreads/job_0000/job_0000_done.exit
Traceback (most recent call last):
  File "/data/fuy2/FALCON-integrate/fc_env/lib/python2.7/site-packages/falcon_kit-0.4.0-py2.7-linux-x86_64.egg/falcon_kit/mains/run.py", line 66, in _run_script_local
    system(cmd, check=True)
  File "/data/fuy2/FALCON-integrate/fc_env/lib/python2.7/site-packages/falcon_kit-0.4.0-py2.7-linux-x86_64.egg/falcon_kit/mains/run.py", line 23, in system
    raise Exception(msg)
Exception: Call 'bash /data/fuy2/cl/results/2016-01-05-falcon/ecoli-test/0-rawreads/job_0000/rj_0000.sh 1> /data/fuy2/cl/results/2016-01-05-falcon/ecoli-test/0-rawreads/job_0000/rj_0000.sh.log 2>&1' returned 32256.
2016-01-05 22:35:07,720 - fc_run - WARNING - Call 'bash /data/fuy2/cl/results/2016-01-05-falcon/ecoli-test/0-rawreads/job_0006/rj_0006.sh 1> /data/fuy2/cl/results/2016-01-05-falcon/ecoli-test/0-rawreads/job_0006/rj_0006.sh.log 2>&1' returned 32256.
2016-01-05 22:35:07,721 - fc_run - ERROR - Contents of '/data/fuy2/cl/results/2016-01-05-falcon/ecoli-test/0-rawreads/job_0006/rj_0006.sh.log':
cd /data/fuy2/cl/results/2016-01-05-falcon/ecoli-test/0-rawreads/job_0006
+ cd /data/fuy2/cl/results/2016-01-05-falcon/ecoli-test/0-rawreads/job_0006
trap 'touch /data/fuy2/cl/results/2016-01-05-falcon/ecoli-test/0-rawreads/job_0006/job_0006_done.exit' EXIT
+ trap 'touch /data/fuy2/cl/results/2016-01-05-falcon/ecoli-test/0-rawreads/job_0006/job_0006_done.exit' EXIT
ls -il rj_0006.sub.sh
+ ls -il rj_0006.sub.sh
772313536 -rw-r--r-- 1 fuy2 UNIXLDAPUsers 4804 Jan  5 17:35 rj_0006.sub.sh
hostname
+ hostname
ummsres17
chmod +x rj_0006.sub.sh
+ chmod +x rj_0006.sub.sh
touch rj_0006.sub.sh
+ touch rj_0006.sub.sh
ls -il rj_0006.sub.sh
+ ls -il rj_0006.sub.sh
772313536 -rwxr-xr-x 1 fuy2 UNIXLDAPUsers 4804 Jan  5 17:35 rj_0006.sub.sh
time ./rj_0006.sub.sh
+ ./rj_0006.sub.sh
/data/fuy2/cl/results/2016-01-05-falcon/ecoli-test/0-rawreads/job_0006/rj_0006.sh: ./rj_0006.sub.sh: /bin/bash: bad interpreter: Text file busy
pb-cdunn commented 8 years ago

This happens when we run a script with a #! line immediately after modifying (or initially generating) that script. It's related to NFS caching.
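As a minimal sketch of the failure mode and the workaround (demo paths only, not FALCON's actual wrapper): the wrapper writes a sub-script, chmods it, then executes it via its #! line; over NFS the freshly written file can still look "open" to the kernel, so the exec fails with ETXTBSY ("Text file busy"). Invoking the interpreter explicitly skips the shebang exec entirely, so it cannot hit ETXTBSY:

```shell
# Demo of the two invocation styles (local filesystem, so the race itself
# won't reproduce here; this just shows the difference in how the script
# is launched).
workdir=$(mktemp -d)
cat > "$workdir/sub.sh" <<'EOF'
#!/bin/bash
echo "sub-script ran"
EOF
chmod +x "$workdir/sub.sh"

# ./sub.sh would exec via the '#!' line and can fail with ETXTBSY on NFS;
# 'bash sub.sh' reads the file as data, so no ETXTBSY is possible.
bash "$workdir/sub.sh"
```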

I used to see it, but in our case I've found a way to force the cache to sync first. I don't see it anymore myself.

I haven't found a truly reliable work-around yet. If we run via bash script.sh instead of ./script.sh (with a shebang), then top lists the executable as bash rather than as the script name, so it's hard to tell which script belongs to which process number.

Could you tell me the commits you are using?

cd FALCON-integrate
git rev-parse HEAD
cd FALCON
git rev-parse HEAD
yfu commented 8 years ago

Thanks for the reply and for formatting the log!

Here are the versions:

(fc_env)$ cd FALCON-integrate/
(fc_env)$ git rev-parse HEAD
cd9e9373a9f897bc429ecf820809c6d773ee5c44
(fc_env)fuy2@ummsres17:$ cd FALCON
(fc_env)fuy2@ummsres17:$ git rev-parse HEAD
5942bc00d509e1b530c986a41eb13dc87b8fe5bb

I also noticed that if I set -s in pa_DBsplit_option to a large value, this error does not appear, but with a small value it does. Maybe it has something to do with the number of concurrent jobs?

pb-cdunn commented 8 years ago

Job concurrency will impact when NFS decides to sync. You're using NFS, right?

Another "solution" is to run in /tmp, with job_type=local. Then, filesystem caching is not an issue.
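For anyone trying this workaround, a minimal fc_run.cfg fragment might look like the sketch below. job_type and pa_DBsplit_option are the options named in this thread; the other keys and values are illustrative placeholders, not a tested configuration:

```ini
[General]
# run everything as local processes, with the working directory under /tmp,
# so NFS client caching never sees the generated scripts
job_type = local
input_fofn = input.fofn
pa_DBsplit_option = -x500 -s200
```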

I can work on this tonight. If I come up with a possible solution, will you test it for us on your system?

pb-cdunn commented 8 years ago

Try the tip of master for FALCON. If this works for you, I'll update FALCON-integrate. Otherwise, I'll revert.

Also, please try with avoid_text_file_busy=false. It is possible that I've entirely fixed the problem for you. It should be quick and easy for you to test, and it would help me a bunch.

yfu commented 8 years ago

I pulled the latest commit by going to the FALCON dir and running git pull origin master, and I put the whole folder containing the data and config file in /tmp. When I ran the program, it still reported an error:

Traceback (most recent call last):
  File "/data/fuy2/FALCON-integrate/fc_env/lib/python2.7/site-packages/pypeflow-0.1.1-py2.7.egg/pypeflow/task.py", line 341, in __call__
    return self.runInThisThread(*argv, **kwargv)
  File "/data/fuy2/FALCON-integrate/fc_env/lib/python2.7/site-packages/pypeflow-0.1.1-py2.7.egg/pypeflow/task.py", line 362, in runInThisThread
    self.run(*argv, **kwargv)
  File "/data/fuy2/FALCON-integrate/fc_env/lib/python2.7/site-packages/pypeflow-0.1.1-py2.7.egg/pypeflow/task.py", line 276, in run
    rtn = self._runTask(self, *argv, **kwargv)
  File "/data/fuy2/FALCON-integrate/fc_env/lib/python2.7/site-packages/pypeflow-0.1.1-py2.7.egg/pypeflow/task.py", line 190, in _runTask
    return self._taskFun(self)
  File "/data/fuy2/FALCON-integrate/fc_env/lib/python2.7/site-packages/falcon_kit-0.4.0-py2.7-linux-x86_64.egg/falcon_kit/mains/run.py", line 181, in task_build_rdb
    support.build_rdb(**args)
  File "/data/fuy2/FALCON-integrate/fc_env/lib/python2.7/site-packages/falcon_kit-0.4.0-py2.7-linux-x86_64.egg/falcon_kit/run_support.py", line 362, in build_rdb
    bash.write_script_and_wrapper(script, script_fn, job_done)
  File "/data/fuy2/FALCON-integrate/fc_env/lib/python2.7/site-packages/falcon_kit-0.4.0-py2.7-linux-x86_64.egg/falcon_kit/bash.py", line 47, in write_script_and_wrapper
    make_executable(script)
  File "/data/fuy2/FALCON-integrate/fc_env/lib/python2.7/site-packages/falcon_kit-0.4.0-py2.7-linux-x86_64.egg/falcon_kit/bash.py", line 17, in make_executable
    mode = os.stat(path).st_mode
OSError: [Errno 2] No such file or directory: 'fasta2DB -v raw_reads -f/tmp/2016-01-05-falcon/0-rawreads/input.fofn\n\nLB=$(cat raw_reads.db | awk \'$1 == "blocks" {print $3}\')\nHPCdaligner -v -dal4 -t16 -h1 -e.70 -l1000 -s1000 -M8 -H500 raw_reads 22-$LB >| /tmp/2016-01-05-falcon/0-rawreads/run_jobs.sh\n'

In the end, I changed line 47 of src/py/bash.py from make_executable(script) to make_executable(os.path.join(wdir, sub_script_bfn)) to fix this but then a new error occurred:

Exception in thread Thread-2:
Traceback (most recent call last):
  File "/home/fuy2/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/home/fuy2/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/data/fuy2/FALCON-integrate/fc_env/lib/python2.7/site-packages/pypeflow-0.1.1-py2.7.egg/pypeflow/task.py", line 341, in __call__
    return self.runInThisThread(*argv, **kwargv)
  File "/data/fuy2/FALCON-integrate/fc_env/lib/python2.7/site-packages/pypeflow-0.1.1-py2.7.egg/pypeflow/task.py", line 362, in runInThisThread
    self.run(*argv, **kwargv)
  File "/data/fuy2/FALCON-integrate/fc_env/lib/python2.7/site-packages/pypeflow-0.1.1-py2.7.egg/pypeflow/task.py", line 276, in run
    rtn = self._runTask(self, *argv, **kwargv)
  File "/data/fuy2/FALCON-integrate/fc_env/lib/python2.7/site-packages/pypeflow-0.1.1-py2.7.egg/pypeflow/task.py", line 190, in _runTask
    return self._taskFun(self)
  File "/data/fuy2/FALCON-integrate/fc_env/lib/python2.7/site-packages/falcon_kit-0.4.0-py2.7-linux-x86_64.egg/falcon_kit/mains/run.py", line 183, in task_build_rdb
    job_type=config['job_type'], sge_option=sge_option_da)
  File "/data/fuy2/FALCON-integrate/fc_env/lib/python2.7/site-packages/falcon_kit-0.4.0-py2.7-linux-x86_64.egg/falcon_kit/mains/run.py", line 144, in run_script_and_wait_and_rm_exit
    job_type, sge_option)
  File "/data/fuy2/FALCON-integrate/fc_env/lib/python2.7/site-packages/falcon_kit-0.4.0-py2.7-linux-x86_64.egg/falcon_kit/mains/run.py", line 137, in run_script_and_wait
    run_script(job_data, job_type)
  File "/data/fuy2/FALCON-integrate/fc_env/lib/python2.7/site-packages/falcon_kit-0.4.0-py2.7-linux-x86_64.egg/falcon_kit/mains/run.py", line 96, in run_script
    rc = _run_script(job_data)
  File "/data/fuy2/FALCON-integrate/fc_env/lib/python2.7/site-packages/falcon_kit-0.4.0-py2.7-linux-x86_64.egg/falcon_kit/mains/run.py", line 66, in _run_script_local
    system(cmd, check=True)
  File "/data/fuy2/FALCON-integrate/fc_env/lib/python2.7/site-packages/falcon_kit-0.4.0-py2.7-linux-x86_64.egg/falcon_kit/mains/run.py", line 23, in system
    raise Exception(msg)
Exception: Call 'bash /tmp/2016-01-05-falcon/0-rawreads/prepare_rdb.sh 1> /tmp/2016-01-05-falcon/0-rawreads/prepare_rdb.sh.log 2>&1' returned 256.

It seems that prepare_rdb.sh repeatedly tries to create the raw_reads.db file, which causes fasta2DB to complain:

+ fasta2DB -v raw_reads -f/tmp/2016-01-05-falcon/0-rawreads/input.fofn
fasta2DB: File pb.test.fasta is already in database raw_reads.db

This error did not occur when I was using the default FALCON in FALCON-integrate. So... could this be a bug in the latest FALCON?

yfu commented 8 years ago

I forgot to mention that NFS is installed on the machine, but all the files (FALCON-integrate, data, and config files) are on local storage. Not sure whether this affects the concurrency problem...

yfu commented 8 years ago

I was able to run FALCON-integrate successfully by editing the bash.py file in the default FALCON (5942bc0) in FALCON-integrate: changing line 48 from time ./{sub_script_bfn} to time bash ./{sub_script_bfn}. So... this confirms your thought: it is the difference between running ./a.sh and bash ./a.sh.

It is likely that the latest FALCON has problems with repeatedly running prepare_rdb.sh.

pb-cdunn commented 8 years ago

Sorry. I didn't have time to test before going home last night, but I'm glad you saw the basic idea. Yes, running bash script instead of ./script always solves the problem, but that's not the solution I wanted, since it then makes "bash" the program name, which makes different processes hard to distinguish.

I had hoped you could try the program with chmod applied by the Python code rather than by the bash wrapper. My theory is that, even though the Python script closes the filehandle for the generated script (since open() is in a with block), NFS caching somehow causes the sub-process not to see the release. So when the wrapper script runs chmod, the underlying script is still considered "open". But I cannot test this theory unless the problem recurs reproducibly, so I was really hoping you could test that.
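The chmod-from-Python idea can be sketched as below. The make_executable helper mirrors the one referenced above in falcon_kit/bash.py; the temp-file code around it is illustrative, not FALCON's actual script generation:

```python
import os
import stat
import tempfile

def make_executable(path):
    """Add the executable bits, like 'chmod +x', from the writing process.

    Doing this in the same Python process that generated the script means
    the mode change happens before any wrapper shell ever touches the file,
    instead of a separate chmod racing against NFS cache invalidation.
    """
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)

# Generate a script; the with-block guarantees the filehandle is closed
# before we chmod or run it.
with tempfile.NamedTemporaryFile('w', suffix='.sh', delete=False) as f:
    f.write('#!/bin/bash\necho hi\n')
    script = f.name
make_executable(script)
```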

pb-cdunn commented 8 years ago

Your problem with prepare_rdb.sh is probably only because of the way the previous run failed. If you rm -rf 0-*/ 1-*/ 2-*/, that should not happen.

Could you please try the latest HEAD with avoid_text_file_busy=false. I'd like to make that the default.

yfu commented 8 years ago

I just tried the latest HEAD and I did delete the previous failed run, but the same error still appeared:

Traceback (most recent call last):
  File "/data/fuy2/FALCON-integrate/fc_env/lib/python2.7/site-packages/pypeflow-0.1.1-py2.7.egg/pypeflow/task.py", line 341, in __call__
    return self.runInThisThread(*argv, **kwargv)
  File "/data/fuy2/FALCON-integrate/fc_env/lib/python2.7/site-packages/pypeflow-0.1.1-py2.7.egg/pypeflow/task.py", line 362, in runInThisThread
    self.run(*argv, **kwargv)
  File "/data/fuy2/FALCON-integrate/fc_env/lib/python2.7/site-packages/pypeflow-0.1.1-py2.7.egg/pypeflow/task.py", line 276, in run
    rtn = self._runTask(self, *argv, **kwargv)
  File "/data/fuy2/FALCON-integrate/fc_env/lib/python2.7/site-packages/pypeflow-0.1.1-py2.7.egg/pypeflow/task.py", line 190, in _runTask
    return self._taskFun(self)
  File "/data/fuy2/FALCON-integrate/fc_env/lib/python2.7/site-packages/falcon_kit-0.4.0-py2.7-linux-x86_64.egg/falcon_kit/mains/run.py", line 183, in task_build_rdb
    job_type=config['job_type'], sge_option=sge_option_da)
  File "/data/fuy2/FALCON-integrate/fc_env/lib/python2.7/site-packages/falcon_kit-0.4.0-py2.7-linux-x86_64.egg/falcon_kit/mains/run.py", line 144, in run_script_and_wait_and_rm_exit
    job_type, sge_option)
  File "/data/fuy2/FALCON-integrate/fc_env/lib/python2.7/site-packages/falcon_kit-0.4.0-py2.7-linux-x86_64.egg/falcon_kit/mains/run.py", line 137, in run_script_and_wait
    run_script(job_data, job_type)
  File "/data/fuy2/FALCON-integrate/fc_env/lib/python2.7/site-packages/falcon_kit-0.4.0-py2.7-linux-x86_64.egg/falcon_kit/mains/run.py", line 96, in run_script
    rc = _run_script(job_data)
  File "/data/fuy2/FALCON-integrate/fc_env/lib/python2.7/site-packages/falcon_kit-0.4.0-py2.7-linux-x86_64.egg/falcon_kit/mains/run.py", line 66, in _run_script_local
    system(cmd, check=True)
  File "/data/fuy2/FALCON-integrate/fc_env/lib/python2.7/site-packages/falcon_kit-0.4.0-py2.7-linux-x86_64.egg/falcon_kit/mains/run.py", line 23, in system
    raise Exception(msg)
Exception: Call 'bash /tmp/2016-01-05-falcon/0-rawreads/prepare_rdb.sh 1> /tmp/2016-01-05-falcon/0-rawreads/prepare_rdb.sh.log 2>&1' returned 256.

I also tried the latest HEAD with avoid_text_file_busy=false and the same error with prepare_rdb.sh happened...

yfu commented 8 years ago

Oops... It is actually a different error in 0-rawreads/prepare_rdb.sh.log:

HPCdaligner -v -dal4 -t16 -h1 -e.70 -l1000 -s1000 -M8 -H500 raw_reads 22-$LB >| /tmp/2016-01-05-falcon/0-rawreads/run_jobs.sh
+ HPCdaligner -v -dal4 -t16 -h1 -e.70 -l1000 -s1000 -M8 -H500 raw_reads 22-
HPCdaligner: range 22-0 is empty or out of bounds

Maybe I set the -s option in pa_DBsplit_option too small?

pb-cdunn commented 8 years ago

Hmm. Could you run DBstats 0-rawreads/raw_reads? Also cat 0-rawreads/raw_reads.db.

(Btw, I currently have a core-dump from DBstats, which I'm investigating... Ah, divide-by-zero. Unrelated bug.)

yfu commented 8 years ago

Maybe I did something wrong last time: I found a raw_reads.db file in the same parent dir as 0-rawreads, 1-preads_ovl, and 2-asm-falcon. After I deleted raw_reads.db, I tried again with avoid_text_file_busy=true, and now it works! You rock! :-)

However, with avoid_text_file_busy=false, FALCON still hits the same bad interpreter: Text file busy problem.

Here is the DBstats output when avoid_text_file_busy=false:

$ DBstats 0-rawreads/raw_reads

Statistics for all wells of length 500 bases or more

          4,897 reads        out of           5,000  ( 97.9%)
     43,349,865 base pairs   out of      43,379,531  ( 99.9%)

          8,852 average read length
          6,226 standard deviation

  Base composition: 0.319(A) 0.175(C) 0.185(G) 0.320(T)

  Distribution of Read Lengths (Bin size = 1,000)

        Bin:      Count  % Reads  % Bases     Average
     35,000:          2      0.0      0.2       35737
     34,000:          2      0.1      0.3       35155
     33,000:          1      0.1      0.4       34738
     32,000:          2      0.1      0.6       34130
     31,000:          8      0.3      1.1       32753
     30,000:          5      0.4      1.5       32252
     29,000:          8      0.6      2.0       31482
     28,000:          5      0.7      2.4       31035
     27,000:          8      0.8      2.9       30322
     26,000:         14      1.1      3.7       29341
     25,000:         16      1.4      4.7       28454
     24,000:         26      2.0      6.1       27360
     23,000:         24      2.5      7.4       26576
     22,000:         41      3.3      9.5       25542
     21,000:         38      4.1     11.4       24755
     20,000:         40      4.9     13.3       24043
     19,000:         57      6.1     15.9       23170
     18,000:         82      7.7     19.4       22140
     17,000:         98      9.7     23.3       21184
     16,000:        156     12.9     29.2       20017
     15,000:        181     16.6     35.7       19009
     14,000:        243     21.6     43.8       17966
     13,000:        230     26.3     51.0       17170
     12,000:        238     31.1     57.8       16442
     11,000:        218     35.6     63.6       15821
     10,000:        213     39.9     68.8       15241
      9,000:        214     44.3     73.5       14676
      8,000:        235     49.1     78.1       14071
      7,000:        274     54.7     82.8       13396
      6,000:        266     60.1     86.8       12773
      5,000:        279     65.8     90.3       12143
      4,000:        288     71.7     93.3       11515
      3,000:        331     78.5     95.9       10820
      2,000:        366     86.0     98.0       10094
      1,000:        485     95.9     99.6        9201
          0:        203    100.0    100.0        8852
pb-cdunn commented 8 years ago

Ok, I'll let true be the default (sadly). Thanks for testing.