gotgenes closed this issue 12 years ago.
Ah, yes, that's correct. Normally I use
$(command)
but that gives problems after commenting out with heredocs, so I removed the $.
I'll make the other change you suggested as well.
Thanks again. If you have any trouble with installation, let me know. I'd like that to be as smooth as possible.
Actually, there's something going awry still in that for loop for me. The $start and $end values are identical, I wind up submitting a tremendous number of jobs, and all are failing in the R call, in which I'm getting

Error in Ops.factor(df$SampleID, data$ID) :
  level sets of factors are different
Calls: stopifnot -> Ops.factor
Execution halted
@gotgenes could you check out the latest revision? That was another case of me pulling out the parens.
That does balance out the $start and $end issue.
By the way, the here-documents are unbalanced at this point, since <<NORMALIZE is commented out (but not its matching end string). Also, is there a way to only run the portion denoted in a here-document? I wasn't sure what you meant by
Ah, yes, that's correct. normally, I use
$(command)
Yeah, I just usually comment and uncomment the heredocs. I just submitted a revision, so you can do:
bash run.sh FIT
to run that section. It might be a bit more pleasant to use. Mostly I just want run.sh to demonstrate the workflow.
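The shape of that dispatch is roughly the following (a minimal sketch, not the real run.sh; section names and the default are illustrative):

```shell
#!/bin/bash
# Hypothetical sketch of run.sh: run only the section named in $1,
# e.g. `bash run.sh FIT`. Defaults to FIT here so the demo is runnable.
SECTION=${1:-FIT}

if [ "$SECTION" = "FIT" ]; then
    echo "running FIT section"
    # ... FIT commands would go here ...
fi

if [ "$SECTION" = "COMB" ]; then
    echo "running COMB section"
    # ... COMB commands would go here ...
fi
```

This replaces the comment/uncomment-the-heredoc dance with an explicit argument, at the cost of one `if` per section.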
Re $(command): I meant the equivalence of
files=`ls`
and
files=$(ls)
except that the latter sometimes causes problems within heredocs like <<NORMALIZE.
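In other words, the two substitution forms produce the same result; a quick self-contained check (using echo instead of ls so the output is reproducible):

```shell
# Both forms capture a command's stdout; $() is the POSIX form and nests
# more cleanly, but its parentheses can trip up heredoc-based commenting.
files_bt=`echo a.txt b.txt`    # legacy backtick substitution
files_ps=$(echo a.txt b.txt)   # $() substitution

[ "$files_bt" = "$files_ps" ] && echo "identical: $files_ps"
```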
In the FIT section, I'm still stuck with
Error in Ops.factor(df$SampleID, data$ID) :
level sets of factors are different
Calls: stopifnot -> Ops.factor
Execution halted
OK, there's an errant space in the CSV file. If you remove it, then the matching should go OK. I added a quick perl command within FIT to remove it.
Thanks, Brent.
Another question: do the fit.lm.R jobs really consume 7GB of RAM each?
They shouldn't, but I was doing that as a quick way to limit the number of jobs per node. If there are too many, it hits the disk pretty hard and things grind to a halt. I'm open to a better way to limit.
It looks like they only take a little over 200 MB of RAM, actually, on our x86_64 cluster.
Also, can I suggest
perl -pi -e 's/,\s/,/' data/natgen2009.csv
instead? Without the comma in the pattern, re-running it would put all the data on one line. Actually, maybe this line should go into GET, so it really will be run only once.
I'm open to a better way to limit.
I'm running this on a Sun (Oracle) Grid Engine cluster with qsub instead of LSF with bsub. The way we have to limit the jobs is to use a flag -l ul1=<number> (or user_limit1 instead of ul1). We each get 1000 user limit "shares", so setting -l ul1=10 gets a maximum of 100 jobs going. I'm not sure what the bsub equivalent is, but the documentation suggests that when we had LSF there was some sort of "'limited' queue" used to keep too many jobs from running. Not quite sure what that means.
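For reference, the submission flags described above would sit at the top of the script something like this (a sketch only; the ul1 resource is specific to our site's configuration, and the task range mirrors the combp[4-9] example later in this thread):

```shell
#$ -t 4-9        # Grid Engine array job, tasks 4..9 (LSF: #BSUB -J combp[4-9])
#$ -l ul1=10     # consume 10 of our 1000 user-limit "shares" per task,
                 # capping us at 100 concurrent jobs (site-specific resource)

# Grid Engine exposes the array index as $SGE_TASK_ID,
# where LSF uses $LSB_JOBINDEX.
echo "running task $SGE_TASK_ID"
```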
Well, I've gotten everything running up to COMB, and I can get COMB to start executing (thanks, I learned about job arrays in Grid Engine; pretty nifty), but now comb-p is breaking. Here is snipped output and a traceback from one of the tasks, task 8:

wrote: data/quantile/fake/fake.acf.txt
ACF:
# https://chart.googleapis.com/chart?cht=bvs&&chd=t:0.211,0.199&chs=124x250&chco=224499&chxt=x,y&chxl=0:|1-41|41-81&chxr=1,0,0.23&chm=N,000000,0,-1,12&chbh=42,6,12&chds=a
#lag_min lag_max correlation N
1        41      0.2112      1989543
41       81      0.1992      1941869
BAD: [  3.40900000e-04   3.95800000e-07   5.97400000e-05   1.17700000e-04
        1.14500000e-08   3.86300000e-10   3.48000000e-17]
[[ 1.          0.21123621  0.21123621  0.19921738  0.          0.          0.        ]
 [ 0.21123621  1.          0.21123621  0.21123621  0.19921738  0.          0.        ]
 [ 0.21123621  0.21123621  1.          0.21123621  0.21123621  0.19921738  0.        ]
 [ 0.19921738  0.21123621  0.21123621  1.          0.21123621  0.21123621  0.19921738]
 [ 0.          0.19921738  0.21123621  0.21123621  1.          0.21123621  0.21123621]
 [ 0.          0.          0.19921738  0.21123621  0.21123621  1.          0.21123621]
 [ 0.          0.          0.          0.19921738  0.21123621  0.21123621  1.        ]]
... SNIP ...
wrote: data/quantile/fake/fake.slk.bed
Traceback (most recent call last):
File "/home/lashercd/.virtualenvs/test/bin/comb-p", line 7, in <module>
execfile(__file__)
File "/home/lashercd/tmp/combined-pvalues/cpv/comb-p", line 150, in <module>
main()
File "/home/lashercd/tmp/combined-pvalues/cpv/comb-p", line 31, in main
_pipeline()
File "/home/lashercd/tmp/combined-pvalues/cpv/comb-p", line 120, in _pipeline
for bh, l in fdr.fdr(args.prefix + ".slk.bed", -1, 0.05):
File "/home/lashercd/tmp/combined-pvalues/cpv/../cpv/fdr.py", line 18, in fdr
pvals = np.array([b["p"] for b in bediter(fbed_file, col_num)],
File "/home/lashercd/tmp/combined-pvalues/cpv/../cpv/_common.py", line 37, in bediter
yield {"chrom": l[0], "start": int(l[1]), "end": int(l[2]),
ValueError: invalid literal for int() with base 10: '0.5001'
All the other tasks' logs looked similar, except that task 4 did not have any BAD: lines.
Can you show the output of:
head data/quantile/fake/fake.slk.bed
head data/quantile/fake/fake.fdr.bed
Sure thing
$ head data/quantile/fake/fake.slk.bed
chr1 794763 794814 0.3978 0.3978
chr1 795051 795102 0.4948 0.4948
chr1 795195 795246 0.3619 0.5198
chr1 795231 795282 0.2418 0.5198
chr1 795273 795324 0.4947 0.5133
chr1 795309 795360 0.7655 0.5307
chr1 795375 795426 0.574 0.6842
chr1 795410 795461 0.6201 0.6706
chr1 795452 795503 0.5894 0.9242
chr1 795509 795560 0.5 0.8353
$ head data/quantile/fake/fake.fdr.bed
(no output)
I found a line in data/quantile/fake/fake.slk.bed that starts with 8753 instead of a chromosome label and has only 3 elements: 8753, 0.5001, and 0.07498.
I suppose there are other lines like that, too:
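A quick way to flag every such line is to print rows whose first field is not a chromosome label (demonstrated below on two fabricated rows; the real file is data/quantile/fake/fake.slk.bed):

```shell
# Fabricated sample: one well-formed row, one corrupted row.
printf 'chr1\t794763\t794814\t0.3978\t0.3978\n8753\t0.5001\t0.07498\n' > /tmp/slk.bed

# Print lines whose first column does not start with "chr":
awk '$1 !~ /^chr/' /tmp/slk.bed
```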
$ cut -f 1 fake.slk.bed | uniq
chr1
chr10
chr11
chr12
chr13
chr14
chr15
chr16
chr17
chr18
chr19
chr2
chr20
chr21
chr22
chr3
chr4
chr5
chr6
chr7
chr8
chr9
chrX
chrY
8753
chrX
chrY
05
chrX
chrY
Oh, maybe you had multiple jobs running that were writing to the same file?
What does this give:
wc -l data/fit/*.bed | grep -v 8001
$ wc -l data/fit/*.bed | grep -v 8001
2162677 total
I did have all the comb-p tasks running at once. Are they all writing to the same outfile?
Oh, is LSB_JOBINDEX not working? This line at the top
#BSUB -J combp[4-9]
shouldn't let you have a "fake". I just put that there for the 0-index.
Your $COL should be p.disease, p.tissue, etc. Maybe your queueing system isn't setting $LSB_JOBINDEX.
Oh gosh, I forgot to replace all instances of $LSB_JOBINDEX with $SGE_TASK_ID. Let me try that again. I completely missed the
COL=${COLS[$LSB_JOBINDEX]}
line.
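For anyone else porting between the two schedulers, the substitution boils down to something like this minimal sketch (the column names are illustrative, not the real list):

```shell
# Simulate what Grid Engine would export for task 2 of an array job:
export SGE_TASK_ID=2

# Use whichever task-index variable the scheduler set (0 as a fallback):
TASK_ID=${SGE_TASK_ID:-${LSB_JOBINDEX:-0}}

COLS=(p.fake p.disease p.tissue)   # illustrative column names
COL=${COLS[$TASK_ID]}
echo "COL=$COL"
```

With this fallback, the same script body runs under either LSF or Grid Engine without editing every $LSB_JOBINDEX by hand.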
All the jobs ran; thanks! One question I have: the cluster reported a maximum memory usage of between about 60 GB and 70 GB per comb-p run. Does that amount sound consistent with your experience?
No! Those comb-p jobs should take maybe 5GB tops. Is your pvalues.bed file about 2.16 million rows?
It is, yes:
$ wc -l data/pvalues.bed
2162407 data/pvalues.bed
I re-ran one of the tasks manually on a large machine, and in the end the comb-p main process was consuming about 3.5 GB of memory. I did notice it spawn a lot of sub-processes at points using the multiprocessing module, and each of those seemed to consume about as much memory as the main comb-p process, but that was less than 1 GB per process. I only asked for 12 cores for each comb-p task in the Grid Engine jobs, so they shouldn't have consumed more than 12 GB per task in total. I don't know what happened, but I'll just assume that Grid Engine did a poor job of tracking the consumed memory in this case.
Interesting. It could be something with the spawned processes. Let me know if you have any more suggestions or questions.
The line
should I think be
It gives syntax errors in bash regarding the opening parenthesis.
The sub-directory
data/fit
does not exist and should be created under the line starting with <<FIT
, i.e.,