brentp / combined-pvalues

combining p-values using modified stouffer-liptak for spatially correlated results (probes)
MIT License

More trouble with examples/charm/run.sh #9

Closed: gotgenes closed this issue 12 years ago

gotgenes commented 12 years ago

The line

for i in (awk -v cols=$NCOL -v step=$STEP 'BEGIN{for(i=2;i<cols;i+=step){print i }}'); do

should, I think, be

for i in `awk -v cols=$NCOL -v step=$STEP 'BEGIN{for(i=2;i<cols;i+=step){print i }}'`; do

It gives syntax errors in bash regarding the opening parenthesis.
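For what it's worth, the $(...) form of command substitution that comes up later in this thread is equivalent and avoids the backticks:

for i in $(awk -v cols=$NCOL -v step=$STEP 'BEGIN{for(i=2;i<cols;i+=step){print i}}'); do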

The sub-directory data/fit does not exist and should be created below the line starting with <<FIT, i.e.,

<<FIT
if [ ! -d data/fit ]; then
    mkdir data/fit
fi
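
(The same check-and-create can also be written as a single line; mkdir -p is a no-op if the directory already exists:)

mkdir -p data/fit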
brentp commented 12 years ago

Ah, yes, that's correct. Normally, I use

$(command)

but that causes problems when sections are commented out with heredocs, so I removed the $.

I'll make the other change you suggested as well.

brentp commented 12 years ago

Thanks again. If you have any trouble with installation, let me know; I'd like that to be as smooth as possible.

gotgenes commented 12 years ago

Actually, something is still going awry in that for loop for me. The $start and $end values are identical, so I wind up submitting a tremendous number of jobs, and all of them fail in the R call with

Error in Ops.factor(df$SampleID, data$ID) :
  level sets of factors are different
Calls: stopifnot -> Ops.factor
Execution halted
brentp commented 12 years ago

@gotgenes could you check out the latest revision? That was another case of me pulling out the parens.

gotgenes commented 12 years ago

That does fix the $start and $end issue.

By the way, the here-documents are unbalanced at this point, since <<NORMALIZE is commented out (but not its matching end string). Also, is there a way to run only the portion delimited by a here-document? I wasn't sure what you meant by

Ah, yes, that's correct. normally, I use

$(command)
brentp commented 12 years ago

Yeah, I usually just comment and uncomment the heredocs. I've just submitted a revision so you can do:

bash run.sh FIT 

to run that section. It might be a bit more pleasant to use. Mostly I just want the run.sh to demonstrate the work-flow.
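
Roughly, the idea is to check the first argument around each section, something like this (a sketch only; the actual run.sh layout may differ slightly):

SECTION=$1                      # e.g. bash run.sh FIT
if [ "$SECTION" = "FIT" ]; then
    # ... model-fitting commands for the FIT section go here ...
    echo "running FIT section"
fi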

Re $(command): I meant the equivalence of

files=`ls`

and

files=$(ls)

except the latter sometimes causes problems within heredocs like <<NORMALIZE.
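
To spell out the heredoc issue: with an unquoted delimiter, bash still performs command substitution inside the here-document body, so a $(command) gets executed even when the block is only being used as a comment. One common workaround (a sketch, not what run.sh currently does) is to quote the delimiter so the body is taken literally:

# unquoted delimiter: the $(ls) below still runs, even though the block is never executed
<<NORMALIZE
files=$(ls)
NORMALIZE

# quoted delimiter: nothing in the body is expanded or executed
<<'NORMALIZE'
files=$(ls)
NORMALIZE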

gotgenes commented 12 years ago

In the FIT section, I'm still stuck with

Error in Ops.factor(df$SampleID, data$ID) :
  level sets of factors are different
Calls: stopifnot -> Ops.factor
Execution halted
brentp commented 12 years ago

OK, there's an errant space in the csv file. If you remove it, the matching should go fine. I added a quick perl command within FIT to remove it.

gotgenes commented 12 years ago

Thanks, Brent.

Another question: do the fit.lm.R jobs really consume 7GB of RAM each?

brentp commented 12 years ago

They shouldn't, but I was doing that as a quick way to limit the number of jobs per node. If there are too many, it hits the disk pretty hard and things grind to a halt. I'm open to a better way to limit.

gotgenes commented 12 years ago

It looks like they only take a little over 200 MB of RAM, actually, on our x86_64 cluster.

Also, can I suggest

perl -pi -e 's/,\s/,/' data/natgen2009.csv

instead? Without matching on the commas, re-running it would put all the data on one line. Actually, maybe this line should go into the GET section, so it really is run only once.

I'm open to a better way to limit.

I'm running this on a Sun (Oracle) Grid Engine cluster with qsub instead of LSF with bsub. The way we have to limit jobs is to use the flag -l ul1=<number> (user_limit1 also works in place of ul1). We each get 1000 user-limit "shares", so setting -l ul1=10 allows a maximum of 100 jobs running at once. I'm not sure what the bsub equivalent is; the documentation suggests that when we had LSF there was some sort of "limited" queue used to keep too many jobs from running, but I'm not quite sure what that means.
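
(To make the arithmetic above concrete, with a hypothetical job script fit.sh:)

qsub -l ul1=10 fit.sh    # 1000 shares / 10 shares per job = at most 100 jobs running at once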

gotgenes commented 12 years ago

Well, I've gotten everything running up to COMB, and I can get COMB to start executing (thanks, I learned about job arrays in Grid Engine; pretty nifty), but now comb-p is breaking. Here is snipped output and a traceback from one of the tasks, task 8:

COMB
wrote: data/quantile/fake/fake.acf.txt
ACF:
# https://chart.googleapis.com/chart?cht=bvs&&chd=t:0.211,0.199&chs=124x250&chco=224499&chxt=x,y&chxl=0:|1-41|41-81&chxr=1,0,0.23&chm=N,000000,0,-1,12&chbh=42,6,12&chds=a
#lag_min        lag_max correlation     N
1       41      0.2112  1989543
41      81      0.1992  1941869

BAD: [  3.40900000e-04   3.95800000e-07   5.97400000e-05   1.17700000e-04
   1.14500000e-08   3.86300000e-10   3.48000000e-17]
[[ 1.          0.21123621  0.21123621  0.19921738  0.          0.          0.        ]
 [ 0.21123621  1.          0.21123621  0.21123621  0.19921738  0.          0.        ]
 [ 0.21123621  0.21123621  1.          0.21123621  0.21123621  0.19921738  0.        ]
 [ 0.19921738  0.21123621  0.21123621  1.          0.21123621  0.21123621  0.19921738]
 [ 0.          0.19921738  0.21123621  0.21123621  1.          0.21123621  0.21123621]
 [ 0.          0.          0.19921738  0.21123621  0.21123621  1.          0.21123621]
 [ 0.          0.          0.          0.19921738  0.21123621  0.21123621  1.        ]]
... SNIP ...
wrote: data/quantile/fake/fake.slk.bed
Traceback (most recent call last):
  File "/home/lashercd/.virtualenvs/test/bin/comb-p", line 7, in <module>
    execfile(__file__)
  File "/home/lashercd/tmp/combined-pvalues/cpv/comb-p", line 150, in <module>
    main()
  File "/home/lashercd/tmp/combined-pvalues/cpv/comb-p", line 31, in main
    _pipeline()
  File "/home/lashercd/tmp/combined-pvalues/cpv/comb-p", line 120, in _pipeline
    for bh, l in fdr.fdr(args.prefix + ".slk.bed", -1, 0.05):
  File "/home/lashercd/tmp/combined-pvalues/cpv/../cpv/fdr.py", line 18, in fdr
    pvals = np.array([b["p"] for b in bediter(fbed_file, col_num)],
  File "/home/lashercd/tmp/combined-pvalues/cpv/../cpv/_common.py", line 37, in bediter
    yield  {"chrom": l[0], "start": int(l[1]), "end": int(l[2]),
ValueError: invalid literal for int() with base 10: '0.5001'

All the other tasks' logs looked similar, except that task 4 did not have any BAD: lines.

brentp commented 12 years ago

Can you show the output of:

head data/quantile/fake/fake.slk.bed
head data/quantile/fake/fake.fdr.bed
gotgenes commented 12 years ago

Sure thing

$ head data/quantile/fake/fake.slk.bed
chr1    794763  794814  0.3978  0.3978
chr1    795051  795102  0.4948  0.4948
chr1    795195  795246  0.3619  0.5198
chr1    795231  795282  0.2418  0.5198
chr1    795273  795324  0.4947  0.5133
chr1    795309  795360  0.7655  0.5307
chr1    795375  795426  0.574   0.6842
chr1    795410  795461  0.6201  0.6706
chr1    795452  795503  0.5894  0.9242
chr1    795509  795560  0.5     0.8353
$ head data/quantile/fake/fake.fdr.bed
[EMPTY]

I found a line in data/quantile/fake/fake.slk.bed that starts with 8753 instead of a chromosome label and has only three fields: 8753, 0.5001, and 0.07498.

I suppose there are other lines like that, too:

$ cut -f 1 fake.slk.bed | uniq
chr1
chr10
chr11
chr12
chr13
chr14
chr15
chr16
chr17
chr18
chr19
chr2
chr20
chr21
chr22
chr3
chr4
chr5
chr6
chr7
chr8
chr9
chrX
chrY
8753
chrX
chrY
05
chrX
chrY
brentp commented 12 years ago

Oh, maybe you had multiple jobs running that were writing to the same file?

brentp commented 12 years ago

What does this give:

wc -l data/fit/*.bed | grep -v 8001
gotgenes commented 12 years ago
$ wc -l data/fit/*.bed | grep -v 8001
  2162677 total

I did have all the comb-p tasks running at once. Are they all writing to the same outfile?

brentp commented 12 years ago

Oh, is LSB_JOBINDEX not working? This line at the top

#BSUB -J combp[4-9]

shouldn't let you get a "fake"; I just put that there for the 0-index. Your $COL should be p.disease, p.tissue, etc.

Maybe your queuing system isn't setting $LSB_JOBINDEX.

gotgenes commented 12 years ago

Oh gosh, I forgot to replace all instances of $LSB_JOBINDEX with $SGE_TASK_ID. Let me try that again. I completely missed the

COL=${COLS[$LSB_JOBINDEX]}

line.
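
(For anyone else on Grid Engine, a rough sketch of the translation; the column names here are placeholders except p.disease and p.tissue, and the real list lives in run.sh:)

#$ -t 4-9                                  # Grid Engine job array, analogous to #BSUB -J combp[4-9]
COLS=(fake skip skip skip p.disease p.tissue p.col6 p.col7 p.col8 p.col9)
COL=${COLS[$SGE_TASK_ID]}                  # $SGE_TASK_ID plays the role of $LSB_JOBINDEX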

gotgenes commented 12 years ago

All the jobs ran; thanks! One question: the cluster reported a maximum memory usage of about 60 to 70 GB per comb-p run. Does that sound consistent with your experience?

brentp commented 12 years ago

No! Those comb-p jobs should take maybe 5GB tops. Is your pvalues.bed file about 2.16 million rows?

gotgenes commented 12 years ago

No! Those comb-p jobs should take maybe 5GB tops. Is your pvalues.bed file about 2.16 million rows?

It is, yes:

$ wc -l data/pvalues.bed
2162407 data/pvalues.bed
gotgenes commented 12 years ago

I re-ran one of the tasks manually on a large machine, and in the end the main comb-p process was consuming about 3.5 GB of memory. I did notice it spawn a lot of sub-processes at points via the multiprocessing module, and each of those seemed to consume about as much memory as the main comb-p process, but that was less than 1 GB per process. I only asked for 12 cores per comb-p task in the Grid Engine jobs, so they shouldn't have consumed more than about 12 GB per task in total. I don't know what happened, but I'll just assume that Grid Engine did a poor job of tracking the consumed memory in this case.

brentp commented 12 years ago

Interesting. It could be something with the generated processes. Let me know if you have any more suggestions or questions.