VDBWRAIR / pathdiscov

Pathogen Discovery Pipeline

Diamond Output Question #254

Closed: bpfeffer closed this issue 9 years ago

bpfeffer commented 9 years ago

I'm not seeing the output from diamond, but I could be overlooking it. What should I be looking for to ensure it went through that step properly?

demis001 commented 9 years ago

Did you set up your config file to run diamond?

necrolyte2 commented 9 years ago

Since diamond is essentially a drop-in replacement for blastx, not a ton changes, but there are 3 places you can check to verify that it actually ran:

  1. input/param.txt

    Make sure you see diamond in the blast_task_list. Example:

    blast_task_list megablast,diamond
  2. You can look in iterative_blast_phylo_*/logs/*.o

    You can search through the file for `diamond blastx`; you should see a few lines where it was run

  3. iterative_blast_phylo_1/tmp_contig_X and/or iterative_blast_phylo_2/tmp_[R1|R2]_X

    In the above example you would expect the diamond temporary results to be in

    tmp_[R1|R2]_2 and/or tmp_contig_2

    Note: it is in _2 because diamond is the 2nd of 2 steps. If you have other steps it may not be _2.
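Putting those three checks together, a minimal shell sketch (run from the sample's --outdir; the input/ and results/ layout is an assumption based on the paths quoted later in this thread):

# 1. is diamond listed as a blast task?
grep blast_task_list input/param.txt

# 2. did the logs record diamond running?
grep 'diamond blastx' results/iterative_blast_phylo_*/logs/*.o

# 3. are the temporary diamond results present? (_2 because diamond is step 2 here)
ls results/iterative_blast_phylo_1/tmp_contig_2 results/iterative_blast_phylo_2/tmp_R1_2 results/iterative_blast_phylo_2/tmp_R2_2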

bpfeffer commented 9 years ago

So I tried running:

command iterative_blast_phylo

# for the "_list" settings, use a comma-delimited list with no spaces

blast_db_list               /data/db/nt,/data/db/nt #,/data/Scripts/ref/diamond/diamondnr blast db prefix
blast_task_list             megablast,dc-megablast,diamond                              # options are: megablast dc-megablast blastn, diamond

And I used --noparam after setting that; the pipeline eventually changed the file to remove the diamond entry that I added. Do I also need to adjust the db path so it's not commented out?

necrolyte2 commented 9 years ago

Ya, you will have to remove the # part of the blast_db_list line and place the # before the comment at the end, or just remove that comment.

The end result should look like:

command iterative_blast_phylo

# for the "_list" settings, use a comma-delimited list with no spaces

blast_db_list               /data/db/nt,/data/db/nt,/data/Scripts/ref/diamond/diamondnr #blast db prefix
blast_task_list             megablast,dc-megablast,diamond                              # options are: megablast dc-megablast blastn, diamond

bpfeffer commented 9 years ago

Ok, made those adjustments and it finished, but the iterative_blast_phylo_1 folder is empty (except for an empty logs folder) and the symbolic linking step failed because of that. The overall output folder has only: sample.contig.count.txt, sample.R1.count.txt, sample.R2.count.txt, qstats.txt, reads.txt, stats.txt

necrolyte2 commented 9 years ago

Seems something definitely is wrong.

After you modified the param.txt, did you remove all the results directories that may have been created from the previous run (if there were any)?

I'm trying to think of why the folder would be completely empty. Usually if there is an issue at least the log files are generated, which makes me think that the command didn't run at all.

Can you also look in the analysis.log and see if there is anything to indicate an error in there?

bpfeffer commented 9 years ago

Right, I had an empty folder except for param.txt that I edited, and used --noparam.

Here is the part that seems relevant in analysis.log:

[module] iterative_blast_phylo
[iteration] 1
[cmd] mkdir -p /data/bpfeffer/Contam/6_1_15_wrairtest/624_S1_diamond/results/iterative_blast_phylo_1/logs
[cmd] /data/Scripts/pathdiscov/pathdiscov/iterative_blast_phylo/iterative_blast_phylo.pl --sample 624_S1_diamond --paramfile 6_1_15_wrairtest/624_S1_diamond/input/param.txt --outputdir /data/bpfeffer/Contam/6_1_15_wrairtest/624_S1_diamond/results/iterative_blast_phylo_1 --logs /data/bpfeffer/Contam/6_1_15_wrairtest/624_S1_diamond/results/iterative_blast_phylo_1/logs --timestamp 20150602-11.56 --R1 /data/bpfeffer/Contam/6_1_15_wrairtest/624_S1_diamond/results/ray2_assembly_1/ray2_assembly_1.fasta --fastafile yes --run_iteration 1 --contig 1
[error] blast param lists different sizes at /data/Scripts/pathdiscov/pathdiscov/iterative_blast_phylo/iterative_blast_phylo.pl line 124.
[deltat] 0
[iterative_blast_phylo_1] Output: out_r1: /data/bpfeffer/Contam/6_1_15_wrairtest/624_S1_diamond/results/ray2_assembly_1/1.R1.unmap.fastq -- out_r2: /data/bpfeffer/Contam/6_1_15_wrairtest/624_S1_diamond/results/ray2_assembly_1/1.R2.unmap.fastq -- out_contig: /data/bpfeffer/Contam/6_1_15_wrairtest/624_S1_diamond/results/ray2_assembly_1/ray2_assembly_1.fasta

[END] 20150602-13.27
[DELTAT] 5453
Using top 4000 from unmapped read files because --blast_unassembled set
[cmd] cat 6_1_15_wrairtest/624_S1_diamond/results/ray2_assembly_1/1.R1.unmap.fastq | head -4000 > 6_1_15_wrairtest/624_S1_diamond/results/ray2_assembly_1/head.1.R1.unmap.fastq
cat: write error: Broken pipe
[deltat] 0
[cmd] cat 6_1_15_wrairtest/624_S1_diamond/results/ray2_assembly_1/1.R2.unmap.fastq | head -4000 > 6_1_15_wrairtest/624_S1_diamond/results/ray2_assembly_1/head.1.R2.unmap.fastq
cat: write error: Broken pipe
[deltat] 0
necrolyte2 commented 9 years ago

[error] blast param lists different sizes at /data/Scripts/pathdiscov/pathdiscov/iterative_blast_phylo/iterative_blast_phylo.pl line 124.

All the items in your param.txt under iterative_blast_phylo have to have the same number of comma-separated items.

It is probably the blast_options_list and ninst_list that are missing that 3rd option.
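A quick way to spot the mismatch, as a sketch assuming the param.txt format shown in this thread (one setting per line, values comma-separated, optional trailing # comment; path relative to the sample's --outdir):

# print each *_list setting and its entry count; all counts should match
awk '/_list/ { sub(/#.*/, ""); name = $1; sub(/^[^ \t]+[ \t]+/, ""); print name, split($0, a, ",") }' input/param.txt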

bpfeffer commented 9 years ago

Are those options comparable to the ones listed in the first two entries? They do both only have 2 entries right now, so I need to figure out what to put for the third option, and whether it's application-specific or just an internal setting for pathdiscov.

necrolyte2 commented 9 years ago

The blast_options_list is a comma-separated list of options that are passed directly to blast/diamond. If you used pathdiscov_cli --param to generate the param.txt, there should be a default diamond option line after the # comment. If not, you can use:

--compress 0 -p 0 -v -k 10 -w 28 --id 0.7 -c 4 -t TEMPDIR

Replacing TEMPDIR with the path to your temporary directory for diamond to use. If you have a ton of RAM, you can use /dev/shm, which is essentially your RAM.

For ninst_list, just have the same number 3 times, such as:

ninst_list    5,5,5
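For reference, a consistent block with three entries per list might then look something like this (a sketch: the megablast and dc-megablast option strings are placeholders for whatever your param.txt already has; the diamond options and paths come from earlier in this thread):

command iterative_blast_phylo
blast_db_list       /data/db/nt,/data/db/nt,/data/Scripts/ref/diamond/diamondnr
blast_task_list     megablast,dc-megablast,diamond
blast_options_list  <megablast options>,<dc-megablast options>,--compress 0 -p 0 -v -k 10 -w 28 --id 0.7 -c 4 -t /dev/shm
ninst_list          5,5,5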

bpfeffer commented 9 years ago

I made those changes (including using /dev/shm as a tempdir, since we have 1TB on that server) and ran the pipeline; it eventually modified param.txt, reverting those 4 lines back to just 2 blast searches.

necrolyte2 commented 9 years ago

So the pipeline changed your param.txt even though you did --noparam?

bpfeffer commented 9 years ago

Right, here was my command:

nohup pathdiscov_cli --noparam --outdir 624_S1_diamond --R1 06-24_S1_L001_R1_001.fastq.gz --R2 06-24_S1_L001_R2_001.fastq.gz -c 60

necrolyte2 commented 9 years ago

I'm going to run the test dataset here and see what happens

$> pathdiscov_cli --outdir 624_S1_diamond --R1 testData/F.fastq.gz --R2 testData/R.fastq.gz -c 4 --param
$> wget https://gist.githubusercontent.com/necrolyte2/64f3ba20645f589ee2aa/raw/1e1f60415ad5fd1dcf9270f6e9553d291976be11/param.txt-diamond -O 624_S1_diamond/input/param.txt
$> pathdiscov_cli --noparam --outdir 624_S1_diamond --R1 testData/F.fastq.gz --R2 testData/R.fastq.gz -c 4 --noparam

necrolyte2 commented 9 years ago

It didn't change the param.txt on my end when I ran it as above

bpfeffer commented 9 years ago

Is it maybe the placement of --noparam since you put it at the start and the end?

necrolyte2 commented 9 years ago

In theory it should not matter as we are using the argparse module

What you should be able to do is redo the run with a different --outdir; after you start the pipeline, open a new terminal and check to see if it changed param.txt. If it is changing it, that should happen within the first minute of running the pipeline.
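One low-effort way to do that from the second terminal (a sketch using standard coreutils/procps tools; the path here reuses the earlier --outdir, so substitute whatever new --outdir you pick):

# checksum now, then compare again after a minute
md5sum 624_S1_diamond/input/param.txt
# or poll it every 5 seconds
watch -n 5 md5sum 624_S1_diamond/input/param.txt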

necrolyte2 commented 9 years ago

Were you able to run the testData set yet?

bpfeffer commented 9 years ago

Both the testdata sample and my local sample did finish and ran diamond properly this time, but I noticed the phylo_1 error log has "libgomp: Thread creation failed: Resource temporarily unavailable" repeated, so that's new. Quick research seems to point to stack size; does that seem right? Ours is currently set to "stack size (kbytes, -s) 10240"

necrolyte2 commented 9 years ago

What I think is happening is that since you are specifying -c 60 too many additional threads are being spawned.

So essentially -c 60 says: you can divide certain aspects into 60 parts, so when it gets to iterative_blast_phylo_1 it will split the input files for each blast step into 60 chunks and then run that blast step on them in parallel.

The issue most likely is that --compress 0 -p 0 -v -k 10 -w 28 --id 0.7 -c 4 -t ~/tmp is the default in our config for diamond. The -p 0 says: automatically determine how many CPUs to use, which is generally whatever cat /proc/cpuinfo | grep 'processor' | wc -l returns. Also, Linux treats Intel's hyperthreading (and AMD's equivalent) as additional CPUs, so for example on our 4-core i7 CPUs that command returns 8 instead of 4. So if you have 60 CPUs, don't be surprised if it returns 120.

So let's say that command returns 60. What you end up with is iterative_blast_phylo_* spawning 60*60 threads, or 3600 threads.
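As a quick sanity check of that arithmetic (nproc is the coreutils shorthand for the cpuinfo pipeline above):

# logical CPUs, hyperthreads included
cat /proc/cpuinfo | grep 'processor' | wc -l
nproc                      # equivalent shorthand
# rough worst case with -c 60 and diamond's -p 0
echo $(( 60 * $(nproc) ))  # e.g. 60 * 60 = 3600 threads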

Poking around on my system, I got my system limits:

core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 126685
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1024
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Here I see my max user processes is set to 1024 (not sure if that was imposed by a DISA STIG or if that is the default). If that is less than 3600 on your system, it is most likely the cause of those errors. I suspect that the analysis worked fine and just did not spawn as many threads as it wanted to, but kept on going.
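A quick way to check just that limit, and to raise it for the current shell session if the hard limit allows (ulimit is a bash builtin; this is a session-level workaround, not a fix):

ulimit -u        # current max user processes (soft limit)
ulimit -u 4096   # raise it for this shell, if permitted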

The solution is for us to fix the sample.param file that is shipped and set -p 1 for diamond. I'd be open to a small debate on whether we should recommend people set -p to 1 or some other number, or even just have the pipeline enforce -p 1 even if the user were to supply -p 0.

Can you rerun that analysis but set -p 1 in your param.txt, just so we can verify? It might even run faster, as the system isn't overloaded with so many threads.
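For example, the diamond entry in your blast_options_list would then read something like this (a sketch combining -p 1 with the /dev/shm tempdir discussed earlier; adjust -t to taste):

--compress 0 -p 1 -v -k 10 -w 28 --id 0.7 -c 4 -t /dev/shm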

bpfeffer commented 9 years ago

Ah, I can see why that would be a problem: the system has 160 threads (80 cores at 2 threads per core), so that would certainly have caused an issue. Rerunning it now with -p set to 1.

bpfeffer commented 9 years ago

I ran 3 samples at the same time; the only differences were that one sample was in another folder and ran with more cores (60 instead of 10 for each of the other two).

The single 60-thread sample's iterative_blast_phylo_1 error log was full of these:

Error: the argument ('value') for option '--evalue' is invalid
cat: blastout*: No such file or directory

The two other samples had:

Error: function Input_stream::Input_stream(const string&, bool) line 74. Error opening file /data/pathdiscov_tmp/diamond_1068963805_0.tmp
Error: Error reading buffer file /data/pathdiscov_tmp/diamond_1068963805_0.tmp (1/0)
Error: Error reading buffer file /data/pathdiscov_tmp/diamond_1068963805_0.tmp (2/0)
Error: Error reading buffer file /data/pathdiscov_tmp/diamond_1068963805_0.tmp (1/0)
Error: Error reading buffer file /data/pathdiscov_tmp/diamond_1068963805_0.tmp (1/0)
Error: Error reading buffer file /data/pathdiscov_tmp/diamond_1068963805_1.tmp (16/0)
Error: function Input_stream::Input_stream(const string&, bool) line 74. Error opening file /data/pathdiscov_tmp/diamond_1068963805_0.tmp
Error: function Input_stream::Input_stream(const string&, bool) line 74. Error opening file /data/pathdiscov_tmp/diamond_1068963805_0.tmp
Error: function Input_stream::Input_stream(const string&, bool) line 74. Error opening file /data/pathdiscov_tmp/diamond_1068963805_0.tmp
Error: Entry not found in BLAST database
Error: Entry not found in BLAST database
Error: Entry not found in BLAST database

and

Error: Error reading buffer file /data/pathdiscov_tmp/diamond_1708646615_0.tmp (101/0)
Error: Error reading buffer file /data/pathdiscov_tmp/diamond_1708646615_0.tmp (42/0)
Error: Error reading buffer file /data/pathdiscov_tmp/diamond_1708646615_0.tmp (26/0)
Error: Error reading buffer file /data/pathdiscov_tmp/diamond_1708646615_0.tmp (25/0)
Error: Error reading buffer file /data/pathdiscov_tmp/diamond_1708646615_2.tmp (2/0)
Error: function Input_stream::Input_stream(const string&, bool) line 74. Error opening file /data/pathdiscov_tmp/diamond_1708646615_0.tmp
Error: function Input_stream::Input_stream(const string&, bool) line 74. Error opening file /data/pathdiscov_tmp/diamond_1708646615_0.tmp
Error: function Input_stream::Input_stream(const string&, bool) line 74. Error opening file /data/pathdiscov_tmp/diamond_1708646615_0.tmp
Error: function Input_stream::Input_stream(const string&, bool) line 74. Error opening file /data/pathdiscov_tmp/diamond_1708646615_0.tmp
Error: function Input_stream::Input_stream(const string&, bool) line 74. Error opening file /data/pathdiscov_tmp/diamond_1708646615_0.tmp

necrolyte2 commented 9 years ago

First sample

Error: the argument ('value') for option '--evalue' is invalid

The evalue option for both blastn and diamond is pulled from param.txt. That makes it seem like maybe you had a misconfiguration in your param.txt for that sample, which is why blast/diamond did not run to create the blastout files.

Second/Third samples

I'm not sure why it was decided for the pipeline to split standard error and standard output into two files like this to put into the logs. It makes it so difficult to figure out where the issues come from. I can only assume that the buffer errors are being spit out by diamond.

Can you give any insight about /data/pathdiscov_tmp? That filesystem didn't fill up, did it? Are there other diamond files in there, or is it void of `diamond*.tmp` files?

You could try removing the -t /data/pathdiscov_tmp from your param.txt. This will tell diamond to just use RAM to do its temp file work. Since you have 1TB of RAM it should not be an issue.
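A couple of quick checks along those lines (paths taken from the log excerpts above):

df -h /data                               # did the filesystem holding the tempdir fill up?
ls -l /data/pathdiscov_tmp/diamond_*.tmp  # any diamond temp files left behind?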

bpfeffer commented 9 years ago

I found the error in the first sample's param.txt, so that's fine. I reran the second sample using -p 1 and without the -t folder so it would use RAM; it still took a while (8+ hours). I then adjusted the getorf setting (at Mike's recommendation) from 60 to 300 and got it down to a much faster time (80 minutes).

The question now is: are any of the results from diamond reflected in the output folder results? I wasn't able to find anything different there from the run without diamond; the only changes were some extra files and a folder in the iterative_blast_phylo_1 folder.

necrolyte2 commented 9 years ago

You should not see any difference in output files, as diamond is made to be an almost 100% drop-in replacement. The only way to get a good gut feeling that it actually ran is by inspecting iterative_blast_phylo_*/logs/*.o

In there you should see that it ran diamond.

Example from one of our test runs:

diamond blastx  --compress 0 -p 0 -v -k 10 -w 28 --id 0.7 -c 4 -t ~/tmp -q  tmpsplit000 -d /home/AMED/tyghe.vallard/databases/diamond/diamondnr  -o blastout000

Then, since I know from our param.txt that diamond was the 3rd iteration:

blast_task_list             megablast,dc-megablast,diamond

I can look in tmp_contig_3 and find blastout000, where diamond dumped the output.
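So a spot check for a run like that might be (paths per the example above; the _3 suffix matches diamond being the 3rd task):

grep 'diamond blastx' iterative_blast_phylo_1/logs/*.o  # confirm the invocation was logged
head iterative_blast_phylo_1/tmp_contig_3/blastout000   # peek at diamond's raw tabular output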

necrolyte2 commented 9 years ago

@bpfeffer were you ever able to verify that diamond ran?

I realize now that you should also find a directory called orf_filter inside the tmp_contig_3 directory. In this directory you should see

bpfeffer commented 9 years ago

I did verify that diamond ran, but it seems as though setting the getorf value to 300 caused it to not have anything useful. The previous version I ran with the default of 60 did have some results, and the aux report for 60 did have some results that were from diamond.