Closed by SZLux 6 years ago.
Thanks to a suggestion from @mwalker174, I solved the issue: I had to specify, for each input and output file, the full path from the root. All the HPC nodes share the same HDFS, so it worked.
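In other words, something like this (the `/home/user/pathseq` prefix is just a placeholder for wherever your files actually live; the point is that every file argument is an absolute path reachable from all nodes):

```bash
# Same command as in the original post, but with absolute placeholder paths.
gatk PathSeqPipelineSpark \
    --spark-master spark://XX.XX.XX.XX:7077 \
    --input /home/user/pathseq/test_sample.bam \
    --filter-bwa-image /home/user/pathseq/hg19mini.fasta.img \
    --kmer-file /home/user/pathseq/hg19mini.hss \
    --min-clipped-read-length 70 \
    --microbe-fasta /home/user/pathseq/e_coli_k12.fasta \
    --microbe-bwa-image /home/user/pathseq/e_coli_k12.fasta.img \
    --taxonomy-file /home/user/pathseq/e_coli_k12.db \
    --output /home/user/pathseq/output.pathseq.bam \
    --scores-output /home/user/pathseq/output.pathseq.txt \
    -- --spark-runner SPARK
```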
What leaves me a bit perplexed is that the task took 5 minutes to run on a single 16-core node, and exactly the same time on a master-worker setup with 5 workers of 16 cores each (the log confirms that the tasks were distributed to the workers' IP addresses). Is this expected? Should I try a larger input to see a difference in performance? Thank you!
Thanks for posting this in a new ticket, @SZLux.
I'm not surprised that 1 vs. 5 nodes made no difference on the tutorial files. The 5 minutes is likely dominated by the overhead of setting up the Spark cluster (though that does sound a bit long). I would expect you to see better scaling with larger microbe/host references and samples.
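If you want to quantify the scaling, here is a rough sketch of a timing loop. The `/shared/pathseq` directory and the larger input/reference file names are placeholders, and I'm assuming flags after the `--` separator are forwarded to `spark-submit`, as in the documented GATK Spark examples; `--total-executor-cores` is the `spark-submit` flag that caps total cores on a standalone cluster:

```bash
# Rough scaling test (sketch): time the pipeline while capping the total
# number of cores the standalone Spark cluster may use. DATA and the
# larger input/reference file names are placeholders.
DATA=/shared/pathseq   # absolute paths, visible from every node
for cores in 16 32 80; do
    echo "=== total-executor-cores=$cores ==="
    time gatk PathSeqPipelineSpark \
        --spark-master spark://XX.XX.XX.XX:7077 \
        --input "$DATA/large_sample.bam" \
        --filter-bwa-image "$DATA/host.fasta.img" \
        --kmer-file "$DATA/host.hss" \
        --min-clipped-read-length 70 \
        --microbe-fasta "$DATA/microbes.fasta" \
        --microbe-bwa-image "$DATA/microbes.fasta.img" \
        --taxonomy-file "$DATA/microbes.db" \
        --output "$DATA/output.$cores.pathseq.bam" \
        --scores-output "$DATA/output.$cores.pathseq.txt" \
        -- --spark-runner SPARK --total-executor-cores "$cores"
done
```

If the wall-clock time barely moves as the core cap grows, the run is dominated by fixed startup cost rather than the distributed work itself.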
Hi, I am running PathSeqPipelineSpark on an HPC Spark cluster with a master and several workers.
I downloaded Spark 2.2.0 with Hadoop 2.7.3; Java is 1.8.0_131. I set the Java classpath (I think correctly).
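For reference, the environment setup was roughly the following (the install paths below are illustrative, not the exact ones on my cluster):

```bash
# Illustrative environment setup; actual install locations will differ.
export JAVA_HOME=/usr/lib/jvm/java-1.8.0           # Java 1.8.0_131
export SPARK_HOME=/opt/spark-2.2.0-bin-hadoop2.7   # Spark 2.2.0 prebuilt for Hadoop 2.7.3
export PATH="$SPARK_HOME/bin:$JAVA_HOME/bin:$PATH"
```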
The command runs fine without the --spark-master option, so the files are in the right place, but when I run the following command line:
```bash
gatk PathSeqPipelineSpark \
    --spark-master spark://XX.XX.XX.XX:7077 \
    --input test_sample.bam \
    --filter-bwa-image hg19mini.fasta.img \
    --kmer-file hg19mini.hss \
    --min-clipped-read-length 70 \
    --microbe-fasta e_coli_k12.fasta \
    --microbe-bwa-image e_coli_k12.fasta.img \
    --taxonomy-file e_coli_k12.db \
    --output output.pathseq.bam \
    --verbosity DEBUG \
    --scores-output output.pathseq.txt \
    -- --spark-runner SPARK
```
I get the following error:
Thank you.
Full log: