Hi @rstrahan! Thanks for dropping in with the issue. Accessing BAM data from S3 is thornier than I'd like, and we've got an action item to write this up for the 0.23.0 release (see #1643). I'll probably take this on tonight, since you're asking here. To get you unstuck, here's a set of pointers. From https://github.com/bigdatagenomics/mango/issues/311, you're on EMR, so you should be on a pretty up-to-date version of Hadoop, and your IAM roles should be configured properly. What you'll need to do is:

1. Use the `s3a` scheme, instead of the `s3` scheme. This is the latest version of the S3 file access code path in Hadoop.
2. Attach a JAR that provides an NIO file system provider for the `s3a` scheme. I'm not familiar with how EMR attaches JARs, but if you were using `spark-submit`/`adam-submit`, you could do `--packages net.fnothaft:jsr203-s3a`.
3. You may also need `com.amazonaws:aws-java-sdk-pom:1.10.34` and `org.apache.hadoop:hadoop-aws:2.7.4`. I don't think you'll need these on EMR, though.

I'll clean this up further tonight and bundle this in our docs. Let me know if you run into any further issues!
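If you're constructing the Spark session yourself rather than launching through `spark-submit`, a rough equivalent is to set `spark.jars.packages` before the context starts. This is a minimal sketch, not a tested recipe: the `jsr203-s3a` version shown is illustrative, and the setting is only honored at session creation time.

```scala
import org.apache.spark.sql.SparkSession

// Sketch: pull in the s3a NIO provider and S3 client JARs at session
// creation. On EMR you would more likely pass these via --packages on
// spark-submit/adam-submit instead. The jsr203-s3a version below is
// illustrative, not a pinned recommendation.
val spark = SparkSession.builder()
  .appName("adam-s3a")
  .config("spark.jars.packages",
    "net.fnothaft:jsr203-s3a:1.0.1," +
    "com.amazonaws:aws-java-sdk-pom:1.10.34," +
    "org.apache.hadoop:hadoop-aws:2.7.4")
  .getOrCreate()
```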
Hey @fnothaft,
Thanks for taking the time to look at this issue. I actually followed your instructions above, except that I used Maven for the JAR dependencies. I took these steps:
sparkSession.sparkContext.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
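A slightly fuller sketch of that same s3a wiring follows; the keys are standard Hadoop s3a properties, and the credential lines are an assumption that should be unnecessary on EMR, where IAM instance-profile credentials are picked up automatically.

```scala
// Sketch of the s3a Hadoop configuration. The explicit credential
// settings are illustrative; on EMR with proper IAM roles they
// should not be needed.
val hadoopConf = sparkSession.sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoopConf.set("fs.s3a.access.key", sys.env.getOrElse("AWS_ACCESS_KEY_ID", ""))
hadoopConf.set("fs.s3a.secret.key", sys.env.getOrElse("AWS_SECRET_ACCESS_KEY", ""))
```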
However, it looks like I hit the same issue with `s3a` on EMR (I can't verify locally at this time):
java.nio.file.ProviderNotFoundException: Provider "s3a" not found
at java.nio.file.FileSystems.newFileSystem(FileSystems.java:341)
at org.seqdoop.hadoop_bam.util.NIOFileUtil.asPath(NIOFileUtil.java:40)
at org.seqdoop.hadoop_bam.BAMRecordReader.initialize(BAMRecordReader.java:140)
at org.seqdoop.hadoop_bam.BAMInputFormat.createRecordReader(BAMInputFormat.java:121)
at org.seqdoop.hadoop_bam.AnySAMInputFormat.createRecordReader(AnySAMInputFormat.java:190)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.liftedTree1$1(NewHadoopRDD.scala:180)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:179)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:134)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
@rstrahan @dmmiller612 I've been using the `s3a` protocol on various versions of EMR without any issues, and without any additional dependencies or configuration, although I'm using `adam-shell` via ssh and ADAM interactively in Zeppelin, not `adam-submit`.
@heuermh you've been doing `s3a` with Parquet, no? This issue is specific to loading BAM, CRAM, or VCF, where the NIO provider is needed for Hadoop-BAM to access the BAI, CRAI, or TBI.
@dmmiller612 let me look into this a bit more. The TL;DR is that it's something to do with classpath/classloader configuration; I've run into this on other distros similar to EMR.
> @heuermh you've been doing `s3a` with Parquet, no? This issue is specific to loading BAM, CRAM, or VCF, where the NIO provider is needed for Hadoop-BAM to access the BAI, CRAI, or TBI.
I haven't been attempting to access the index directly (e.g. using `ADAMContext.loadIndexedBam`), so perhaps I've missed the issue. I have a working session this afternoon with the Be The Match folks and will look into it.
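For concreteness, an indexed read that would touch the `.bai` directly looks roughly like this; the region and path are made up, and this sketch assumes the `loadIndexedBam` overload that takes a single `ReferenceRegion`:

```scala
import org.bdgenomics.adam.models.ReferenceRegion
import org.bdgenomics.adam.rdd.ADAMContext._

// Hypothetical indexed read: restricting to a region forces Hadoop-BAM
// to consult the .bai index next to the BAM.
val region = ReferenceRegion("1", 100000L, 200000L)
val reads = sc.loadIndexedBam("s3a://bucket-name/something.bam", region)
```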
All Hadoop-BAM BAM reads create an NIO provider to try to find the index, even if you aren't using the index to filter intervals.
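A quick way to see the classloader side of this is to list the NIO providers that are actually installed. This diagnostic sketch assumes you run it in the same JVM where the lookup fails (i.e. on the executors, for example inside a `mapPartitions`):

```scala
import java.nio.file.spi.FileSystemProvider
import scala.collection.JavaConverters._

// List the NIO file system providers visible to this JVM. If "s3a"
// is missing from the schemes printed here, Hadoop-BAM's index lookup
// will fail with ProviderNotFoundException, as in the trace above.
FileSystemProvider.installedProviders.asScala
  .map(_.getScheme)
  .foreach(println)
```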
@fnothaft I'll continue to look as well. I simplified my search and am just using `adam.loadAlignments("s3a://bucket-name/something.bam")`, and I can still reproduce the issue.
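Spelled out as a self-contained `adam-shell` snippet (the import provides the implicit conversion that adds `loadAlignments` to the `SparkContext`; the bucket name is a placeholder):

```scala
import org.bdgenomics.adam.rdd.ADAMContext._

// Minimal reproduction: forcing an action makes the executors open
// the BAM, which is where Hadoop-BAM's NIO index lookup happens.
val reads = sc.loadAlignments("s3a://bucket-name/something.bam")
println(reads.rdd.count())
```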
Also worth mentioning that I am using Spark 2.2. I don't think that matters, but I wanted to give a bit of context in case it does.
@dmmiller612 Is this still an issue? We've since deployed the S3 docs (although they don't look any different from what @fnothaft described above) and released ADAM version 0.23.0.
@heuermh I can check later, but I still experience the same error in Hadoop-BAM. The doc above looks like it is specifically for ADAM files and not BAM files. The problem seems to be that Hadoop-BAM uses NIO to look for the BAI file, and the provider isn't getting registered, even if I manually add an S3 NIO package.
Thanks, I'll be on EMR later today and will do some further investigation.
Hi @dmmiller612! How are you adding the JARs? I'm not familiar with EMR, but depending on how they attach dependencies, the S3 NIO JAR may not be visible to the correct classloader. As far as I can tell, the NIO system searches a specific classloader for the NIO file system implementations. What's worked most reliably for me is to have the NIO library on the executor classpaths when the Spark executors boot.
The approach for doing this was documented in https://github.com/bigdatagenomics/adam/commit/90572b57586ae02779536b03ffe1ae7adc038ee9 or earlier.
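As a sketch of that approach, the provider JAR can be put on the driver and executor classpaths before the executors boot. The path and JAR version here are illustrative, and the JAR has to exist at that path on every node (or be shipped with `--jars`):

```scala
import org.apache.spark.sql.SparkSession

// Sketch: make the NIO provider visible to the classloader that the
// executors boot with. Paths and versions are illustrative.
val spark = SparkSession.builder()
  .config("spark.driver.extraClassPath", "/usr/lib/adam/jsr203-s3a-1.0.1.jar")
  .config("spark.executor.extraClassPath", "/usr/lib/adam/jsr203-s3a-1.0.1.jar")
  .getOrCreate()
```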
I'm trying to transformAlignments from a BAM file in S3, e.g.:
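A minimal sketch of this kind of job in the Scala API (the paths and bucket name are hypothetical):

```scala
import org.bdgenomics.adam.rdd.ADAMContext._

// Illustrative transformAlignments-style job: read a BAM from S3 and
// write the alignments back out as Parquet.
val reads = sc.loadAlignments("s3://bucket-name/sample.bam")
reads.saveAsParquet("s3://bucket-name/sample.alignments.adam")
```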
It fails with:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 92 in stage 0.0 failed 60 times, most recent failure: Lost task 92.59 in stage 0.0 (TID 1266, ip-10-184-8-118.ec2.internal, executor 1): java.nio.file.ProviderNotFoundException: Provider "s3" not found
at java.nio.file.FileSystems.newFileSystem(FileSystems.java:341)
at org.seqdoop.hadoop_bam.util.NIOFileUtil.asPath(NIOFileUtil.java:40)
...
If I stage the input BAM file on HDFS, the problem is resolved (an S3 output path works fine; only an S3 input path causes problems).
Do you have any pointers or fixes to get transformAlignments to support S3 input BAM files?