Hi @rstrahan! Thanks for dropping in with the issue. Accessing BAM data from S3 is thornier than I'd like, and we've got an action item to write this up for the 0.23.0 release (see #1643). I'll probably take this on tonight, since you're asking here. To get you unstuck, here's a set of pointers. From https://github.com/bigdatagenomics/mango/issues/311, you're on EMR, so you should be on a pretty up-to-date version of Hadoop, and your IAM roles should be configured properly. What you'll need to do is:

1. Use the `s3a` scheme, instead of the `s3` scheme. This is the latest version of the S3 file access code path in Hadoop.
2. Attach a JAR that provides an NIO file system provider for the `s3a` scheme. I'm not familiar with how EMR attaches JARs, but if you were using `spark-submit`/`adam-submit`, you could do `--packages net.fnothaft:jsr203-s3a`.
3. You may also need `com.amazonaws:aws-java-sdk-pom:1.10.34` and `org.apache.hadoop:hadoop-aws:2.7.4`. I don't think you'll need these on EMR, though.

I'll clean this up further tonight and bundle this in our docs. Let me know if you run into any further issues!
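If you're constructing the Spark session yourself rather than launching through `spark-submit`, a rough equivalent is to set `spark.jars.packages` before the context starts. This is a minimal sketch, not a tested recipe: the `jsr203-s3a` version shown is illustrative, and the setting is only honored at session creation time.

```scala
import org.apache.spark.sql.SparkSession

// Sketch: pull in the s3a NIO provider and S3 client JARs at session
// creation. On EMR you would more likely pass these via --packages on
// spark-submit/adam-submit instead. The jsr203-s3a version below is
// illustrative, not a pinned recommendation.
val spark = SparkSession.builder()
  .appName("adam-s3a")
  .config("spark.jars.packages",
    "net.fnothaft:jsr203-s3a:1.0.1," +
    "com.amazonaws:aws-java-sdk-pom:1.10.34," +
    "org.apache.hadoop:hadoop-aws:2.7.4")
  .getOrCreate()
```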
Hey @fnothaft,
Thanks for taking the time to look at this issue. I actually followed your instructions above, except that I used Maven for the JAR dependencies. I took these steps:
sparkSession.sparkContext.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
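A slightly fuller sketch of that same s3a wiring follows; the keys are standard Hadoop s3a properties, and the credential lines are an assumption that should be unnecessary on EMR, where IAM instance-profile credentials are picked up automatically.

```scala
// Sketch of the s3a Hadoop configuration. The explicit credential
// settings are illustrative; on EMR with proper IAM roles they
// should not be needed.
val hadoopConf = sparkSession.sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoopConf.set("fs.s3a.access.key", sys.env.getOrElse("AWS_ACCESS_KEY_ID", ""))
hadoopConf.set("fs.s3a.secret.key", sys.env.getOrElse("AWS_SECRET_ACCESS_KEY", ""))
```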
However, it looks like I hit the same issue with `s3a` on EMR (I can't verify locally at this time):
java.nio.file.ProviderNotFoundException: Provider "s3a" not found
at java.nio.file.FileSystems.newFileSystem(FileSystems.java:341)
at org.seqdoop.hadoop_bam.util.NIOFileUtil.asPath(NIOFileUtil.java:40)
at org.seqdoop.hadoop_bam.BAMRecordReader.initialize(BAMRecordReader.java:140)
at org.seqdoop.hadoop_bam.BAMInputFormat.createRecordReader(BAMInputFormat.java:121)
at org.seqdoop.hadoop_bam.AnySAMInputFormat.createRecordReader(AnySAMInputFormat.java:190)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.liftedTree1$1(NewHadoopRDD.scala:180)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:179)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:134)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
@rstrahan @dmmiller612 I've been using the `s3a` protocol on various versions of EMR without any issues, and without any additional dependencies or configuration, although I'm using `adam-shell` via ssh and ADAM interactively in Zeppelin, not `adam-submit`.
@heuermh you've been doing `s3a` with Parquet, no? This issue is specific to loading BAM, CRAM, or VCF, where the NIO provider is needed for Hadoop-BAM to access the BAI, CRAI, or TBI.
@dmmiller612 let me look into this a bit more. The TL;DR is that it's something to do with classpath/classloader configuration; I've run into this on other distros similar to EMR.
> @heuermh you've been doing `s3a` with Parquet, no? This issue is specific to loading BAM, CRAM, or VCF, where the NIO provider is needed for Hadoop-BAM to access the BAI, CRAI, or TBI.
I haven't been attempting to access the index directly (e.g. using `ADAMContext.loadIndexedBam`), so perhaps I've missed the issue. I have a working session this afternoon with the Be The Match folks and will look into it.
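For concreteness, an indexed read that would touch the `.bai` directly looks roughly like this; the region and path are made up, and this sketch assumes the `loadIndexedBam` overload that takes a single `ReferenceRegion`:

```scala
import org.bdgenomics.adam.models.ReferenceRegion
import org.bdgenomics.adam.rdd.ADAMContext._

// Hypothetical indexed read: restricting to a region forces Hadoop-BAM
// to consult the .bai index next to the BAM.
val region = ReferenceRegion("1", 100000L, 200000L)
val reads = sc.loadIndexedBam("s3a://bucket-name/something.bam", region)
```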
All Hadoop-BAM BAM reads create an NIO provider to try to find the index, even if you aren't using the index to filter intervals.
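A quick way to see the classloader side of this is to list the NIO providers that are actually installed. This diagnostic sketch assumes you run it in the same JVM where the lookup fails (i.e. on the executors, for example inside a `mapPartitions`):

```scala
import java.nio.file.spi.FileSystemProvider
import scala.collection.JavaConverters._

// List the NIO file system providers visible to this JVM. If "s3a"
// is missing from the schemes printed here, Hadoop-BAM's index lookup
// will fail with ProviderNotFoundException, as in the trace above.
FileSystemProvider.installedProviders.asScala
  .map(_.getScheme)
  .foreach(println)
```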
@fnothaft I'll continue to look as well. I simplified my search and am just using `adam.loadAlignments("s3a://bucket-name/something.bam")`, and I can still reproduce the issue.
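Spelled out as a self-contained `adam-shell` snippet (the import provides the implicit conversion that adds `loadAlignments` to the `SparkContext`; the bucket name is a placeholder):

```scala
import org.bdgenomics.adam.rdd.ADAMContext._

// Minimal reproduction: forcing an action makes the executors open
// the BAM, which is where Hadoop-BAM's NIO index lookup happens.
val reads = sc.loadAlignments("s3a://bucket-name/something.bam")
println(reads.rdd.count())
```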
Also worth mentioning that I am using Spark 2.2. I don't think that matters, but I wanted to give a bit of context in case it does.
@dmmiller612 Is this still an issue? We've since deployed the S3 docs (although they don't look any different from what @fnothaft described above) and released ADAM version 0.23.0.
@heuermh I can check later, but I still experience the same error in Hadoop-BAM. The doc above looks like it is specifically for ADAM files and not BAM files. The problem seems to be that Hadoop-BAM uses NIO to look for the BAI file, and the provider isn't getting registered, even if I manually add an S3 NIO package.
Thanks, I'll be on EMR later today and will do some further investigation.
Hi @dmmiller612! How are you adding the JARs? I'm not familiar with EMR, but depending on how they attach dependencies, the S3 NIO JAR may not be visible to the correct classloader. As far as I can tell, the NIO system searches a specific classloader for the NIO file system implementations. What's worked most reliably for me is to have the NIO library on the executor classpaths when the Spark executors boot.
The approach for doing this was documented in https://github.com/bigdatagenomics/adam/commit/90572b57586ae02779536b03ffe1ae7adc038ee9 or earlier.
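As a sketch of that approach, the provider JAR can be put on the driver and executor classpaths before the executors boot. The path and JAR version here are illustrative, and the JAR has to exist at that path on every node (or be shipped with `--jars`):

```scala
import org.apache.spark.sql.SparkSession

// Sketch: make the NIO provider visible to the classloader that the
// executors boot with. Paths and versions are illustrative.
val spark = SparkSession.builder()
  .config("spark.driver.extraClassPath", "/usr/lib/adam/jsr203-s3a-1.0.1.jar")
  .config("spark.executor.extraClassPath", "/usr/lib/adam/jsr203-s3a-1.0.1.jar")
  .getOrCreate()
```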
I'm trying to transformAlignments from a BAM file in S3, e.g.:
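A minimal sketch of this kind of job in the Scala API (the paths and bucket name are hypothetical):

```scala
import org.bdgenomics.adam.rdd.ADAMContext._

// Illustrative transformAlignments-style job: read a BAM from S3 and
// write the alignments back out as Parquet.
val reads = sc.loadAlignments("s3://bucket-name/sample.bam")
reads.saveAsParquet("s3://bucket-name/sample.alignments.adam")
```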
It fails with:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 92 in stage 0.0 failed 60 times, most recent failure: Lost task 92.59 in stage 0.0 (TID 1266, ip-10-184-8-118.ec2.internal, executor 1): java.nio.file.ProviderNotFoundException: Provider "s3" not found
at java.nio.file.FileSystems.newFileSystem(FileSystems.java:341)
at org.seqdoop.hadoop_bam.util.NIOFileUtil.asPath(NIOFileUtil.java:40)
...
If I stage the input BAM file on HDFS, the problem is resolved (an S3 output path works fine; only an S3 input path causes problems).
Do you have any pointers or fixes to get transformAlignments to support S3 input BAM files?