bigdatagenomics / adam

ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet. Apache 2 licensed.

java.nio.file.ProviderNotFoundException (Provider "s3" not found) #1732

Closed: rstrahan closed this issue 6 years ago

rstrahan commented 7 years ago

I'm trying to run transformAlignments on a BAM file in S3, e.g.:

adam-submit transformAlignments s3://1000genomes/phase3/data/HG00154/alignment/HG00154.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam s3://<mybucket>/1000genomes/adam/bam=HG00154/

It fails with:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 92 in stage 0.0 failed 60 times, most recent failure: Lost task 92.59 in stage 0.0 (TID 1266, ip-10-184-8-118.ec2.internal, executor 1): java.nio.file.ProviderNotFoundException: Provider "s3" not found
    at java.nio.file.FileSystems.newFileSystem(FileSystems.java:341)
    at org.seqdoop.hadoop_bam.util.NIOFileUtil.asPath(NIOFileUtil.java:40)
    ...

If I stage the input BAM file on HDFS, the problem is resolved (the S3 output path works fine; only the S3 input path causes problems).

hadoop fs -cp s3://1000genomes/phase3/data/HG00154/alignment/HG00154.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam /adam/HG00154.bam  

adam-submit transformAlignments /adam/HG00154.bam s3://<mybucket>/1000genomes/adam/bam=HG00154/

Do you have any pointers or fixes to get transformAlignments to support S3 input BAM files?

fnothaft commented 7 years ago

Hi @rstrahan! Thanks for dropping in with the issue. Accessing BAM data from S3 is thornier than I'd like, and we've got an action item to write this up for the 0.23.0 release (see #1643). I'll probably take this on tonight, since you're asking here. To get you unstuck, here's a set of pointers. From https://github.com/bigdatagenomics/mango/issues/311, you're on EMR, so you should be on a pretty up-to-date version of Hadoop, and your IAM roles should be configured properly. What you'll need to do is switch your URLs to the s3a:// scheme and get an s3a NIO filesystem provider onto the classpath; see the sketch below.

I'll clean this up further tonight and bundle this in our docs. Let me know if you run into any further issues!
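For concreteness, here is a minimal sketch of that setup from adam-shell. It assumes an s3a NIO provider JAR (for example, net.fnothaft:jsr203-s3a; the exact coordinates and version are an assumption here, not something confirmed in this thread) is already on the driver and executor classpaths:

    import org.bdgenomics.adam.rdd.ADAMContext._

    // `sc` is the SparkContext that adam-shell provides. Load the BAM over
    // s3a; this is the step that exercises the NIO provider for the index
    // lookup.
    val alignments = sc.loadAlignments(
      "s3a://1000genomes/phase3/data/HG00154/alignment/" +
        "HG00154.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam")

    // Write back out to S3 as Parquet, mirroring the transformAlignments
    // invocation above; the bucket name is a placeholder.
    alignments.saveAsParquet("s3a://my-bucket/1000genomes/adam/bam=HG00154/")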

dmmiller612 commented 7 years ago

Hey @fnothaft ,

Thanks for taking the time to look at this issue. I actually followed your instructions above, except that I used Maven for the JAR dependencies.

However, I hit the same issue with s3a on EMR (I can't verify locally at this time):

java.nio.file.ProviderNotFoundException: Provider "s3a" not found
    at java.nio.file.FileSystems.newFileSystem(FileSystems.java:341)
    at org.seqdoop.hadoop_bam.util.NIOFileUtil.asPath(NIOFileUtil.java:40)
    at org.seqdoop.hadoop_bam.BAMRecordReader.initialize(BAMRecordReader.java:140)
    at org.seqdoop.hadoop_bam.BAMInputFormat.createRecordReader(BAMInputFormat.java:121)
    at org.seqdoop.hadoop_bam.AnySAMInputFormat.createRecordReader(AnySAMInputFormat.java:190)
    at org.apache.spark.rdd.NewHadoopRDD$$anon$1.liftedTree1$1(NewHadoopRDD.scala:180)
    at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:179)
    at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:134)
    at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

heuermh commented 7 years ago

@rstrahan @dmmiller612 I've been using the s3a protocol on various versions of EMR without any issues, and without any additional dependencies or configuration, although I'm using adam-shell via ssh and ADAM interactively in Zeppelin, not adam-submit.

fnothaft commented 7 years ago

@heuermh you've been doing s3a with Parquet, no? This issue is specific to loading BAM, CRAM, or VCF, where the nio provider is needed for Hadoop-BAM to access the BAI, CRAI, or TBI.
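To make the distinction concrete, here's a minimal sketch; the bucket and file names are hypothetical:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.bdgenomics.adam.rdd.ADAMContext._

    val sc = new SparkContext(new SparkConf().setAppName("s3a-bam-vs-parquet"))

    // Parquet reads go through the Hadoop FileSystem API, where EMR's
    // built-in s3a support (hadoop-aws) applies, so no NIO provider is
    // needed:
    val fromParquet = sc.loadAlignments("s3a://my-bucket/alignments.adam")

    // BAM reads go through Hadoop-BAM, which also builds a java.nio.file.Path
    // for the .bai index; that requires a registered NIO FileSystemProvider
    // for the "s3a" scheme, and otherwise fails with
    // java.nio.file.ProviderNotFoundException, as in the traces above:
    val fromBam = sc.loadAlignments("s3a://my-bucket/alignments.bam")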

fnothaft commented 7 years ago

@dmmiller612 let me look into this a bit more. The TL;DR is that it's something to do with classpath/classloader configuration; I've run into this on other distros similar to EMR.

heuermh commented 7 years ago

> @heuermh you've been doing s3a with Parquet, no? This issue is specific to loading BAM, CRAM, or VCF, where the nio provider is needed for Hadoop-BAM to access the BAI, CRAI, or TBI.

I haven't been attempting to access the index directly (e.g., using ADAMContext.loadIndexedBam), so perhaps I've missed the issue. I have a working session this afternoon with the Be The Match folks and will look into it.

fnothaft commented 7 years ago

All Hadoop-BAM BAM reads go through the NIO provider to try to find the index, even if you aren't using the index to filter intervals.
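Roughly what happens under the hood, sketched from the stack traces above (this mirrors the behavior of Hadoop-BAM's NIOFileUtil.asPath, not its exact code; the bucket name is hypothetical):

    import java.net.URI
    import java.nio.file.{FileSystemNotFoundException, FileSystems, Paths}
    import scala.collection.JavaConverters._

    val indexUri = URI.create("s3a://my-bucket/alignments.bam.bai")

    val indexPath =
      try {
        // Only consults providers installed on the system classloader:
        Paths.get(indexUri)
      } catch {
        case _: FileSystemNotFoundException =>
          // Fallback that also searches the supplied classloader; this is
          // the FileSystems.newFileSystem call in the traces above, and it
          // throws java.nio.file.ProviderNotFoundException when no provider
          // for the "s3a" scheme is visible.
          FileSystems.newFileSystem(
            indexUri,
            Map.empty[String, AnyRef].asJava,
            Thread.currentThread().getContextClassLoader
          ).provider().getPath(indexUri)
      }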

dmmiller612 commented 7 years ago

@fnothaft I'll continue to look as well. I simplified my repro down to adam.loadAlignments("s3a://bucket-name/something.bam"), and it still reproduces the issue.

dmmiller612 commented 7 years ago

Also worth mentioning: I'm using Spark 2.2. I don't think that matters, but I wanted to give a bit of context in case it does.

heuermh commented 6 years ago

@dmmiller612 Is this still an issue? We've since deployed the S3 doc (although it doesn't look any different from what @fnothaft described above) and released ADAM version 0.23.0.

dmmiller612 commented 6 years ago

@heuermh I can check later, but I still see the same error in Hadoop-BAM. The doc above looks like it's specifically for ADAM files, not BAM files. The problem seems to be that Hadoop-BAM uses NIO to look for the .bai file, and the provider isn't getting registered even when I manually add an S3 NIO package.
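One way to check what is actually registered, from adam-shell: NIO providers are discovered via java.util.ServiceLoader, so a JAR attached to the wrong classloader simply won't show up. A sketch:

    import java.nio.file.spi.FileSystemProvider
    import scala.collection.JavaConverters._

    // Schemes visible to the driver JVM; expect "file" and "jar", and "s3a"
    // only if the provider JAR was on the system classpath at startup.
    FileSystemProvider.installedProviders().asScala
      .foreach(p => println(p.getScheme))

    // The failure happens on the executors, so run the same check inside a
    // task to inspect their JVMs as well:
    sc.parallelize(0 until 1).map { _ =>
      FileSystemProvider.installedProviders().asScala
        .map(_.getScheme).mkString(", ")
    }.collect().foreach(println)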

heuermh commented 6 years ago

Thanks, I'll be on EMR later today and will do some further investigation.

fnothaft commented 6 years ago

Hi @dmmiller612! How are you adding the JARs? I'm not familiar with EMR, but depending on how they attach dependencies, the s3 NIO JAR may not be visible to the correct classloader. As far as I can tell, the NIO system searches a specific classloader for the NIO filesystem implementations. What's worked most reliably for me is to have the NIO library on the executor classpaths when the Spark executors boot.
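A hedged sketch of that configuration in code; the JAR path is a placeholder and has to exist on every worker node, and the equivalent flag on the submit command line would be --conf spark.executor.extraClassPath=...:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("bam-from-s3a")
      // Prepended to the executor classpath before the executor JVM boots,
      // so the provider is visible to the classloaders java.nio searches:
      .set("spark.executor.extraClassPath", "/usr/local/lib/jsr203-s3a.jar")

    val sc = new SparkContext(conf)
    // Note: the driver-side equivalent, spark.driver.extraClassPath, must be
    // set before the driver JVM starts (via spark-defaults.conf or the
    // submit command line), not programmatically.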

fnothaft commented 6 years ago

The approach for doing this was documented in https://github.com/bigdatagenomics/adam/commit/90572b57586ae02779536b03ffe1ae7adc038ee9 or earlier.