bigdatagenomics / adam

ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet. Apache 2 licensed.

s3a URL not working #2096

Open dekinsitro opened 5 years ago

dekinsitro commented 5 years ago

I am trying to follow the documentation to allow ADAM to read a BAM file from S3.
According to https://adam.readthedocs.io/en/latest/deploying/aws/#input-and-output-data-on-hdfs-and-s3 I should run a command like this:

adam-submit \
  --packages com.amazonaws:aws-java-sdk-pom:1.11.463,net.fnothaft:jsr203-s3a:0.0.1 \
  -- \
  transformAlignments \
  s3a://1000genomes/phase1/data/NA12878/exome_alignment/NA12878.mapped.illumina.mosaik.CEU.exome.20110411.bam \
  /mnt/test.adam

When I run that command, I get an error with many unresolved dependency jars:

:: problems summary ::
:::: WARNINGS
        [NOT FOUND  ] org.apache.commons#commons-math3;3.1.1!commons-math3.jar (0ms)

    ==== local-m2-cache: tried

      file:/home/ubuntu/.m2/repository/org/apache/commons/commons-math3/3.1.1/commons-math3-3.1.1.jar

        [NOT FOUND  ] commons-collections#commons-collections;3.2.1!commons-collections.jar (0ms)

    ...

It's not clear to me (I don't work with Java much) what's going on, but my guess is that the tool that should download the package dependencies never runs, and Spark just looks for cached artifacts in the local Maven cache.
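One sanity check I can think of (just a guess on my part) would be asking plain spark-shell, with no ADAM involved, to resolve a single small artifact, to see whether the resolver ever goes out to the network at all:

$ # hypothetical check, not something I've run yet: if this also reports
$ # [NOT FOUND] against local-m2-cache only, the Ivy setup is the problem
$ spark-shell --packages com.google.code.findbugs:jsr305:3.0.0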

heuermh commented 5 years ago

Hello @dekinsitro, thank you for submitting this issue.

The docs suggest including org.apache.hadoop:hadoop-aws:2.7.4, so you may want to try:

adam-submit \
  --packages com.amazonaws:aws-java-sdk-pom:1.11.463,org.apache.hadoop:hadoop-aws:2.7.4,net.fnothaft:jsr203-s3a:0.0.1 \
  -- \
  transformAlignments \
  s3a://1000genomes/phase1/data/NA12878/exome_alignment/NA12878.mapped.illumina.mosaik.CEU.exome.20110411.bam \
  /mnt/test.adam

Are you running Spark on AWS, perhaps via EMR?
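Also, if you're not on EMR, the S3A connector will need AWS credentials from somewhere. One option (a sketch, untested on your setup; adjust to however you manage credentials) is passing them through Spark's Hadoop configuration:

$ # sketch only: passing S3A credentials via Spark's Hadoop config passthrough
$ adam-submit \
    --conf spark.hadoop.fs.s3a.access.key=$AWS_ACCESS_KEY_ID \
    --conf spark.hadoop.fs.s3a.secret.key=$AWS_SECRET_ACCESS_KEY \
    --packages com.amazonaws:aws-java-sdk-pom:1.11.463,org.apache.hadoop:hadoop-aws:2.7.4,net.fnothaft:jsr203-s3a:0.0.1 \
    -- \
    transformAlignments \
    s3a://... \
    /mnt/test.adam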

dekinsitro commented 5 years ago

I'm running on a plain Ubuntu 18.04 EC2 VM, not EMR. (Spark on EMR already includes the necessary S3 connector jars.)

Using your command changes the error, but it's still roughly the same problem:

adam-submit \
  --packages com.amazonaws:aws-java-sdk-pom:1.11.463,org.apache.hadoop:hadoop-aws:2.7.4,net.fnothaft:jsr203-s3a:0.0.1 \
  -- \
  transformAlignments \
  s3a://1000genomes/phase1/data/NA12878/exome_alignment/NA12878.mapped.illumina.mosaik.CEU.exome.20110411.bam \
  /mnt/test.adam

produces:

::::::::::::::::::::::::::::::::::::::::::::::

            ::              FAILED DOWNLOADS            ::

            :: ^ see resolution messages for details  ^ ::

            ::::::::::::::::::::::::::::::::::::::::::::::

            :: com.google.code.findbugs#jsr305;3.0.0!jsr305.jar

            :: org.apache.commons#commons-math3;3.1.1!commons-math3.jar

            :: com.sun.jersey#jersey-json;1.9!jersey-json.jar(bundle)

            :: org.codehaus.jettison#jettison;1.1!jettison.jar(bundle)

            :: com.sun.xml.bind#jaxb-impl;2.2.3-1!jaxb-impl.jar

            :: org.codehaus.jackson#jackson-jaxrs;1.9.13!jackson-jaxrs.jar

            :: org.codehaus.jackson#jackson-xc;1.9.13!jackson-xc.jar

            :: com.sun.jersey#jersey-server;1.9!jersey-server.jar(bundle)

            :: org.tukaani#xz;1.0!xz.jar

            :: jline#jline;0.9.94!jline.jar

            ::::::::::::::::::::::::::::::::::::::::::::::

I don't see any indication that the packages are even being downloaded; Spark only looks for them in the local cache.

heuermh commented 5 years ago

Right, things can be a little bit different depending on the Spark installation.

For example, for me on Cloudera CDH only the jsr203-s3a package is necessary:

$ export AWS_SECRET_ACCESS_KEY=...
$ export AWS_ACCESS_KEY_ID=...
$ adam-submit --packages net.fnothaft:jsr203-s3a:0.0.1 ...

I don't know why your version of Spark isn't trying to download the necessary dependencies; perhaps there's a network or Ivy settings issue?
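If it is a repository configuration problem, explicitly pointing spark-submit at Maven Central might help (a guess, untested):

$ # untested: --repositories adds extra remote repos for --packages resolution
$ adam-submit \
    --repositories https://repo1.maven.org/maven2 \
    --packages net.fnothaft:jsr203-s3a:0.0.1 \
    ...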

Another option would be to pull the dependencies into your local Ivy cache using Ivy directly:

$ ivy -dependency com.google.code.findbugs jsr305 3.0.0
:: loading settings :: url = jar:file:/usr/local/Cellar/ivy/2.4.0/libexec/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
:: resolving dependencies :: com.google.code.findbugs#jsr305-caller;working
    confs: [default]
    found com.google.code.findbugs#jsr305;3.0.0 in public
downloading https://repo1.maven.org/maven2/com/google/code/findbugs/jsr305/3.0.0/jsr305-3.0.0.jar ...
......... (19kB)
.. (0kB)
    [SUCCESSFUL ] com.google.code.findbugs#jsr305;3.0.0!jsr305.jar (73ms)
downloading https://repo1.maven.org/maven2/com/google/code/findbugs/jsr305/3.0.0/jsr305-3.0.0-sources.jar ...
........ (16kB)
.. (0kB)
    [SUCCESSFUL ] com.google.code.findbugs#jsr305;3.0.0!jsr305.jar(source) (59ms)
downloading https://repo1.maven.org/maven2/com/google/code/findbugs/jsr305/3.0.0/jsr305-3.0.0-javadoc.jar ...
...................... (173kB)
.. (0kB)
    [SUCCESSFUL ] com.google.code.findbugs#jsr305;3.0.0!jsr305.jar(javadoc) (88ms)
:: resolution report :: resolve 909ms :: artifacts dl 224ms
    ---------------------------------------------------------------------
    |                  |            modules            ||   artifacts   |
    |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
    ---------------------------------------------------------------------
    |      default     |   1   |   1   |   1   |   0   ||   3   |   3   |
    ---------------------------------------------------------------------
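Spark's --packages resolution uses ~/.ivy2 by default, so artifacts pulled this way should be picked up on the next adam-submit run. Assuming default cache locations, something like

$ # assuming Ivy's default cache location under ~/.ivy2
$ ls ~/.ivy2/cache/com.google.code.findbugs/jsr305/jars

should show the downloaded jar.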

I'll try hopping on an Ubuntu EC2 instance tomorrow to see if I can replicate your issue.

dekinsitro commented 5 years ago

Interesting suggestion. Please do try to reproduce this problem on a modern (Ubuntu 18.04) VM if possible. I'm basically doing either "conda install -c conda-forge adam" or "pip install bdgenomics.adam", then trying to run a basic transformAlignments on an S3-sourced file.

heuermh commented 5 years ago

Sorry for dropping this for a while; I'll try to replicate it later this week with the new 0.27.0 release.