DerrickWood / kraken2

The second version of the Kraken taxonomic sequence classification system
MIT License

Built a Docker container for ARM64 (M1 Macs) #624

Open mikesha2 opened 2 years ago

mikesha2 commented 2 years ago

Trying to clear up some confusion in the manual.

There's really no need to maintain macOS-compatible builds, as the manual currently describes.

kraken2 can be built on the arm64v8/ubuntu image (kraken2 is in Ubuntu's package repository), which runs at essentially native speed in Docker Desktop on M1 Macs. I suggest removing the entire paragraph about building on Mac and replacing it with "Use Docker Desktop if on M1 Macs, and pull the arm64v8 image."

https://hub.docker.com/r/cms6712/kraken2

For reference, I was doing the stupid thing and pulling the x64 image from staphb/kraken2, and running kraken2 via emulation, which was about 20x slower.

The Dockerfile is also only two lines:

FROM arm64v8/ubuntu
RUN apt-get update && apt-get install kraken2 -yq

derekstein commented 1 year ago

Kind of new to this. First you would need an Ubuntu server set up and then install Docker? Kind of confused on how to get this all installed if you're starting from scratch ...

mikesha2 commented 1 year ago

Hi! To explain it simply:

On macOS, Docker Desktop runs a lightweight Linux virtual machine, and containers run on that VM's Linux kernel. This means that once a container image is built, you can run it on any machine that has Docker.

This magic only works when the architecture is the same (for example running an x86_64 Linux container on an x86_64 processor). Otherwise, Docker will resort to emulation of x86_64, which is far slower. For Apple Silicon Macs, the CPU architecture is ARM64.
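A quick way to see which case applies is to compare the host architecture with the image architecture. The sketch below is only illustrative: `docker_platform_for_arch` is a hypothetical helper name, and the mapping covers just the two architectures discussed here.

```shell
# Map a machine name, as printed by `uname -m`, to a Docker platform string.
# docker_platform_for_arch is a hypothetical helper, not part of Docker.
docker_platform_for_arch() {
    case "$1" in
        arm64|aarch64) echo "linux/arm64" ;;   # Apple Silicon, ARM Linux
        x86_64|amd64)  echo "linux/amd64" ;;   # Intel/AMD
        *)             echo "unknown"; return 1 ;;
    esac
}
```

For example, `docker_platform_for_arch "$(uname -m)"` prints the platform of the host, and `docker image inspect --format '{{.Os}}/{{.Architecture}}' cms6712/kraken2` prints the platform of a pulled image. If the two differ, the container runs under emulation, if it runs at all.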

Fortunately, the people over at Canonical spent a lot of effort making an ARM64 version of Ubuntu. Additionally, the people at Docker made an ARM64 version of Docker which runs at native speed on Apple Silicon. The result is this:

  1. The ARM64 version of Ubuntu ships pre-built, native ARM64 packages in its apt repository, and kraken2 is one of them.
  2. We can build an ARM64 Docker container with kraken2 installed, using the Dockerfile I wrote above.
  3. This container runs at near-native speed on Apple Silicon, because it's an ARM64 container running on ARM64 hardware.
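Step 2 can be sketched as two small shell functions, assuming the two-line Dockerfile from earlier in this thread sits in the current directory. `IMAGE_TAG` and both function names are made up for illustration; `--platform linux/arm64` just makes the target architecture explicit.

```shell
# Hypothetical local tag for the image; change it to whatever you like.
IMAGE_TAG="${IMAGE_TAG:-kraken2-arm64:local}"

# Build the image from the Dockerfile in the given directory (default: .).
build_image() {
    docker build --platform linux/arm64 -t "$IMAGE_TAG" "${1:-.}"
}

# Smoke test: ask the kraken2 inside the container for its version.
smoke_test() {
    docker run --rm --platform linux/arm64 "$IMAGE_TAG" kraken2 --version
}
```

Running `build_image && smoke_test` should end with kraken2 printing its version if the build succeeded.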

That ARM64 Docker container is located at the link I posted above (https://hub.docker.com/r/cms6712/kraken2).

In short: All the end user needs to do is install Docker for Mac, pull the linked Docker container, and download/build a kraken2 database.

You can probably replace the instructions for Mac support with:

  1. Download Docker for Mac: https://docs.docker.com/desktop/install/mac-install/
  2. Run docker pull cms6712/kraken2 from Terminal
  3. Download/build an appropriate kraken2 database, e.g. https://benlangmead.github.io/aws-indexes/k2
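Once the image is pulled and a database is unpacked, the database and the reads have to be made visible inside the container. A minimal sketch, assuming the published image has no custom entrypoint (as with the two-line Dockerfile above) so that `kraken2` can be given as the command; `classify_in_docker`, `/db`, and `/reads` are names chosen here for illustration:

```shell
# Run kraken2 inside the container on one reads file.
#   $1 = host directory holding the unpacked database (*.k2d files)
#   $2 = host path to a FASTQ/FASTA file
# Both host paths should be absolute so that -v treats them as bind mounts.
classify_in_docker() {
    db_dir=$1
    reads=$2
    docker run --rm \
        -v "$db_dir:/db:ro" \
        -v "$(dirname "$reads"):/reads:ro" \
        cms6712/kraken2 \
        kraken2 --db /db "/reads/$(basename "$reads")"
}
```

`classify_in_docker /path/to/k2_db /path/to/reads.fastq > outputFile` writes the per-read classifications to `outputFile` on the host, because the redirect happens outside the container.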

Does that help?

derekstein commented 1 year ago

I think it's clearer. But I would need Ubuntu running on my Mac as well, no? At present I don't ...

mikesha2 commented 1 year ago

No. The point is that Docker Desktop supplies the Linux kernel (through its built-in VM), and the container image is literally Ubuntu with kraken2 installed. You don’t do anything except follow those three steps.

derekstein commented 1 year ago

Got it working! Do I have to have Docker open and running every time I want to use kraken? Currently trying to build the database! It's taking forever. It also doesn't help that my computer shut down overnight for a software update ... lol

derekstein commented 1 year ago

I am having issues when classifying. Is there a way to assign RAM? I get this error: "Loading database information...classify: Error reading in hash table". I am running an M1 Max with 64 GB of RAM. The database is the 16 GB hash file, so there should be enough. Any help from a fellow Mac user would be appreciated!

mikesha2 commented 1 year ago

I would suggest using a pre-built database, as linked above: https://benlangmead.github.io/aws-indexes/k2

mikesha2 commented 1 year ago

As you can see, some of the databases get quite large (I see one that's 96.3 GB); those are probably meant for classifying on a cluster. I get pretty good results with the databases capped at 16 GB.
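By default kraken2 loads the whole hash table into RAM, and containers under Docker Desktop only see the memory assigned to Docker's VM (adjustable under Settings > Resources), not the Mac's full RAM. A rough check, with `fits_in_memory` as a hypothetical helper and the 10% headroom an arbitrary assumption rather than a kraken2 requirement:

```shell
# Succeed if a hash table of $1 bytes fits in $2 bytes of memory,
# leaving roughly 10% headroom (an arbitrary safety margin).
fits_in_memory() {
    hash_bytes=$1
    mem_bytes=$2
    [ $(( hash_bytes + hash_bytes / 10 )) -le "$mem_bytes" ]
}
```

For example, `fits_in_memory $(wc -c < path/to/db/hash.k2d) $(docker info --format '{{.MemTotal}}')` compares the database's hash file with the memory Docker reports. If the check fails, either raise Docker's memory limit or pass `--memory-mapping` to kraken2, which avoids loading the database into RAM, usually at some cost in speed.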

derekstein commented 1 year ago

https://benlangmead.github.io/aws-indexes/k2

The link above is exactly where I downloaded the database from. No issues there. However, I get this error when trying to classify: "Loading database information...classify: Error reading in hash table"

mikesha2 commented 1 year ago

Which one did you download? Just checked and mine still works fine with k2_pluspf_16gb_20220607

derekstein commented 1 year ago

I downloaded k2_standard_08gb_20220926.

mikesha2 commented 1 year ago

Try a slightly older one.

They should just work.

mikesha2 commented 1 year ago

Do you have paired-end reads, or two separate sets of single-end reads?

I'm running the following command:

kraken2 --db path/to/k2_pluspf_16gb_20220607/ file_1.fastq.gz file_2.fastq.gz > outputFile
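Whether the two files are mates matters for how kraken2 is invoked: with `--paired`, the two files are read as mates of the same fragments, while without it they are treated as independent single-end inputs. A sketch of both forms follows; the function names and the `.report` suffix are illustrative choices, not kraken2 conventions.

```shell
# Paired-end: the R1 and R2 files hold mates in the same order.
classify_paired() {
    db=$1; r1=$2; r2=$3; out=$4
    kraken2 --db "$db" --paired \
        --output "$out" --report "$out.report" "$r1" "$r2"
}

# Single-end (for example one long-read FASTQ).
classify_single() {
    db=$1; reads=$2; out=$3
    kraken2 --db "$db" \
        --output "$out" --report "$out.report" "$reads"
}
```

Here `--output` receives the per-read classifications and `--report` a per-taxon summary, so nothing needs to be redirected from standard output.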

derekstein commented 1 year ago

It's actually Nanopore data. I generated the FASTQ using the new ONT Dorado package.

This is what I am running:

% /Users/derekstein/kraken2/kraken2-master/kraken2-dir/kraken2 --db /Users/derekstein/kraken2/kraken2-master/k2_pluspfp_16gb_20220607 /Users/derekstein/vsc_projects/dorado/fastq/test.fastq > outputFile

and I get this error:

Loading database information...classify: Error reading in hash table
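Since the message points at reading the database's hash table, one cheap thing to rule out is an incomplete download or extraction. A kraken2 database directory should contain `hash.k2d`, `opts.k2d`, and `taxo.k2d`; the sketch below (with `check_k2_db` as a hypothetical helper) only checks that each file exists and is non-empty, so it cannot detect a truncated file.

```shell
# Report any of the three required database files that is missing or empty.
# Returns 0 when all are present, 1 otherwise.
check_k2_db() {
    db_dir=$1
    status=0
    for f in hash.k2d opts.k2d taxo.k2d; do
        if [ ! -s "$db_dir/$f" ]; then
            echo "missing or empty: $db_dir/$f"
            status=1
        fi
    done
    return "$status"
}
```

`check_k2_db /path/to/database` prints nothing and returns 0 when all three files are present. If it passes, comparing the on-disk size against the size listed on the download page is a reasonable next step.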