bigdatagenomics / adam

ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet. Apache 2 licensed.

BAM/BED to parquet #2376

Closed darked89 closed 2 years ago

darked89 commented 2 years ago

Hello,

Would it be possible to provide a minimal example, be it in Scala, Python, or the CLI, of how to convert, say, a BAM file to ADAM's Parquet format? Same with a canonical 6-column BED.

DK

heuermh commented 2 years ago

Command line

$ adam-submit transformAlignments sample.bam sample.alignments.adam
$ adam-submit transformFeatures annotation.bed annotation.features.adam

Scala

import org.bdgenomics.adam.ds.ADAMContext._

val alignments = sc.loadAlignments("sample.bam")
alignments.saveAsParquet("sample.alignments.adam")

val features = sc.loadFeatures("annotation.bed")
features.saveAsParquet("annotation.features.adam")

Python

from bdgenomics.adam.adamContext import ADAMContext
ac = ADAMContext(sc)

alignments = ac.loadAlignments("sample.bam")
alignments.saveAsParquet("sample.alignments.adam")

features = ac.loadFeatures("annotation.bed")
features.saveAsParquet("annotation.features.adam")

Hope this helps!

darked89 commented 2 years ago

Thank you very much for such a quick answer.

A bit of a follow-up: are the resulting .adam files in a Parquet format readable by, say, Arrow?

heuermh commented 2 years ago

Yes, I've never had any issues with Parquet in Apache Arrow. There was a mis-specification between the JVM Parquet and the C++ Parquet implementations with regard to LZ4 compression at some point; I don't know whether that is still a problem. Other compression algorithms should be fine.

I did have some issues with incomplete support for Parquet in DuckDB; details are here: https://github.com/heuermh/bdg-formats-duckdb

As of that effort, DuckDB did not support Parquet enums or nested schemas, both features that we use in bdg-formats/ADAM.

darked89 commented 2 years ago

Hello,

I can confirm that so far I have had no issues reading Parquet files created by ADAM using Python polars. The only slightly confusing thing was a test RNA-Seq BAM produced by STAR (2x 150 bp reads), where I somehow got a minimum insert size of -911256.0. Is that a true insert size, or the location offset of the second read in the pair?
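Not from the thread, but for context: the value ADAM surfaces here comes from the BAM TLEN field, which the SAM spec defines as a signed quantity: the leftmost segment of a template reports a positive value and the rightmost a negative one, and for spliced RNA-Seq alignments the span includes introns, so large magnitudes are expected. A minimal sketch of that convention (function name and coordinates are illustrative, not from ADAM's code):

```python
# Sketch of the SAM TLEN ("insert size") sign convention. TLEN is the
# distance from the leftmost mapped base to the rightmost mapped base of
# the template; the leftmost read reports +TLEN, the rightmost -TLEN.

def template_length(read_start, read_end, mate_start, mate_end):
    """Signed template length for one read of a pair (0-based, half-open)."""
    leftmost = min(read_start, mate_start)
    rightmost = max(read_end, mate_end)
    tlen = rightmost - leftmost
    # The read that starts leftmost reports +tlen, its mate -tlen.
    return tlen if read_start <= mate_start else -tlen

# Illustrative coordinates only: a spliced RNA-Seq pair whose mates flank
# a long intron can easily produce a TLEN in the hundreds of thousands.
print(template_length(1_000_000, 1_000_150, 88_894, 89_044))  # -911256
```

So a negative value of that magnitude is the rightmost mate of a pair reporting a (possibly intron-spanning) template length, not a corrupt offset.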

As for the BED-to-ADAM/Parquet conversion, I noticed that the 6-column BED got transformed into a 26-column Parquet file, with obviously empty columns for values not present in the input. Not a problem, just a note that Parquet files created from BED files contain such extra slots.
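The behavior described above can be sketched in plain Python: a BED6 line fills only a handful of fields in a much wider record, and everything else stays null, which is what surfaces as empty Parquet columns. The six BED-mapped field names below (referenceName, start, end, name, score, strand) follow the bdg-formats Feature schema; the extra field names are an illustrative approximation of that schema, not the exact 26 columns.

```python
# Sketch: mapping one BED6 line into a wider Feature-like record.
# Only the BED6 fields are populated; the remaining fields (names
# approximated from the bdg-formats Feature schema) stay None and
# show up as empty columns in the resulting Parquet.

BED6_FIELDS = ["referenceName", "start", "end", "name", "score", "strand"]
# Illustrative subset of the schema's extra fields, not the exact list:
EXTRA_FIELDS = ["featureId", "featureType", "source", "phase", "frame",
                "geneId", "transcriptId", "exonId", "attributes"]

def bed6_to_feature(line):
    cols = line.rstrip("\n").split("\t")
    record = dict.fromkeys(BED6_FIELDS + EXTRA_FIELDS)  # all None
    record.update(zip(BED6_FIELDS, cols))
    # BED coordinates are 0-based, half-open, matching the ADAM convention.
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    record["score"] = float(record["score"])
    return record

feature = bed6_to_feature("chr1\t1000\t2000\tpeak1\t960\t+")
print(feature["referenceName"], feature["start"], feature["geneId"])
# chr1 1000 None
```

The upside of the wide schema is that the same Feature record type can hold data converted from richer formats (GFF3, GTF, NarrowPeak, etc.), at the cost of null columns when the input is sparse.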

Well, this should let me start experimenting with ADAM after I get back from vacation.

Many thanks for your help

Darek Kedra

heuermh commented 2 years ago

As for the BED-to-ADAM/Parquet conversion, I noticed that the 6-column BED got transformed into a 26-column Parquet file, with obviously empty columns for values not present in the input. Not a problem, just a note that Parquet files created from BED files contain such extra slots.

We use a rather rich schema for all the various genomic data types, defined in Avro at https://github.com/bigdatagenomics/bdg-formats

The Feature schema was designed to support all of the GFF2/GTF, GFF3, BED, GenBank, NarrowPeak, and IntervalList formats. A chart with attribute mappings can be found at https://github.com/heuermh/bdg-formats/blob/docs/docs/source/features.md