Closed darked89 closed 2 years ago
Command line
$ adam-submit transformAlignments sample.bam sample.alignments.adam
$ adam-submit transformFeatures annotation.bed annotation.features.adam
Scala
import org.bdgenomics.adam.ds.ADAMContext._
val alignments = sc.loadAlignments("sample.bam")
alignments.saveAsParquet("sample.alignments.adam")
val features = sc.loadFeatures("annotation.bed")
features.saveAsParquet("annotation.features.adam")
Python
from bdgenomics.adam.adamContext import ADAMContext
ac = ADAMContext(sc)
alignments = ac.loadAlignments("sample.bam")
alignments.saveAsParquet("sample.alignments.adam")
features = ac.loadFeatures("annotation.bed")
features.saveAsParquet("annotation.features.adam")
Hope this helps!
Thank you very much for such a quick answer.
A bit of a follow-up: are the resulting .adam files in a Parquet format readable by, say, Arrow?
Yes, I've never had any issues with Parquet in Apache Arrow. At some point there was a mis-specification between the JVM Parquet and C++ Parquet implementations with regard to LZ4 compression; I don't know if that is still a problem. Other compression algorithms should be fine.
I did run into incomplete Parquet support in DuckDB; details are at https://github.com/heuermh/bdg-formats-duckdb
As of that effort, DuckDB did not support Parquet enums or nested schemas, both of which we use in bdg-formats/ADAM.
Hello,
I can confirm that so far I have had no issues reading Parquet files created by ADAM using Python Polars. The only slightly confusing thing was a test RNA-Seq BAM produced by STAR (2x 150 bp reads), where I somehow got a minimum insert size of -911256.0. Is that a true insert size, or a location offset of the second read in the pair?
As for .bed to ADAM/Parquet, I noticed that the 6-column BED file was transformed into a 26-column Parquet file, with obviously empty columns for values not in the input. Not a problem, just a note that Parquet files created from BED files contain such extra slots.
Well, this should let me start experimenting with ADAM after getting back from vacation.
Many thanks for your help
Darek Kedra
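On the negative insert size mentioned above: a minimal sketch of how a signed template length (SAM TLEN) can come out large and negative. It assumes that the insert size ADAM reports is derived from the TLEN field of the BAM record, which should be checked against the ADAM source. The coordinates below are hypothetical, chosen only to reproduce the magnitude from the post. Under the SAM convention the leftmost read of a pair carries a positive value and the rightmost a negative one, and in RNA-Seq the span can include long introns (or reflect a spurious pairing), so the value is not necessarily the physical fragment length.

```python
def template_length(pos, end, mate_pos, mate_end):
    """Signed template length for one read of a pair, SAM-style.

    Coordinates are 1-based and inclusive. Simplified: both mates are
    assumed to map to the same reference, and ties on start position
    are resolved as positive.
    """
    left = min(pos, mate_pos)
    right = max(end, mate_end)
    span = right - left + 1
    # Leftmost read gets a plus sign, rightmost read a minus sign.
    return span if pos <= mate_pos else -span


# Hypothetical 2x 150 bp pair whose mates are far apart on the reference,
# for example in exons separated by a very long intron.
read1 = (1_000, 1_149)
read2 = (912_106, 912_255)

tlen_read1 = template_length(*read1, *read2)
tlen_read2 = template_length(*read2, *read1)
print(tlen_read1, tlen_read2)
```

The two mates report the same magnitude with opposite signs, so a minimum over all reads naturally picks up the negative value of the most widely separated pair.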
> As for .bed to ADAM/Parquet, I noticed that the 6-column BED file was transformed into a 26-column Parquet file, with obviously empty columns for values not in the input. Not a problem, just a note that Parquet files created from BED files contain such extra slots.
We use a rather rich schema for each of the various genomic data types, defined in Avro at https://github.com/bigdatagenomics/bdg-formats
The Feature schema was designed to support all of GFF2/GTF, GFF3, BED, Genbank, NarrowPeak, and IntervalList formats. A chart with attribute mappings can be found at
https://github.com/heuermh/bdg-formats/blob/docs/docs/source/features.md
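To illustrate why a 6-column BED ends up with many empty columns, here is a rough sketch in plain Python that does not use ADAM itself. The field list is an illustrative subset with assumed names, not the actual Feature schema; the authoritative definition and mappings are in the Avro schema and the chart linked above. The helper `bed6_to_feature` is hypothetical.

```python
# Illustrative subset of a wide, multi-format feature schema.
# Field names here are assumptions; see the Avro definition for the real ones.
FEATURE_FIELDS = [
    "referenceName", "start", "end", "name", "score", "strand",
    "source", "featureType", "geneId", "transcriptId",
]


def bed6_to_feature(line):
    """Map one tab-separated BED6 line onto the wide record (hypothetical helper)."""
    chrom, start, end, name, score, strand = line.rstrip("\n").split("\t")[:6]
    record = dict.fromkeys(FEATURE_FIELDS)  # every field starts out as None
    record.update(
        referenceName=chrom,
        start=int(start),
        end=int(end),
        name=name,
        score=None if score == "." else float(score),
        strand=strand,
    )
    return record


feature = bed6_to_feature("chr1\t100\t200\tpeak1\t5\t+")
empty = [field for field, value in feature.items() if value is None]
print(empty)
```

Fields that only GFF/GTF-style inputs would populate stay empty for BED input, which is the same effect seen in the Parquet output at a larger scale.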
Hello,
Would it be possible to provide a minimal example, whether in Scala, Python, or the CLI, of how to convert, say, a BAM file to ADAM's Parquet format? The same for a canonical 6-column BED file.
DK