Open SidWeng opened 1 year ago
Hello @SidWeng!
I have occasionally seen issues with gzipped/BGZF FASTQ input before, although typically with paired reads, where ADAM complains that the two files do not contain the same number of reads. If you know of any publicly available datasets that demonstrate this issue, I can dig into it deeper.
As a workaround, you may be able to convert the FASTQ to unaligned BAM format first (e.g. with Picard's FastqToSam) and then read that into ADAM.
Another workaround would be to convert the FASTQ into CSV or tab-delimited format and then use Spark SQL to read the text file and convert it into ADAM format, something like:
```scala
import org.bdgenomics.adam.ds.ADAMContext._

val sql = """
  SELECT
    _c0 AS name,
    CAST(NULL AS STRING) AS description,
    'DNA' AS alphabet,
    upper(_c1) AS sequence,
    length(_c1) AS length,
    _c2 AS qualityScores,
    CAST(NULL AS STRING) AS sampleId,
    CAST(NULL AS MAP<STRING,STRING>) AS attributes
  FROM
    reads
"""

val df = spark.read.option("delimiter", "\t").csv(inputPath)
df.createOrReplaceTempView("reads")
val readsDf = spark.sql(sql)
val reads = sc.loadReads(readsDf)
reads.saveAsParquet(outputPath)
```
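The FASTQ-to-tab-delimited conversion step itself isn't shown above; a minimal sketch of that preprocessing in Python, assuming well-formed four-line FASTQ records (the function name and sample record here are hypothetical):

```python
def fastq_to_tsv(lines):
    """Yield one tab-delimited (name, sequence, qualityScores) row
    per four-line FASTQ record, matching the _c0/_c1/_c2 columns
    expected by the Spark SQL above."""
    it = iter(lines)
    for header in it:
        seq = next(it).rstrip("\n")
        next(it)  # skip the '+' separator line
        qual = next(it).rstrip("\n")
        # read name is the '@'-stripped header up to the first whitespace
        name = header.rstrip("\n").lstrip("@").split()[0]
        yield f"{name}\t{seq}\t{qual}\n"

record = "@read1 extra\nACGT\n+\nIIII\n"
print(list(fastq_to_tsv(record.splitlines(keepends=True))))
# prints ['read1\tACGT\tIIII\n']
```

Writing these rows to a text file gives input that `spark.read.option("delimiter", "\t").csv(...)` can load directly.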
adam-core version: 0.33.0
Spark version: 3.3.0
Scala version: 2.12
I read a FASTQ BGZ file with the following code:
It works fine if the file is about 70 GB. However, when the file size is about 170 GB, some reads are missing (the missing reads are well formed), and the missing reads can be found if the file is read line by line.
Is there any limitation in `SingleFastqInputFormat`, or any advice that could help me debug this issue?
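One way to get an independent read count to compare against what ADAM loads is to stream the compressed file line by line. A minimal sketch in Python (BGZF is a multi-member gzip stream, which Python's `gzip` module reads transparently; the in-memory buffer here stands in for a real `.fastq.bgz` file):

```python
import gzip
import io

def count_fastq_records(stream):
    """Count FASTQ records in a text stream (four lines per record)."""
    return sum(1 for _ in stream) // 4

# demo: two FASTQ records, gzip-compressed in memory
raw = b"@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n"
buf = io.BytesIO(gzip.compress(raw))
with gzip.open(buf, "rt") as f:
    n = count_fastq_records(f)
print(n)  # prints 2
```

If this count disagrees with `reads.rdd.count()` on the same file, that would help localize the loss to the input format rather than to the file itself.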