astrolabsoftware / spark-fits

FITS data source for Spark SQL and DataFrames
https://astrolabsoftware.github.io/spark-fits/
Apache License 2.0

header challenge: cannot infer size of type B from the header #59

Closed jacobic closed 5 years ago

jacobic commented 5 years ago

Here is some feedback about an error when reading unsigned-byte columns in FITS files.

Keep up the good work! I love this spark package :)

The following error is thrown when calling:

df = sqlc.read.format("fits").option("hdu", 1).load(path)

"FitsLib.getSplitLocation> Cannot infer size of type B
            from the header! See com.astrolabsoftware.sparkfits.FitsLib.getSplitLocation"

An example of the header is attached (the FLAG_* columns are the ones causing the problem): example.txt

It looks like the issue is due to not having a case for shortType.contains("B"):

def getSplitLocation(fitstype : String) : Int = {
      val shortType = FitsLib.shortStringValue(fitstype)

      shortType match {
        case x if shortType.contains("I") => 2
        case x if shortType.contains("J") => 4
        case x if shortType.contains("K") => 8
        case x if shortType.contains("E") => 4
        case x if shortType.contains("D") => 8
        case x if shortType.contains("L") => 1
        case x if shortType.endsWith("X") => {
          // Example 16X means 2 bytes
          x.slice(0, x.length - 1).toInt / BYTE_SIZE
        }
        case x if shortType.endsWith("A") => {
          // Example 20A means string on 20 bytes
          x.slice(0, x.length - 1).toInt
        }
        case _ => {
          println(s"""
            FitsLib.getSplitLocation> Cannot infer size of type $shortType
            from the header! See com.astrolabsoftware.sparkfits.FitsLib.getSplitLocation
              """)
          0
        }
      }
}
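For illustration, here is a minimal sketch of what handling the missing type could look like. In the FITS binary table convention, B denotes an unsigned byte, so each element occupies 1 byte. This is not the code from the repository: the object name SplitLocationSketch and the local shortStringValue helper are hypothetical stand-ins, and BYTE_SIZE is assumed to be 8. Like the snippet above, it ignores repeat counts.

```scala
// Illustrative sketch only; not the library's actual implementation.
object SplitLocationSketch {
  // Assumption: number of bits per byte, mirroring BYTE_SIZE in the snippet above.
  val BYTE_SIZE = 8

  // Hypothetical stand-in for FitsLib.shortStringValue:
  // strips surrounding quotes and whitespace from a header value.
  def shortStringValue(s: String): String =
    s.trim.stripPrefix("'").stripSuffix("'").trim

  // Size in bytes of one element of the given FITS column type.
  def getSplitLocation(fitstype: String): Int = {
    val shortType = shortStringValue(fitstype)

    shortType match {
      case x if x.contains("B") => 1 // unsigned byte: the case that was missing
      case x if x.contains("I") => 2 // 16-bit integer
      case x if x.contains("J") => 4 // 32-bit integer
      case x if x.contains("K") => 8 // 64-bit integer
      case x if x.contains("E") => 4 // single-precision float
      case x if x.contains("D") => 8 // double-precision float
      case x if x.contains("L") => 1 // logical
      // Example: 16X means 16 bits, i.e. 2 bytes
      case x if x.endsWith("X") => x.dropRight(1).toInt / BYTE_SIZE
      // Example: 20A means a string of 20 bytes
      case x if x.endsWith("A") => x.dropRight(1).toInt
      case _ =>
        println(s"Cannot infer size of type $shortType from the header!")
        0
    }
  }
}
```

With this sketch, SplitLocationSketch.getSplitLocation("B") returns 1 instead of falling through to the default branch.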

Thanks again, Jacob

JulienPeloton commented 5 years ago

Hi @jacobic,

Thanks for using the code and reporting the bug! I have opened a PR (#60), which includes tests. Could you test the modifications (branch: UBtypeFix) with your data and let me know whether that solves the issue?

Thanks! Julien

jacobic commented 5 years ago

Works like a charm! Thanks so much for implementing a fix so quickly.

Cheers, Jacob

JulienPeloton commented 5 years ago

Great, thanks for checking it! I will merge the changes, and they will be available in the next release (0.7.2) on the central repository.

Out of curiosity: in which context are you using the package?

jacobic commented 5 years ago

Hi Julien,

I am creating a pipeline to optically confirm clusters of galaxies that will be detected in X-rays by eROSITA (http://www.mpe.mpg.de/eROSITA). This requires processing a large number of photometric catalogs. Spark DataFrames make aggregating over the galaxy clusters much easier and faster than pure Python, so spark-fits and spark.ml are the perfect packages for me! :)

Python 3.7 / Scala 2.11.8 / Spark 2.3.2 (upgrading to 2.4.0 in a few days), running in standalone mode on an HPC system with GPFS (https://www.mpcdf.mpg.de/services/data/application-support/spark).

Cheers, Jacob

JulienPeloton commented 5 years ago

Hi Jacob,

Thanks! This sounds super exciting :-) Do not hesitate to bug me if you encounter problems or limitations with the package.

Julien