TresAmigosSD / SMV

Spark Modularized View
Apache License 2.0

Add format in schema file #113

Closed ninjapapa closed 6 years ago

ninjapapa commented 9 years ago

Although we can focus on supporting only CSV internally, there are cases where the input data is in fixed record length format.

AliTajeldin commented 9 years ago

Agree. I think from a user perspective, this needs to be a mere schema change (which would affect both the input and output). May need to extend the schema definition to support this.

ninjapapa commented 9 years ago

Need to add the location info to the schema file. Here is the SAS format for a typical fixed record length file:

    @ 1   CASENUM              $char8.  /* Patient ID number*/
    @ 9   REG                  $char10. /* Registry ID */
    @ 19  MAR_STAT             $char1.  /* Marital Status at DX */
    @ 20  RACE                 $char2.  /* Race/Ethnicity */
    @ 22  ORIGIN               $char1.  /* Spanish/Hispanic Origin */
    @ 23  NHIA                 $char1.  /* NHIA Derived Hispanic Origin */
    @ 24  SEX                  $char1.  /* Sex */
    @ 25  AGE_DX               $char3.  /* Age at diagnosis */
    @ 28  YR_BRTH              $char4.  /* Year of birth */
    @ 35  SEQ_NUM              $char2.  /* Sequence Number--Central */

If the Schema covers all the columns, the position info in the SAS format file becomes redundant. Also, since we will only support text data, the keyword $char could be removed as well.
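To make the redundancy point concrete, here is a minimal Python sketch (not SMV code; `offsets_from_widths` is a hypothetical helper) that derives 1-based start positions from the field widths alone, assuming the fields are contiguous. The widths are taken from the first nine fields of the SAS layout above.

```python
from itertools import accumulate
from typing import List


def offsets_from_widths(widths: List[int]) -> List[int]:
    """Return 1-based start positions for contiguous fields.

    The result has one extra trailing element: the position
    immediately after the last field.
    """
    return list(accumulate(widths, initial=1))


# Widths of CASENUM .. YR_BRTH in the SAS layout ($char8., $char10., ...).
widths = [8, 10, 1, 2, 1, 1, 1, 3, 4]
offsets = offsets_from_widths(widths)

starts = offsets[:-1]   # start position of each field
next_free = offsets[-1]  # where the next contiguous field would start
```

The computed starts match the `@` positions for those nine fields. The next contiguous position is 32, while the layout places SEQ_NUM at 35, so this particular sample skips some columns. Positions can only be dropped when the schema really does list every column.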

So the Schema file could either be

    Field1: String @13, 2
    Field2: Integer @24, 8
    ...

or

    Field1: String 2
    Field2: Integer 8
    ...

We may need to separate the CSV handling from Schema handling, even if we want the Schema file to support both CSV and FRL files.
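As an illustration of the two proposed entry styles, the following Python sketch parses either form and slices one record accordingly. It is not SMV's implementation: the `Entry` type, the regular expression, and the rule that an entry without `@start` begins right after the previous field are all assumptions made for this example.

```python
import re
from typing import Dict, List, NamedTuple, Optional, Union


class Entry(NamedTuple):
    name: str
    type_name: str
    start: Optional[int]  # 1-based position, or None if not given
    length: int


# Matches "Name: Type @start, length" and "Name: Type length".
_ENTRY_RE = re.compile(r"^\s*(\w+)\s*:\s*(\w+)\s+(?:@\s*(\d+)\s*,\s*)?(\d+)\s*$")


def parse_entry(line: str) -> Entry:
    m = _ENTRY_RE.match(line)
    if m is None:
        raise ValueError(f"cannot parse schema entry: {line!r}")
    name, type_name, start, length = m.groups()
    return Entry(name, type_name, int(start) if start else None, int(length))


def slice_record(record: str, entries: List[Entry]) -> Dict[str, Union[str, int]]:
    """Cut one fixed record length line into typed fields."""
    out: Dict[str, Union[str, int]] = {}
    cursor = 1  # next 1-based position when no explicit start is given
    for e in entries:
        start = e.start if e.start is not None else cursor
        raw = record[start - 1:start - 1 + e.length]
        out[e.name] = int(raw) if e.type_name == "Integer" else raw
        cursor = start + e.length
    return out


# Width-only style: fields are assumed to be contiguous.
entries = [parse_entry("Field1: String 2"), parse_entry("Field2: Integer 8")]
row = slice_record("AB00001234", entries)
```

Keeping `parse_entry` apart from `slice_record` mirrors the separation suggested above: the schema parsing does not need to know whether the data file is CSV or fixed record length.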

ninjapapa commented 9 years ago

I'm debating how far we want to go down the path of supporting input/output data formats. If fixed record length is the only format we want to add besides CSV, we can implement the support at the Schema (actually SchemaEntry) level. If there is a chance that we will need to support other formats, we need to think about whether we should do it more systematically. Either way, I don't think it should be SMV's core function.

ninjapapa commented 9 years ago

f45fa91f8535cecce036b532d9bf93a33ad47078 Added minimal fixed record length file read support.

ninjapapa commented 9 years ago

As discussed, will add a format to every schema entry in the schema file, as follows:

    field1: Integer[5d]
    field2: String[12c]
    field3: SSN[3d-3d-4d]

The format will not be propagated through calculations, so any output of fixed record length data needs to enforce the schema format at the end of the calculation. Anything persisted in the middle will still be in CSV, without the format.

Will start with only 2 types of formats:

More may be added later.

ninjapapa commented 9 years ago

Renamed the issue title

laneb commented 7 years ago

Is this issue resolved?

ninjapapa commented 7 years ago

No. We do have fixed record length data support, but we don't have the "format in schema" support described in the "Feb 26" message. Not so sure whether we still need it.

ninjapapa commented 6 years ago

Won't fix. Reopen if we eventually run into this need again.