FilipDominec / nihilnovi

Browse and compare scientific data files like you browse your photo gallery (and process them with Python+Matplotlib)
MIT License

Replace liborigin with generated parser #2

Open KOLANICH opened 7 years ago

KOLANICH commented 7 years ago

Hi. I was searching GitHub for documentation on the Origin file format, so that I could describe it in a machine-readable way and generate a parser from that description, and found

trace the memleaks and errors in the liborigin code

in your Markdown README. Are you familiar enough with the Origin format to either give me an overview of it (the pieces missing from https://github.com/jgonera/openopj/blob/master/docs/opj_format.markdown) or fill in the gaps in my code?

meta:
  id: origin
  file-extension: opj
  application: Origin
  endian: le
doc: >
  https://github.com/jgonera/openopj/blob/master/docs/opj_format.markdown
seq:
  - id: signature
    type: signature
  - id: header
    type: header
  - id: data_list
    type: data_list
  - id: window_list
    type: window_list
  - id: parameters_section
    type: parameters_section
  - id: note_list
    type: note_list
  - id: project_tree
    type: project_tree
  - id: attachment_list
    type: attachment_list
types:
  cpya_ver:
    seq:
      - id: major
        type: str
        terminator: 0x2e # "."
        encoding: ASCII
      - id: minor
        type: str
        terminator: 0x20 # " "
        encoding: ASCII
  signature:
    seq:
      - id: sig_seq
        contents: "CPYA "
      - id: cpya_ver
        type: cpya_ver
      - id: build_number
        type: str
        terminator: 0x23 # "#"
        encoding: ASCII
      - id: line_terminator
        contents: "\n"
  header_data_block:
    seq:
      - id: unknown
        size: 27
      - id: version
        type: f8
      - id: unknown1
        size: 4
  data_section:
    seq:
      - id: header
        size: 27
      - id: content
        type: f8
      - id: null_block
        type: null_block
  data_header_flags:
    seq:
      - id: unkn0
        type: b4
      - id: integers
        type: b1
      - id: unkn1
        type: b2
      - id: text_and_numeric
        type: b1
      - id: unkn2
        type: b8
  text_or_numeric:
    doc: "In case of Text values, the value is a null-terminated string. The bytes after the null are garbage or earlier contents of the value and can be disregarded."
    seq:
      - id: unkn0
        type: u1
        doc: "If the first byte equals 0, the value is a double; if it is 1, the value is a string."
      - id: unkn1
        contents: [0]
        doc: "The second prefix byte seems to be always 0."
      - id: value
        type:
          switch-on: unkn0
          cases:
            0: f8
            1: strz
  data_header:
    seq:
      - id: unknown0
        size: 22
      - id: flags
        type: data_header_flags
      - id: data_type2
        type: u1
      - id: total_rows
        type: u4
      - id: first_row
        type: u4
      - id: last_row
        type: u4
      - id: unknown1
        size: 24
      - id: value_size
        type: u1
      - id: unknown2
        type: u1
      - id: data_type_u
        type: u1
      - id: unknown3
        size: 24
      - id: data_name
        type: strz
        size: 25
        doc: "Data name, for worksheets it's \"WORKSHEET_COLUMN\". Column is at most 18 chars long, remaining characters are used for \"_\", terminating null byte and worksheet name which may be truncated if too long."
      - id: data_type3
        type: u2
        doc: "According to [importOPJ][] the bytes starting at 0x0071 (start of this field) didn't exist before Origin 5.0."
      - id: unknown4
        type: u8
        doc: "Always zeros?"
    instances:
      rows:
        doc: >
          The data content block consists of consecutive values (e.g. consecutive rows in a worksheet column). The number of rows is `totalRows`. `firstRow` indicates the first non-empty row (0 is the first row) and `lastRow` indicates the last non-empty row.
          It should be enough to parse the values up to `lastRow` and skip the remaining ones.
          The format of the value seems to depend on `valueSize` and `dataType`:
             According to [liborigin][] `dataTypeU = 8` means that an integer value is unsigned (this is not verified).
             `valueSize = 1` seems to be rare or non-existent, at least in Origin 7.0552.
        size: value_size
        repeat: expr
        repeat-expr: total_rows
        type:
          switch-on: value_size
          cases:
            1: u1
            2: u2
            4: u4 #or f4
            8: f8
            #valueSize > 8: Text
            #valueSize > 8 and dataType & 0x100: text_or_numeric

  window_section_header:
    doc: >
      As of now, the description of window list and its subsections is incomplete and merely serves as an indication of how to skip to the parameters section.
      window_section contains a header block and a layer list.
    seq:
      - id: unknown0
        size: 2
        doc: "Unknown, always zero?"
      - id: name
        type: strz
        size: 25
      - id: unknown2
        size: n
        doc: "See importOPJ for details"
  # Parameters section
  parameters_section_header:
    doc: >
      This section does not contain blocks. Instead it contains an arbitrary number of parameter elements.
      The last parameter element is followed by a 0 byte and a line feed (`00 0A`), i.e. if you encounter a parameter name equal to "\0", there are no more parameter elements.
      The section ends with a null block (or possibly it's another section which is usually empty).
    seq:
      - id: name
        size: n
        type: str
      - id: lf0
        contents: "\n"
      - id: value
        type: f8
      - id: lf1
        contents: "\n"
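
For illustration, the signature, text_or_numeric and parameter-element layouts described above could be read in plain Python roughly as follows. This is only a sketch, not a tested OPJ reader: the function names are made up for this example, and both the little-endian byte order and the latin-1 text encoding are assumptions rather than verified facts about the format.

```python
import struct
from typing import Iterator, Tuple, Union


def parse_signature(line: bytes) -> Tuple[int, int, int]:
    # Expected layout: "CPYA " + major + "." + minor + " " + build + "#" + LF
    if not line.startswith(b"CPYA ") or not line.endswith(b"#\n"):
        raise ValueError("not an OPJ signature line")
    version, build = line[5:-2].split(b" ", 1)
    major, minor = version.split(b".", 1)
    return int(major), int(minor), int(build)


def parse_text_or_numeric(raw: bytes) -> Union[float, str]:
    # First prefix byte: 0 = double, 1 = string; second prefix byte is skipped.
    kind = raw[0]
    if kind == 0:
        # Assumption: doubles are stored little-endian.
        return struct.unpack_from("<d", raw, 2)[0]
    if kind == 1:
        # Bytes after the first null are leftover garbage and are dropped.
        text = raw[2:].split(b"\x00", 1)[0]
        return text.decode("latin-1")  # assumption: single-byte encoding
    raise ValueError("unexpected prefix byte %d" % kind)


def iter_parameters(buf: bytes, pos: int = 0) -> Iterator[Tuple[str, float]]:
    # Each element: name, LF, 8-byte double, LF.
    # A name consisting of a single 0 byte marks the end of the list.
    while True:
        end = buf.index(b"\n", pos)
        name = buf[pos:end]
        if name == b"\x00":
            return
        (value,) = struct.unpack_from("<d", buf, end + 1)
        if buf[end + 9:end + 10] != b"\n":
            raise ValueError("missing line feed after parameter value")
        yield name.decode("latin-1"), value
        pos = end + 10
```

The value is skipped as a fixed 8 bytes rather than searched for its trailing line feed, because a double can itself contain a 0x0A byte.
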
FilipDominec commented 7 years ago

Hi, unfortunately I am not familiar with the OPJ format. However, I am quite interested in helping you build such a parser, because the liborigin2 library often leaks memory so badly that it fills the RAM and eventually crashes the application, after leaving the machine unusable for several minutes.

The author of liborigin2 also does not seem very willing to collaborate: their website http://soft.proindependent.com/liborigin2/ states that "e-mails from persons "planning to work" will be ignored", and the website was last updated nearly six years ago. So I think there is a real need for an up-to-date OPJ parser.