ababaian / bioSyntax-archive

Syntax highlighting for computational biology
http://bioSyntax.org
GNU General Public License v3.0

Selecting Arbitrary Nth Column in Syntax #17

Closed ababaian closed 6 years ago

ababaian commented 6 years ago

I was working on the mostly trivial case of the FASTA index format (faidx), and because it is so simple I found a very nice way to select columns by the order in which they appear. The only requirement right now is that the file is tab-delimited.

What it does is match the first column up to the first tab, scope it, then push to contig.length.

In contig.length every non-whitespace character is selected and scoped. Then when it hits the next tab it pops out.

The third column is then selected, scoped and pushed to genomic.offset. The fourth column is selected and then popped at the tab.

And so on. This push-pop back and forth at each tab can be repeated for any number of columns, which means that .bed, .bedpe, .gtf, .sam, and possibly some of .vcf can now be 'solved', since we know what type of data is supposed to be in the Nth column.

Can anyone think of a reason that this won't work or will break at some edge-case?

If not, we'll need to re-work those syntaxes, as I think this is a more robust approach than trying to select each column by the range of values that could appear in it.
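
To poke at edge cases outside the editor, the column walk described above can be sketched in plain Python. This is only a rough model, not how Sublime evaluates a syntax file: the function name `scope_by_column` and the sample line are made up for illustration, and the scope names are copied from the faidx definition in this thread.

```python
# Scope names copied from the faidx definition in this thread, one per column.
FAIDX_SCOPES = [
    "coord.Chr.faidx",         # column 1
    "coord.Start.faidx",       # column 2
    "constant.numeric.faidx",  # column 3
    "comment.line.faidx",      # column 4
    "comment.line.faidx",      # column 5
]


def scope_by_column(line, scopes):
    """Return (text, scope) pairs for one tab-delimited line.

    `column` stands in for the active context: each tab advances it,
    and every call starts again at the first column, like a new line.
    Columns past the end of `scopes` get None.
    """
    spans = []
    column = 0
    text = ""
    for ch in line.rstrip("\n") + "\t":  # trailing tab flushes the last column
        if ch == "\t":
            scope = scopes[column] if column < len(scopes) else None
            spans.append((text, scope))
            text = ""
            column += 1
        else:
            text += ch
    return spans


# Made-up example line in faidx layout: NAME, LENGTH, OFFSET, LINEBASES, LINEWIDTH.
for text, scope in scope_by_column("chr1\t248956422\t112\t70\t71\n", FAIDX_SCOPES):
    print(f"{text!r:14} {scope}")
```

Two things this model makes easy to try: an empty field (two tabs in a row) still advances the column, and a space-delimited line is treated as one single column. Whether the real syntax definition behaves the same way in both cases would need checking in the editor.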

faidx.sublime-syntax

%YAML 1.2
---
name: faidx
file_extensions: [fa.fai,fasta.fai]
scope: source.faidx

contexts:
  main:
    # COLUMN 1
    - match: '^[\S]*\t'
      scope: coord.Chr.faidx
      push: contig.length

    # COLUMN 3
    - match: '(?<=\t)[\S]*\t'
      scope: constant.numeric.faidx
      push: genomic.offset

    # COLUMN 5
    - match: '[\S]*$'
      scope: comment.line.faidx

  contig.length:
    # COLUMN 2
    - match: '[\S]*'
      scope: coord.Start.faidx
    - match: \t
      pop: true

  genomic.offset:
    # COLUMN 4
    - match: '[\S]*'
      scope: comment.line.faidx
    - match: \t
      pop: true

pan-chu commented 6 years ago

This is really cool.

I think it looks pretty robust as it is. Though, would it work for files with >5 columns? We might need to figure out how to encode the 5th column if there is a 6th column. Maybe something along the lines of '(?<=\t[\S]*\t)[\S]*\t'?
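
One thing worth checking before relying on a lookbehind that spans a whole preceding column: some regex engines only accept fixed-width lookbehind. Python's `re` module is one of them, and the small check below shows it. This says nothing definite about the engine Sublime uses, which would have to be tested separately; the helper name `compiles` is made up for this sketch.

```python
import re


def compiles(pattern):
    """Return True if Python's `re` module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


fixed_width = r"(?<=\t)\S*\t"          # lookbehind over a single tab
variable_width = r"(?<=\t\S*\t)\S*\t"  # lookbehind over a whole preceding column

print(compiles(fixed_width))     # True
print(compiles(variable_width))  # False: look-behind requires fixed-width pattern
```

If the target engine has the same restriction, selecting the Nth column purely by lookbehind would not scale past the second column, which is an argument for the context-based approach.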

ababaian commented 6 years ago

Even simpler version, with an open-ended scope for all columns greater than 5.

Robust Nth Column Selection

%YAML 1.2
---
name: faidx
file_extensions: [fa.fai,fasta.fai]
scope: source.faidx

# Fasta Index Filetype Description
# NAME  Name of this reference sequence
# LENGTH  Total length of this reference sequence, in bases
# OFFSET  Offset within the FASTA file of this sequence's first base
# LINEBASES The number of bases on each line
# LINEWIDTH The number of bytes in each line, including the newline

contexts:
  main:
    # COLUMN 1
    - match: '^[\S]*\t'
      scope: coord.Chr.faidx
      push: col2

  col2:
    # COLUMN 2
    - match: '[\S]*'
      scope: coord.Start.faidx
    - match: \t
      push: col3
    - match: $
      pop: true

  col3:
    # COLUMN 3
    - match: '[\S]*'
      scope: constant.numeric.faidx
    - match: \t
      push: col4
    - match: $
      pop: true

  col4:
    # COLUMN 4
    - match: '[\S]*'
      scope: comment.line.faidx
    - match: \t
      push: col5
    - match: $
      pop: true

  col5:
    # COLUMN 5
    - match: '[\S]*'
      scope: comment.line.faidx
    - match: \t
      push: colast
    - match: $
      pop: true

  colast:
    # Any COLUMN >5
    - match: .*
      scope: comment.line.faidx
      pop: true
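
The open-ended last context in the definition above can be mimicked in a few lines of Python, which is handy for checking what happens to lines with fewer or more than five columns. This is a sketch under assumptions: `scope_with_overflow` is a made-up name, the sample lines are invented, and the model only mirrors the intent of the catch-all context, not Sublime's actual matching.

```python
# Per-column scopes taken from the definition above.
FAIDX_SCOPES = [
    "coord.Chr.faidx",
    "coord.Start.faidx",
    "constant.numeric.faidx",
    "comment.line.faidx",
    "comment.line.faidx",
]


def scope_with_overflow(line, scopes, overflow_scope):
    """Scope the first len(scopes) columns one by one; lump the rest.

    The maxsplit argument keeps everything after the last named column
    in a single piece, tabs included, similar in spirit to the '.*'
    match in the catch-all context.
    """
    fields = line.rstrip("\n").split("\t", len(scopes))
    spans = list(zip(fields, scopes))   # stops at the shorter of the two
    if len(fields) > len(scopes):       # something is left past the last column
        spans.append((fields[-1], overflow_scope))
    return spans


short = scope_with_overflow("chr1\t1000\t6", FAIDX_SCOPES, "comment.line.faidx")
extra = scope_with_overflow("a\tb\tc\td\te\tf\tg", FAIDX_SCOPES, "comment.line.faidx")
print(len(short))   # 3
print(extra[-1])    # ('f\tg', 'comment.line.faidx')
```

A line with fewer columns simply stops early, and anything beyond the fifth column ends up in one span, so the per-column list never needs to know the maximum number of columns in advance.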

pan-chu commented 6 years ago

brilliant!

ababaian commented 6 years ago

I think this same logic could be applied for gedit and Vim syntax as well. There is a Match Start // Match End logic which can be extended in this way. I would say if we figure this out soon we'll simplify our lives greatly.

Maybe read some syntax highlighting files for other complex languages (C, XML, etc.) to learn how other people solved similar problems.

Ebedthan commented 6 years ago

Can we get a screenshot of what it looks like @ababaian ?

ababaian commented 6 years ago

[Screenshot: faidx]

Ebedthan commented 6 years ago

Could you please share the colors you used for this color scheme?

ababaian commented 6 years ago

I'd say let's not worry too much about the color schemes just yet. This was based on bioMonokai for Sublime, which has a dark background. Gedit is based on Kate and has a light background, so it might not work there. The third column is simply the default 'numeric' color; the fourth and fifth are comment-colored.

We're going to have to formalize all the colors and/or settle on one dark and one light theme that look the same across all the different programs. We can worry about this last; for now the highest priority is getting the syntax files to work reliably in all the different software.

ababaian commented 6 years ago

Also faidx-gedit syntax

Check out the Fasta Index Language File for an example of the logic. It's the same thing as in Sublime / less, where nested contexts can be used to select by column. This should make SAM/VCF/BED/GTF files much easier to deal with.

faidx-gedit