ababaian / bioSyntax-archive

Syntax highlighting for computational biology
http://bioSyntax.org
GNU General Public License v3.0

Porting to less #16

Closed ababaian closed 6 years ago

ababaian commented 6 years ago

Two formats, .sam and .vcf, are often very large and cannot be opened quickly in vim or most other text editors, because the editor loads the whole file into memory (although vim is decent if you have enough memory). Using head is a partial workaround. The better solution is to use less for .sam and .vcf. So can we have syntax highlighting in less for these important formats?

We can leverage the source-highlight package to accomplish this. I believe the syntax-language files may be shared with gedit which will save work on that end.

Installing source-highlight for less (Ubuntu)

1) Install source-highlight to your system:

sudo apt-get update
sudo apt-get install source-highlight

2) Append these lines to your ~/.bashrc and/or ~/.zshrc


## Syntax highlighting in less
## For Ubuntu / Fedora
export LESSOPEN="| /usr/share/source-highlight/src-hilite-lesspipe.sh %s"
export LESS=" -R "

alias less='less -NSi -# 10'
alias more='less'

# Explicit fasta / sam less call for piping
# e.g.:  samtools view -h aligned_hits.bam | sam-less
#
alias fa-less='source-highlight -f esc --lang-def=fasta.lang --outlang-def=bioSyntax.outlang --style-file=fasta.style | less'
alias sam-less='source-highlight -f esc --lang-def=sam.lang --outlang-def=bioSyntax.outlang --style-file=sam.style | less'
alias vcf-less='source-highlight -f esc --lang-def=vcf.lang --outlang-def=bioSyntax-vcf.outlang --style-file=vcf.style | less'

Note: On other systems src-hilite-lesspipe.sh may be installed in a different directory, e.g. on CentOS: export LESSOPEN="| /usr/bin/src-hilite-lesspipe.sh %s"
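Since the script location varies between systems, the rc file could probe for it instead of hard-coding one path. This is a minimal sketch, not part of bioSyntax: the helper name find_lesspipe is hypothetical, and the two candidate paths are only the ones mentioned in this thread.

```shell
# Hypothetical helper: print the first argument that is an existing file.
find_lesspipe() {
    local candidate
    for candidate in "$@"; do
        if [ -f "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# Candidate locations from this thread (Ubuntu / Fedora first, then CentOS).
# LESSOPEN is only exported when one of them exists.
if lesspipe_path=$(find_lesspipe \
        /usr/share/source-highlight/src-hilite-lesspipe.sh \
        /usr/bin/src-hilite-lesspipe.sh); then
    export LESSOPEN="| $lesspipe_path %s"
fi
```

If neither path exists, LESSOPEN is left untouched and less behaves as it did before.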

Installing bioSyntax for less (Ubuntu)

1) Update the src-hilite-lesspipe.sh script in the source-highlight directory.

# source-highlight directory on your system
SRCDIR='/usr/share/source-highlight'

cd "$bioSyntax_PATH/syntax/less/"

sudo cp src-hilite-lesspipe_BIO.sh "$SRCDIR/src-hilite-lesspipe.sh"

2) Copy over the *.lang, *.style and *.outlang files to the source-highlight directory.

#!/bin/bash
# quickInstall.sh
# Quick installer for less syntax
# for testing purposes

SRCDIR='/usr/share/source-highlight'

# Copy over src-hilite script
sudo cp src-hilite-lesspipe_BIO.sh $SRCDIR/src-hilite-lesspipe.sh

# Copy over language files
sudo cp fasta.lang $SRCDIR/
sudo cp sam.lang $SRCDIR/
sudo cp vcf.lang $SRCDIR/

# Copy over style files
sudo cp fasta.style $SRCDIR/
sudo cp sam.style $SRCDIR/
sudo cp vcf.style $SRCDIR/

# Copy over output-language (.outlang) files
sudo cp bioSyntax.outlang $SRCDIR/
sudo cp bioSyntax-vcf.outlang $SRCDIR/

3) Restart your terminal (or run source ~/.bashrc) so the rc file changes take effect; a full reboot is not needed.
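Before trying less, it may help to confirm that every file from the steps above actually landed in the source-highlight directory. A small sketch under that assumption: report_missing is a hypothetical helper, and the file list simply mirrors the quick installer in this thread.

```shell
# Hypothetical check: print each expected file that is absent from a directory.
# Usage: report_missing DIR FILE...
report_missing() {
    local dir=$1 name
    shift
    for name in "$@"; do
        [ -f "$dir/$name" ] || printf '%s\n' "$name"
    done
}

# Files copied by the quick installer above; no output means nothing is missing.
report_missing /usr/share/source-highlight \
    src-hilite-lesspipe.sh \
    fasta.lang sam.lang vcf.lang \
    fasta.style sam.style vcf.style \
    bioSyntax.outlang bioSyntax-vcf.outlang
```

Any name printed points at a copy step that needs to be re-run.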

Running bio-aware less

1) File extensions (*.fa, *.fasta, *.sam) are detected automatically when reading an entire file:

less hgr1.fa

2) Piping requires explicit use of fa-less, sam-less or vcf-less, which can be combined in all the interesting ways you can come up with:

samtools view -h accepted_hits.bam | sam-less
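The difference between the two modes comes down to whether a file name is available: with a file, the extension can select the language definition, while a pipe carries no name, so the user has to choose. The sketch below only illustrates that idea. It is not the contents of src-hilite-lesspipe_BIO.sh (which is not shown in this thread), bio_lang_for is a hypothetical name, and including *.vcf in the mapping is an assumption.

```shell
# Hypothetical mapping from a file name to a language definition file.
# Returns non-zero for extensions it does not recognize.
bio_lang_for() {
    case "$1" in
        *.fa|*.fasta) printf 'fasta.lang\n' ;;
        *.sam)        printf 'sam.lang\n' ;;
        *.vcf)        printf 'vcf.lang\n' ;;   # assumption: vcf handled the same way
        *)            return 1 ;;
    esac
}

bio_lang_for hgr1.fa
```

The call at the end prints the definition chosen for the file from the example above.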

Developing language syntax files (ongoing)

To get less working, we define one <language>.lang and one <language>.style file per language, plus an .outlang file: fasta.lang and sam.lang share bioSyntax.outlang, while vcf.lang uses bioSyntax-vcf.outlang.

Known Bugs

ababaian commented 6 years ago

Basic fasta syntax highlighting is working in less. Note: the limited color palette currently restricts our ability to add amino-acid coloring support.

[screenshot: less_fasta]

ababaian commented 6 years ago

Most of .sam is done for less along with detailed installation instructions for use in making the installer. I'll tweak this a bit more.

[screenshots: sam-less_command, sam-less, sam-less_2]

ababaian commented 6 years ago

I've completely ported VCF to less and updated the installer instructions. I found another way to do robust column expressions in this syntax language. VCF files with long rows (100+ samples) take a long time to parse, about 3-4 lines per second. A better programmer than me should look into optimizing the regex code for faster coloring. It's quite usable though.

With 1000genomes data (1000 columns in examples/vcf/test_1000genomes.vcf.gz) it takes 15+ seconds to load a row! Needs work. If someone wants to do some heavy lifting on optimization: please speed up the regex without losing any features. @lazypanda10117 perhaps?
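To measure how parse time grows with the number of sample columns without depending on the 1000genomes file, one option is a synthetic row of adjustable width. This is a rough sketch: make_vcf_row is a hypothetical helper, the field values are made-up placeholders, and a header-less row may not be colored exactly like a real VCF.

```shell
# Hypothetical generator: print one tab-separated VCF-like data row with
# the 9 fixed columns followed by N placeholder sample columns.
make_vcf_row() {
    local n=$1 i=0
    printf 'chr1\t100\t.\tA\tC\t.\tPASS\t.\tGT'
    while [ "$i" -lt "$n" ]; do
        printf '\t0|1'
        i=$((i + 1))
    done
    printf '\n'
}

# Possible timing run, reusing the flags from the vcf-less alias above
# (requires source-highlight and the bioSyntax files to be installed):
#   time (make_vcf_row 1000 | source-highlight -f esc --lang-def=vcf.lang \
#       --outlang-def=bioSyntax-vcf.outlang --style-file=vcf.style > /dev/null)
```

Timing rows of, say, 100, 500 and 1000 samples would show whether the slowdown is roughly linear in row width or worse, which is a useful baseline before changing any regex.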

Before

[screenshot: less-vcf_before]

After

[screenshot: less-vcf]

alyeffy commented 6 years ago

Time complexity for regular expressions

ababaian commented 6 years ago

I was doing a little bit of work and realized that I often read fq files with less while streaming them from gzip (.fq.gz). So I've added .fq, .fai, .flagstat, .bed and .gtf support for less, so it's more complete now.

We should probably do .pdb as well so it's "complete".

[screenshots: gtf-less, gtf-less_gencode]