jalview output - Githubissues

martin-raden commented 6 years ago

Hi @PatrickRWright @eggzilla @JensGeorg ,

Jalview is now available via bioconda, and I have created a small shell script (see below) that produces svg|eps|png figures (default: svg only) for all subdirs of evo_alignments (the default) or of a provided directory.

So a few questions come up:

should this script be part of the CopraRNA package on GitHub? (= postprocessing option)
- if yes, I will create a corresponding pull request including a README documentation update
if yes, should the CopraRNA bioconda recipe install jalview?

For the webserver I will:

write a script that
- reduces the evo_alignments folder to the entries from the final websrv result table
- renames all remaining folders (and contained files) to rank_x/rank_x_...
use the given jalview wrapper script to produce images for the resulting folders
show the images in the result page for the selected row (below the shown interaction)

Anything forgotten?

Please give me your feedback, thanks!

#!/usr/bin/env bash

##############################
#
# DEPENDENCY: bioconda 'jalview' package (or a corresponding wrapper script in $PATH)
#
# Runs Jalview for [SVG]-depictions of whole CopraRNA alignment output (typically in 'evo_alignments' subdirectory).
# 
# To this end, it iterates over all subdirectories in the provided directory (argument 1).
# If no argument is provided, it processes local directory 'evo_alignments' in PWD
#
# You can alter/extend the produced depiction output with the second and subsequent arguments by providing a subset of
# [svg,png,eps]
#
##############################

# check if bioconda's Jalview executable 'jalview' is available
hash jalview 2>/dev/null || { echo >&2 "ERROR: expected 'jalview' bioconda wrapper script for Jalview not found in \$PATH"; exit 1; }

# get directory to process (with given default directory)
DIR2PROCESS="evo_alignments";
if [ "$#" -ge 1 ]; then
    DIR2PROCESS=$1;
    shift; # remove directory from argument list
fi
# check if directory exists
if [ ! -d "$DIR2PROCESS" ]; then
    echo >&2 "ERROR: directory '$DIR2PROCESS' cannot be accessed"; exit 1;
fi

# get output formats
OUTFORMAT="svg";
if [ "$#" -gt 0 ]; then
    OUTFORMAT="";
    while [ "$#" -gt 0 ]; do
        # check if valid output selection
        if [[ $1 =~ ^(svg|eps|png)$ ]] ; then 
            OUTFORMAT="$1 $OUTFORMAT";
        else
            echo >&2 "ERROR: output format '$1' is not supported, use a subset of [svg,png,eps]"; exit 1;
        fi
        shift;
    done
fi

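# iterate over all subdirectories
for path in "$DIR2PROCESS"/*/; do
    # skip if the glob did not match any subdirectory
    [ -d "$path" ] || continue;
    # subdirectory name without leading path and trailing slash
    d=$(basename "$path");
    # create list of output files
    OUTPUTARGS="";
    for o in $OUTFORMAT; do
        OUTPUTARGS="-$o $DIR2PROCESS/$d/$d.$o $OUTPUTARGS";
    done
    # call jalview (note '${d}_mRNA': a plain '$d_mRNA' would expand the unset variable 'd_mRNA')
    jalview -nodisplay -open "$DIR2PROCESS/$d/${d}_mRNA.fasta" -features "$DIR2PROCESS/$d/${d}_mRNA_anno.txt" -annotations "$DIR2PROCESS/$d/${d}_mRNA_annotation.txt" $OUTPUTARGS;
done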
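
The output-format check of the script can be isolated into a small function, which makes it easy to try out on its own. The following is only a sketch; the function name `is_supported_format` is hypothetical and not part of the script. The alternatives are grouped in a single pair of parentheses so that both anchors apply to each of them; with an ungrouped pattern such as `^(svg)|(eps)|(png)$`, a string like `xeps` would also be accepted.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of the original script):
# succeeds only if the argument is exactly one of svg, eps or png.
is_supported_format() {
    # '^' and '$' enclose the whole group, so partial matches are rejected
    [[ $1 =~ ^(svg|eps|png)$ ]]
}

# usage sketch:
#   is_supported_format "$1" || { echo >&2 "ERROR: output format '$1' is not supported"; exit 1; }
```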

eggzilla commented 6 years ago

Hey, I think the best option would be to add a commandline switch that first checks if jalview is present and if yes creates the corresponding output files and just prints a warning if not. I would recommend not to add the jalview recipe to the default coprarna dependencies.
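
A minimal sketch of such a check, assuming a hypothetical helper `have_tool` (the name and the warning text are illustrative only and not part of CopraRNA): it prints a warning and returns non-zero instead of aborting, so the caller can skip the figure generation.

```shell
#!/usr/bin/env bash

# Hypothetical helper: returns 0 if the given tool is found in $PATH,
# otherwise prints a warning to stderr and returns 1 (no abort).
have_tool() {
    local tool="$1"
    if command -v "$tool" >/dev/null 2>&1; then
        return 0
    fi
    echo >&2 "WARNING: '$tool' not found in \$PATH, skipping figure generation"
    return 1
}

# usage sketch:
#   if have_tool jalview; then
#       ... call the jalview wrapper script ...
#   fi
```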

martin-raden commented 6 years ago

ok, the script already checks whether jalview is available and aborts with error message otherwise.

@eggzilla do you think the script should be part of the github distribution?

eggzilla commented 6 years ago

I would add the script to coprarna_aux, but maybe do the check with the other sanity checks in the main script. Then they are bundled in one place :-)

martin-raden commented 6 years ago

mhh.. I disagree since I see this script as an optional postprocessing script for a coprarna job. so it should IMO not be run by the coprarna pipeline itself...

eggzilla commented 6 years ago

I am fine with both solutions; however, Jens' false-positive removal post-processing script can now also be triggered by a command-line switch (--ooifilt). Or do you mean to add the switch and leave the check in the script, so that it also works on its own?

martin-raden commented 6 years ago

second script for top-X-cleanup is ready:

#!/usr/bin/env bash

#####################################
#
# DEPENDENCY: CopraRNA called with '-websrv', which should result in
#  - subdir 'evo_alignments'
#  - file   'CopraRNA_result.map_evo_align'
#
# 1) removes all folders from 'evo_alignments' that are not named in 'CopraRNA_result.map_evo_align'
#
# 2) renames all folders in 'evo_alignments' (and their content) to 'rank_X' where X is the line number in 'CopraRNA_result.map_evo_align'
#
#####################################

IDFILE=CopraRNA_result.map_evo_align;
EVODIR=evo_alignments

# check for required data
test -e $IDFILE  || { echo >&2 "ERROR: expected file '$IDFILE' not found.."; exit 1; }
test -d $EVODIR  || { echo >&2 "ERROR: expected subdir '$EVODIR' not found.."; exit 1; }

# generate list of folders IDs to maintain
RANKEDIDS=`cat $IDFILE | grep -P '^\d+$' | tr '\n' ' ' `
# generate search string for ranked ids enclosed by '_'
RANKEDIDPATTERN=`echo $RANKEDIDS | tr ' ' '_' `;
RANKEDIDPATTERN="_${RANKEDIDPATTERN}_";

# remove unnecessary subfolders
for d in $EVODIR/*; do 
    # get ID from subfolder name
    CURSUBDIR=`echo $d | tr "/" " " | awk '{print $2;}'`;
    CURID=`echo $d | tr "/_" "  " | awk '{print $3;}'`;
    # check if in RANKEDIDS
    if [[ $RANKEDIDPATTERN == *"_${CURID}_"* ]]; then
        # get rank of file
        CURRANK=`grep -P "^${CURID}$" -n $IDFILE | awk -F ':' '{print $1}'`;
        # rename files in folder to 'rank_$CURRANK_*'
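        for file in "$d"/*; do 
            fileNEW="${file/${CURSUBDIR}_/rank_${CURRANK}_}"; 
            # only move files whose name actually changes
            [ "$file" != "$fileNEW" ] && mv "${file}" "${fileNEW}";
        done
        # rename folder to 'rank_$CURRANK' (kept inside $EVODIR)
        mv "$d" "$EVODIR/rank_$CURRANK";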
    else 
        # remove subfolder
        rm -rf $d;
    fi
done
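
The rank lookup used in the script (line number of an ID within the ID file) can be sketched in isolation as follows. The function name `rank_of_id` is hypothetical; it uses plain `grep -n -x -F` instead of `grep -P`, on the assumption that the IDs are literal strings that occupy a whole line.

```shell
#!/usr/bin/env bash

# rank_of_id ID FILE   (hypothetical helper, not part of the original script)
# Prints the 1-based line number of the first line in FILE that equals ID exactly;
# prints nothing and returns non-zero if ID is not listed.
rank_of_id() {
    local id="$1" idfile="$2" rank
    # -n: prefix matches with line number, -x: match whole lines only, -F: literal string
    rank=$(grep -n -x -F -- "$id" "$idfile" | head -n 1 | cut -d ':' -f 1)
    [ -n "$rank" ] && echo "$rank"
}

# usage sketch:
#   CURRANK=$(rank_of_id "$CURID" CopraRNA_result.map_evo_align)
```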

martin-raden commented 6 years ago

@PatrickRWright @JensGeorg I discussed with @eggzilla the following pipeline:

the standard CopraRNA2.pl call should avoid an unnecessary explosion of the number of files on the user's hard drive. thus it should at the end:

compress the whole evo_alignments folder to evo_alignments.zip
remove the folder afterwards

e.g. using

zip -rmq evo_alignments evo_alignments 2>&1 (quietly packs the folder and removes it afterwards)

since 99% of the users will never touch the alignments

the webserver will

extract only the subdirs of the final result table from evo_alignments.zip
rename them with the script from above
generate figures with the jalview script from above
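
The selective extraction could look roughly like the following sketch. It assumes that the subdirs inside the archive are named `<ID>_...` and that the ID file contains one numeric ID per line; the function name `zip_patterns_for_ids` is hypothetical. The function only builds the file patterns, and the actual `unzip` call is shown as a comment.

```shell
#!/usr/bin/env bash

# zip_patterns_for_ids IDFILE [EVODIR]   (hypothetical helper)
# Prints one archive pattern per numeric ID line of IDFILE.
# ASSUMPTION: archive members are stored as EVODIR/<ID>_.../...
zip_patterns_for_ids() {
    local idfile="$1" evodir="${2:-evo_alignments}" id
    # keep only lines that consist of digits, as in the cleanup script
    grep -E '^[0-9]+$' "$idfile" | while read -r id; do
        printf '%s/%s_*\n' "$evodir" "$id"
    done
}

# usage sketch (unzip matches the wildcards itself, xargs does not expand them):
#   zip_patterns_for_ids CopraRNA_result.map_evo_align | xargs unzip -q evo_alignments.zip
```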

both scripts will be part of the webserver postprocessing pipeline and don't have to be integrated into the CopraRNA package. only the jalview call will be added to the documentation for the sake of completeness.

what do you think?

JensGeorg commented 6 years ago

The data management sounds good to me. I think the jalview part is only needed for the webserver and does not need to be part of the CopraRNA package for now.

For the midterm perspective: I am currently thinking about evolutionary stuff (not touching the original CopraRNA prediction) as a post-processing step, which might need additional resources.

martin-raden commented 6 years ago

ok, I have added the zipping to the CopraRNA cleanup in #9

I will work on the scripts for the webserver.

@JensGeorg let me know what you have in mind when settled. ;)

PatrickRWright / CopraRNA

jalview output #8