darencard / GenomeAnnotation

Various scripts that I have created that are useful for genome annotation (repeats & proteins)
GNU General Public License v3.0
16 stars 5 forks source link

rmOutToGFF3custom sorts gff3 header into gff3 file #1

Closed AdamStuckert closed 2 years ago

AdamStuckert commented 3 years ago

Hi Daren,

Thanks for posting some really detailed work on Maker, especially with examples including custom masked repeats. I found a slight error with your rmOutToGFF3custom script. As it is written it sorts the ##gff-version 3 header along with the contents of the gff file. Moving one parenthesis fixes it.

#!/usr/bin/env bash
usage()
{
cat << EOF
rmOutToGFF3custom
Version 1 (2019-10-10)
License: GNU GPLv2
To report bugs or errors, please contact Daren Card (dcard@uta.edu).
This script is provided as-is, with no support and no guarantee of proper or
desirable functioning.

This script converts the .out file from RepeatMasker to a GFF3 file. Note that the output
is probably not perfect GFF3, so beware with downstream applications. This script
emulates the rmOutToGFF3.pl script supplied with RepeatMasker but provides a fuller ID
("target=") for each element in column 9 of the GFF. This ID includes the matching element,
like rmOutToGFF3.pl, but also includes the repeat family: in the format <Family>/<Element>.
This change is because many matching elements produced from RepeatModeler have IDs that
provide no information about repeat family classification. Output is written to standard
output (SDOUT).

This script requires requires awk, which should be available on any standard Unix system.

rmOutToGFF3custom -o <RM.out> [-h] > <name.gff3>

OPTIONS:
        -h              usage information and help (this message)
        -o              RepeatMasker .out file
EOF
}

while getopts "ho:" OPTION
do
        case $OPTION in
                help)
                        usage
                        exit 1
                        ;;
                o)
                        RMOUT=$OPTARG
                        ;;
        esac
done

if [[ -z $RMOUT ]]
then
        usage
        exit 1
fi

cat <(echo "##gff-version 3") \
<(cat ${RMOUT} | tail -n +4 | \
awk -v OFS="\t" '{ if ($12 ~ /)/) print $5, "RepeatMasker", "dispersed_repeat", $6, $7, $1, $9, ".", "Target="$11"/"$10" "$14" "$13; \
else print $5, "RepeatMasker", "dispersed_repeat", $6, $7, $1, $9, ".", "Target="$11"/"$10" "$12" "$13 }' | \
awk -v OFS="\t" '{ if ($7 == "C") print $1, $2, $3, $4, $5, $6, "-", $8, $9; else print $0 }' | \
sort -k1,1 -k4,4n -k5,5n )
darencard commented 2 years ago

Nice catch, @AdamStuckert! Thanks for reporting! Ironically, I caught this mistake and fixed it earlier this fall, shortly before you reached out - perhaps you had downloaded a copy of the script before I made the change in late September.