enormandeau / gawn

Genome Annotation Without Nightmares

Cannot locate cdna.transdecoder_dir & hangup error at blastx #2

Closed: chamalis closed this issue 6 years ago

chamalis commented 7 years ago
$ cat 02_infos/gawn_config.sh 
#!/bin/bash

# Modify the following parameter values according to your experiment
# Do not modify the parameter names or remove parameters
# Do not add spaces around the equal (=) sign

# Global parameters
NCPUS=2                    # Number of CPUs to use for analyses (int, 1+)

# Genome indexing
SKIP_GENOME_INDEXING=0      # 1 to skip genome indexing, 0 to index it

# Genome annotation with transcriptome
# NOTE: do not use compressed fasta files
GENOME_NAME="SRR001665_contigs_greater200.fasta"  # Name of genome fasta file found in 03_data
TRANSCRIPTOME_NAME="evidence.fasta"    # Name of transcriptome fasta file found in 03_data

# Swissprot
SWISSPROT_DB="uniprot_sprot.db"
$ /usr/bin/time -v ./gawn 02_infos/gawn_config.sh &>stdout
$ cat stdout

 ----------------------------------------- 
GAWN - Genome Annotation Without Nightmares
 ----------------------------------------- 

02_infos/gawn_config.sh
 _______________________________________________________________________
/                                                                       \
| \nGAWN: Indexing genome
| --------------------------------------------------------------------- |
-k flag not specified, so building with default 15-mers
Sorting chromosomes in chrom order.  To turn off or sort other ways, use the -s flag.
Creating files in directory 03_data/indexed_genome
Running "/usr/lib/gmap/fa_coords"     -o "03_data/indexed_genome.coords" -f "03_data/indexed_genome.sources"
Opening file 03_data/SRR001665_contigs_greater200.fasta
  Processed short contigs (<1000000 nt): ...................................................................................................More than 100 short contigs.  Will stop printing.

============================================================
Contig mapping information has been written to file 03_data/indexed_genome.coords.
You should look at this file, and edit it if necessary
If everything is okay, you should proceed by running
    make gmapdb
============================================================
Running "/usr/lib/gmap/gmap_process"  -c "03_data/indexed_genome.coords" -f "03_data/indexed_genome.sources" | "/usr/lib/gmap/gmapindex"  -d indexed_genome -D "03_data/indexed_genome" -A 
Reading coordinates from file 03_data/indexed_genome.coords
Logging contig contig_10293114 at contig_10293114:1..461 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_1990817 at contig_1990817:1..803 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_2825224 at contig_2825224:1..2671 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_3235758 at contig_3235758:1..7205 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_4165115 at contig_4165115:1..6777 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_4270659 at contig_4270659:1..13060 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_4302725 at contig_4302725:1..12221 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_5182773 at contig_5182773:1..3558 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_5850706 at contig_5850706:1..1061 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_6139417 at contig_6139417:1..15743 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_6922011 at contig_6922011:1..7200 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_7040037 at contig_7040037:1..12701 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_767663 at contig_767663:1..11555 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_992401 at contig_992401:1..4328 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_10263289 at contig_10263289:1..9800 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_1275019 at contig_1275019:1..2348 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_1467656 at contig_1467656:1..4439 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_2913181 at contig_2913181:1..10354 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_6393479 at contig_6393479:1..419 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_6916793 at contig_6916793:1..1939 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_7048192 at contig_7048192:1..8206 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_7852423 at contig_7852423:1..4287 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_7993995 at contig_7993995:1..1449 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_8425974 at contig_8425974:1..7394 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_8558094 at contig_8558094:1..1897 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_8650055 at contig_8650055:1..9126 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_9296210 at contig_9296210:1..27794 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_9392475 at contig_9392475:1..3785 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_9594499 at contig_9594499:1..5134 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_2479530 at contig_2479530:1..1045 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_5659803 at contig_5659803:1..15580 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_6019695 at contig_6019695:1..6133 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_6409070 at contig_6409070:1..26765 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_682563 at contig_682563:1..216 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_6907888 at contig_6907888:1..2049 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_7146251 at contig_7146251:1..249 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_7192674 at contig_7192674:1..10791 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_7207442 at contig_7207442:1..13078 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_7235720 at contig_7235720:1..2256 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_8775926 at contig_8775926:1..2871 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_9091474 at contig_9091474:1..14820 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_10181993 at contig_10181993:1..7049 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_1426058 at contig_1426058:1..7386 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_1811339 at contig_1811339:1..7447 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_2951315 at contig_2951315:1..2669 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_3399947 at contig_3399947:1..2087 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_3959747 at contig_3959747:1..280 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_7193903 at contig_7193903:1..6273 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_7282878 at contig_7282878:1..4505 in genome indexed_genome
 => primary (linear) chromosome
Logging contig contig_7681370 at contig_7681370:1..14144 in genome indexed_genome
 => primary (linear) chromosome
More than 50 contigs.  Will stop printing messages
Total genomic length = 4542792 bp
Have a total of 509 chromosomes
Writing chromosome file 03_data/indexed_genome/indexed_genome.chromosome
Chromosome contig_24343 has universal coordinates 1..251
Chromosome contig_45221 has universal coordinates 252..705
Chromosome contig_53152 has universal coordinates 706..8434
Chromosome contig_176392 has universal coordinates 8435..9217
Chromosome contig_235461 has universal coordinates 9218..49727
Chromosome contig_356264 has universal coordinates 49728..50437
Chromosome contig_359465 has universal coordinates 50438..55046
Chromosome contig_372625 has universal coordinates 55047..57797
Chromosome contig_377765 has universal coordinates 57798..64707
Chromosome contig_407593 has universal coordinates 64708..68126
Chromosome contig_414500 has universal coordinates 68127..75840
Chromosome contig_449583 has universal coordinates 75841..105061
Chromosome contig_460419 has universal coordinates 105062..109783
Chromosome contig_486889 has universal coordinates 109784..120399
Chromosome contig_534487 has universal coordinates 120400..122131
Chromosome contig_535793 has universal coordinates 122132..123830
Chromosome contig_555012 has universal coordinates 123831..179288
Chromosome contig_555933 has universal coordinates 179289..184470
Chromosome contig_556834 has universal coordinates 184471..185570
Chromosome contig_592056 has universal coordinates 185571..190262
Chromosome contig_602108 has universal coordinates 190263..196783
Chromosome contig_633224 has universal coordinates 196784..198490
Chromosome contig_634687 has universal coordinates 198491..205619
Chromosome contig_650065 has universal coordinates 205620..219728
Chromosome contig_682563 has universal coordinates 219729..219944
Chromosome contig_695464 has universal coordinates 219945..220621
Chromosome contig_721559 has universal coordinates 220622..255972
Chromosome contig_767663 has universal coordinates 255973..267527
Chromosome contig_776676 has universal coordinates 267528..281040
Chromosome contig_788404 has universal coordinates 281041..300920
Chromosome contig_789892 has universal coordinates 300921..303507
Chromosome contig_839883 has universal coordinates 303508..328951
Chromosome contig_853386 has universal coordinates 328952..339631
Chromosome contig_872771 has universal coordinates 339632..347448
Chromosome contig_897859 has universal coordinates 347449..363035
Chromosome contig_900031 has universal coordinates 363036..376471
Chromosome contig_933230 has universal coordinates 376472..393525
Chromosome contig_955799 has universal coordinates 393526..400362
Chromosome contig_989836 has universal coordinates 400363..410256
Chromosome contig_992401 has universal coordinates 410257..414584
Chromosome contig_1000503 has universal coordinates 414585..428712
Chromosome contig_1005279 has universal coordinates 428713..429065
Chromosome contig_1036282 has universal coordinates 429066..430909
Chromosome contig_1038394 has universal coordinates 430910..431973
Chromosome contig_1057666 has universal coordinates 431974..432749
Chromosome contig_1144471 has universal coordinates 432750..433024
Chromosome contig_1173093 has universal coordinates 433025..444356
Chromosome contig_1215359 has universal coordinates 444357..447896
Chromosome contig_1257485 has universal coordinates 447897..464323
Chromosome contig_1275019 has universal coordinates 464324..466671
More than 50 contigs.  Will stop printing messages
Writing chromosome IIT file 03_data/indexed_genome/indexed_genome.chromosome.iit
Writing IIT file header information...coordinates require 4 bytes each...done
Processing null division/chromosome...sorting...writing...done (509 intervals)
Writing IIT file footer information...done
Writing IIT file header information...coordinates require 4 bytes each...done
Processing null division/chromosome...sorting...writing...done (509 intervals)
Writing IIT file footer information...done
No alternate scaffolds observed
Running "/usr/lib/gmap/gmap_process"  -c "03_data/indexed_genome.coords" -f "03_data/indexed_genome.sources" | "/usr/lib/gmap/gmapindex"  -d indexed_genome -F "03_data/indexed_genome" -D "03_data/indexed_genome" -G
Genome length is 4542792 nt
Trying to allocate 425889*4 bytes of memory...succeeded.  Building genome in memory.
Reading coordinates from file 03_data/indexed_genome.coords
Writing contig contig_10293114 to universal coordinates 4527948..4528408
Writing contig contig_1990817 to universal coordinates 711269..712071
Writing contig contig_2825224 to universal coordinates 945851..948521
Writing contig contig_3235758 to universal coordinates 1112305..1119509
Writing contig contig_4165115 to universal coordinates 1441933..1448709
Writing contig contig_4270659 to universal coordinates 1450860..1463919
Writing contig contig_4302725 to universal coordinates 1463920..1476140
Writing contig contig_5182773 to universal coordinates 1898605..1902162
Writing contig contig_5850706 to universal coordinates 2076850..2077910
Writing contig contig_6139417 to universal coordinates 2196498..2212240
Writing contig contig_6922011 to universal coordinates 2512854..2520053
Writing contig contig_7040037 to universal coordinates 2555128..2567828
Writing contig contig_767663 to universal coordinates 255973..267527
Writing contig contig_992401 to universal coordinates 410257..414584
Writing contig contig_10263289 to universal coordinates 4517007..4526806
Writing contig contig_1275019 to universal coordinates 464324..466671
Writing contig contig_1467656 to universal coordinates 540170..544608
Writing contig contig_2913181 to universal coordinates 1000611..1010964
Writing contig contig_6393479 to universal coordinates 2271333..2271751
Writing contig contig_6916793 to universal coordinates 2510915..2512853
Writing contig contig_7048192 to universal coordinates 2569810..2578015
Writing contig contig_7852423 to universal coordinates 2982252..2986538
Writing contig contig_7993995 to universal coordinates 3009703..3011151
Writing contig contig_8425974 to universal coordinates 3284433..3291826
Writing contig contig_8558094 to universal coordinates 3352042..3353938
Writing contig contig_8650055 to universal coordinates 3380962..3390087
Writing contig contig_9296210 to universal coordinates 3976533..4004326
Writing contig contig_9392475 to universal coordinates 4140350..4144134
Writing contig contig_9594499 to universal coordinates 4256322..4261455
Writing contig contig_2479530 to universal coordinates 834101..835145
Writing contig contig_5659803 to universal coordinates 2036886..2052465
Writing contig contig_6019695 to universal coordinates 2121794..2127926
Writing contig contig_6409070 to universal coordinates 2287474..2314238
Writing contig contig_682563 to universal coordinates 219729..219944
Writing contig contig_6907888 to universal coordinates 2508866..2510914
Writing contig contig_7146251 to universal coordinates 2611676..2611924
Writing contig contig_7192674 to universal coordinates 2634499..2645289
Writing contig contig_7207442 to universal coordinates 2655003..2668080
Writing contig contig_7235720 to universal coordinates 2692919..2695174
Writing contig contig_8775926 to universal coordinates 3499694..3502564
Writing contig contig_9091474 to universal coordinates 3897132..3911951
Writing contig contig_10181993 to universal coordinates 4509958..4517006
Writing contig contig_1426058 to universal coordinates 513065..520450
Writing contig contig_1811339 to universal coordinates 617904..625350
Writing contig contig_2951315 to universal coordinates 1010965..1013633
Writing contig contig_3399947 to universal coordinates 1151182..1153268
Writing contig contig_3959747 to universal coordinates 1339275..1339554
Writing contig contig_7193903 to universal coordinates 2648730..2655002
Writing contig contig_7282878 to universal coordinates 2775352..2779856
More than 50 contigs.  Will stop printing messages
A total of 0 non-ACGTNX characters were seen in the genome.
Running cat "03_data/indexed_genome/indexed_genome.genomecomp" | "/usr/lib/gmap/gmapindex" -d indexed_genome -U > "03_data/indexed_genome/indexed_genome.genomebits128"
Running cat "03_data/indexed_genome/indexed_genome.genomecomp" | "/usr/lib/gmap/gmapindex" -k 15 -q 3  -d indexed_genome -F "03_data/indexed_genome" -D "03_data/indexed_genome" -N
Counting positions in genome indexed_genome (15 bp every 3 bp), position 0
Number of offsets: 1512047 => pages file not required
Running "/usr/lib/gmap/gmapindex" -k 15 -q 3  -d indexed_genome -F "03_data/indexed_genome" -D "03_data/indexed_genome" -O  "03_data/indexed_genome/indexed_genome.genomecomp"
Offset compression types: bitpack64
Allocating 16777216*1 bytes for packsizes
Allocating 16777216*8 bytes for bitpacks
Indexing offsets of oligomers in genome indexed_genome (15 bp every 3 bp), position 0
Writing 1073741825 offsets compressed via bitpack64...done
Running "/usr/lib/gmap/gmapindex" -k 15 -q 3  -d indexed_genome -F "03_data/indexed_genome" -D "03_data/indexed_genome" -P "03_data/indexed_genome/indexed_genome.genomecomp"
Looking for index files in directory 03_data/indexed_genome
  Pointers file is indexed_genome.ref153offsets64meta
  Offsets file is indexed_genome.ref153offsets64strm
  Positions file is indexed_genome.ref153positions
Expanding offsetsstrm into counters...done
Allocating 21797152 bytes for counterstrm
Trying to allocate 1512047*4 bytes of memory for positions...succeeded.  Building positions in memory.
Indexing positions of oligomers in genome indexed_genome (15 bp every 3 bp), position 0
Writing 1512047 genomic positions to file 03_data/indexed_genome/indexed_genome.ref153positions ...
done
Running "/usr/lib/gmap/gmapindex" -d indexed_genome -F "03_data/indexed_genome" -D "03_data/indexed_genome" -S
Genome length is 4542792
Building suffix array
SACA_K called with n = 4542793, K = 5, level 0
SACA_K called with n = 1276733, K = 0, level 1
SACA_K called with n = 408368, K = 0, level 2
SACA_K called with n = 133769, K = 0, level 3
For indexsize 12, occupied 3462610/16777216
Optimal indexsize = 12
Running "/usr/lib/gmap/gmapindex" -d indexed_genome -F "03_data/indexed_genome" -D "03_data/indexed_genome" -L
Building LCP array
Writing temporary file for rank...done
Writing temporary file for permuted sarray...done
Byte-coding: 4542793 values < 255, 0 exceptions >= 255 (0.0%)
Building DC array
Building child array
Byte-coding: 4527846 values < 255, 14947 exceptions >= 255 (0.3%)
Writing file 03_data/indexed_genome/indexed_genome.salcpchilddcdone
Found 0 exceptions
|                                                                       |
\_______________________________________________________________________/

 _______________________________________________________________________
/                                                                       \
| GAWN: Annotating genome with transcriptome
| --------------------------------------------------------------------- |
|                                                                       |
\_______________________________________________________________________/

 _______________________________________________________________________
/                                                                       \
| GAWN: Adding UTR-3 and UTR-5 regions
| --------------------------------------------------------------------- |
GAWN: Create GTF file from GFF3
GAWN: Create genome-based transcriptome
-parsing cufflinks output: 04_annotation/SRR001665_contigs_greater200.gtf
-parsing genome fasta: 03_data/SRR001665_contigs_greater200.fasta
-done parsing genome.
GAWN: Creage predicted GFF3 from GTF file
GAWN: Find best ORF candidates
NAME
    Transdecoder.LongOrfs <http://transdecoder.github.io> - Transcriptome
    Protein Prediction

USAGE
    Required:

     -t <string>                            transcripts.fasta

    Optional:

     --gene_trans_map <string>              gene-to-transcript identifier mapping file (tab-delimited, gene_id<tab>trans_id<return> ) 

     -m <int>                               minimum protein length (default: 100)

     -G <string>                            genetic code (default: universal; see PerlDoc; options: Euplotes, Tetrahymena, Candida, Acetabularia)

     -S                                     strand-specific (only analyzes top strand)

     -p <int>                               shorten potential 5' partials if they are this percentage of the original protein or longer.

Genetic Codes
    See <http://golgi.harvard.edu/biolinks/gencode.html>. These are
    currently supported:

     universal (default)
     Euplotes
     Tetrahymena
     Candida
     Acetabularia
     Mitochondrial-Canonical
     Mitochondrial-Vertebrates
     Mitochondrial-Arthropods
     Mitochondrial-Echinoderms
     Mitochondrial-Molluscs
     Mitochondrial-Ascidians
     Mitochondrial-Nematodes
     Mitochondrial-Platyhelminths
     Mitochondrial-Yeasts
     Mitochondrial-Euascomycetes
     Mitochondrial-Protozoans

GAWN: Move transdecoder_dir
mv: cannot stat 'SRR001665_contigs_greater200.cdna.transdecoder_dir': No such file or directory
GAWN: Create final genome annotation file
Error, cannot locate file: 04_annotation/SRR001665_contigs_greater200.cdna.transdecoder_dir/longest_orfs.gff3 at ./01_scripts/TransDecoder/util/cdna_alignment_orf_to_genome_orf.pl line 23.
GAWN: Copy genome annotation to 05_results
|                                                                       |
\_______________________________________________________________________/

 _______________________________________________________________________
/                                                                       \
| GAWN: Annotating transcriptome with swissprot
| --------------------------------------------------------------------- |
SWISS DB /home/stelarov/programming/rna_seq/gap/packages/lib/uniprot_sprot

The process hangs at the blastx command and never exits.

The output directories contain:

$ ls 03_data/
evidence.fasta  indexed_genome  SRR001665_contigs_greater200.fasta
$
$ ls 04_annotation/
evidence.hits       genbank_info                       SRR001665_contigs_greater200.gawn_annotated.gff3  SRR001665_contigs_greater200.gtf
evidence.swissprot  SRR001665_contigs_greater200.cdna  SRR001665_contigs_greater200.gff3                 SRR001665_contigs_greater200.predicted.gff3
$
$ ls 05_results/
evidence_annotation_table.tsv  SRR001665_contigs_greater200_annotation_table.tsv  SRR001665_contigs_greater200.gawn_annotated.gff3
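
In the log above, the "Move transdecoder_dir" step fails with `mv: cannot stat`, yet the pipeline carries on and only reports the missing `longest_orfs.gff3` one step later. A minimal sketch of a fail-fast guard follows, assuming it is called right after the ORF-finding step. `require_dir` is a hypothetical helper, not part of GAWN; the directory name in the usage comment is copied from the log.

```bash
# Hypothetical helper (not part of GAWN): report a missing output directory
# immediately instead of letting later steps fail with a less direct error.
require_dir() {
    local dir="$1"
    if [ ! -d "$dir" ]; then
        echo "ERROR: expected directory not found: $dir" >&2
        return 1
    fi
}

# Possible use after the ORF-finding step (directory name from the log above):
# require_dir "SRR001665_contigs_greater200.cdna.transdecoder_dir" || exit 1
```

With a guard like this the run would stop at the first missing directory, which makes it easier to see that the ORF-finding step printed its usage text instead of producing output.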

enormandeau commented 7 years ago

Please see: https://github.com/enormandeau/gawn/issues/5

GAWN is currently broken with some newer versions of its dependencies. I need to fix this and update the dependency requirements.

chamalis commented 7 years ago

I updated my initial comment, since most of the stdout errors were caused by a BLAST database misconfiguration. I will keep an eye out for updates. Thanks :)
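
For anyone hitting the same misconfiguration, here is a rough sketch of a check that could be run before starting the pipeline. It assumes the standard layout written by `makeblastdb` for protein databases: an index file `<prefix>.pin` for a single-volume database, or an alias file `<prefix>.pal` for a multi-volume one. `blast_protein_db_exists` is a hypothetical helper, not part of GAWN or BLAST, and the prefix to pass is whatever path the "SWISS DB" line of the log reports.

```bash
# Hypothetical helper (not part of GAWN or BLAST): succeed only if a protein
# BLAST database appears to exist at the given path prefix.
blast_protein_db_exists() {
    local prefix="$1"
    # Single-volume databases have <prefix>.pin; multi-volume ones have
    # an alias file <prefix>.pal.
    [ -f "${prefix}.pin" ] || [ -f "${prefix}.pal" ]
}

# Possible use, with DB_PREFIX set to the path shown on the "SWISS DB" line:
# blast_protein_db_exists "$DB_PREFIX" || { echo "BLAST db not found: $DB_PREFIX" >&2; exit 1; }
```

This only confirms that the files are present; it does not verify that the database is complete or that it was built from the intended Swiss-Prot release.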

enormandeau commented 6 years ago

Should be fixed in v0.3 where I no longer look for UTR regions.

If you try v0.3, please report your success.