GroundB / RepeatDefeaters

MIT License

Error caused by process 'ANNOTATE_REPEATS' #5

Open mmontonerin opened 2 years ago

mmontonerin commented 2 years ago

Command run: nextflow run -params-file params.yml -c custom.config -profile uppmax /proj/rosling_storage/AMF/b2017181_nobackup/merce/try_RepeatDefeaters/RepeatDefeaters_ccandik

I checked the .command files in the ANNOTATE_REPEATS work directory, but they gave no further information about what the problem might be.

Error message:

[41/fe26b5] process > RENAME_REPEAT_MODELER_SEQUENCES  [100%] 1 of 1 ✔
[9e/bd9447] process > PFAM_TRANSPOSIBLE_ELEMENT_SEARCH [100%] 1 of 1 ✔
[f8/66daec] process > BUILD_PROTEIN_REF_BLAST_DB (1)   [100%] 1 of 1 ✔
[d4/4238ee] process > BLASTX_AND_FILTER (1)            [100%] 2 of 2 ✔
[bf/5590b1] process > PFAM_SCAN (2)                    [100%] 2 of 2 ✔
[49/5e98ba] process > ANNOTATE_REPEATS                 [100%] 1 of 1, failed: 1 ✘
[ca/cb7da2] process > BUILD_TREP_BLAST_DB (1)          [100%] 1 of 1 ✔
[-        ] process > TREP_BLASTN                      -
[-        ] process > ADD_TREP_ANNOTATION              -
[c2/3f37c5] process > CUSTOM_HMM_SCAN (2)              [100%] 2 of 2 ✔
[3e/db88fa] process > MERGE_DOMAIN_TABLE (2)           [100%] 2 of 2 ✔
[-        ] process > REANNOTATE_REPEATS               -
[-        ] process > BUILD_ANNOTATED_LIB_BLAST_DB     -
[-        ] process > RECIPROCAL_BLASTN                -
[-        ] process > REDUNDANT_HITS                   -
Error executing process > 'ANNOTATE_REPEATS'                                                                              

Caused by:                                                                                                                
  Process `ANNOTATE_REPEATS` terminated with an error exit status (1)                                                     

Command executed:                                                                                                         

  # Find unclassified consensus with TE domains                                                                           
  for TBL in ccandi_k.minus.predicted.pfamtbl ccandi_k.plus.predicted.pfamtbl; do                                         
      # grep #1    : Find unclassified consensus                                                                          
      # grep #2    : which have TE domains                                                                                
      # cut + uniq : and extract their id's                                                                               
      grep -i "#unknown" "$TBL" | \                                                                                       
          tee "${TBL}.unclassified" | \                                                                                   
          grep -i -w -f Pfam.Proteins_wTE_Domains.seqid | \                                                               
          tee -a ccandi_k.Unclassified_consensus_TEs | \                                                                  
          cut -f1 -d"#" | uniq > "${TBL/.pfamtbl/.unclassified_ids}"                                                      
  done                                                                                                                    

  # Concatenate ids of consensus with TE domains from both strands                                                        
  cat *.unclassified_ids | uniq > ccandi_k.Unclassified_consensus_TEs.ids                                                 

  # Find unclassified consensus without TE domains                                                                        
  for UNCLASSIFIED in *.unclassified; do                                                                                  
      # grep       : Remove consensus which have TE domains                                                               
      # cut + uniq : and extract their id's                                                                               
      grep -v -f ccandi_k.Unclassified_consensus_TEs.ids "$UNCLASSIFIED" | \                                              
          tee "${UNCLASSIFIED}.TEpurged" | \                                                                              
          cut -f1 -d'#' | uniq > "${UNCLASSIFIED}.TEpurged.ids"                                                           
  done                                                                                                                    

  # Use shell expansion to expand plus and minus strand files for unsorted inner join                                     
  grep -f *.TEpurged.ids > ccandi_k.consensus.both.strand                                                                 
  # ccandi_k.consensus.both.strand : Unclassified consensus sequences that have                                           
  # non-TE domains detected in both strands.                                                                              
  # These are tricky to annotate.                                                                                         

  # In consensus without TE domains, remove consensus with non-TE domains on both strands                                 
  # (leaving consensus with non-TE domains on a single-strand)                                                            
  for TEPURGED in *.TEpurged; do                                                                                          
      # grep       : Remove consensus with non-TE domains on both strands                                                 
      # awk        : then remove consensus shorter than 100 amino acids                                                   
      # cut + uniq : and extract their id's                                                                               
      grep -v -f ccandi_k.consensus.both.strand "$TEPURGED" | \                                                           
          awk '$11 >= 100' | tee "$TEPURGED.mono" | \                                                                     
          cut -f1 -d'#' | uniq > "$TEPURGED.mono.ids"                                                                     
  done                                                                                                                    

  # Make a copy of repeat library to be modified.                                                                         
  cp ccandi_k.fasta ccandi_k.renamed.fasta                                                                                

  # Rename repeat model based on strand evidence.                                                                         
  for CONSENSUS in *.mono.ids; do                                                                                         
      # while       : for each consensus id                                                                               
      # echo        : record id as renamed                                                                                
      # NAMEHASH    : create a name suffix from the pfam domain table                                                     
      # OLDNAME     : find old name from repeat consensus library                                                         
      # sed         : replace Unknown with NAMEHASH                                                                       
      while read -r SEQID; do                                                                                             
          echo "$SEQID" >> ccandi_k.renamed                                                                               
          NAMEHASH=$( grep "${SEQID}#" "${CONSENSUS/.ids/}" | \                                                           
              tr -s " " "       " | cut -f7 | \                                                                           
              sort | uniq | \                                                                                             
              paste -s -d '-' )                                                                                           
          OLDNAME=$( grep "${SEQID}#" ccandi_k.fasta | cut -c2- )                                                         
          sed -i "s|$OLDNAME|${OLDNAME%Unknown}$NAMEHASH|g" ccandi_k.renamed.fasta                                        
      done < "$CONSENSUS"                                                                                                 
  done                                                                                                                    

  cat <<-END_VERSIONS > versions.yml                                                                                      
  "ANNOTATE_REPEATS":                                                                                                     
      awk  : $( awk  -W version |& head -n1 )                                                                             
      cat  : $( cat   --version |& head -n1 )                                                                             
      cut  : $( cut   --version |& head -n1 )                                                                             
      grep : $( grep  --version |& head -n1 )                                                                             
      paste: $( paste --version |& head -n1 )                                                                             
      sed  : $( sed   --version |& head -n1 )                                                                             
      sort : $( sort  --version |& head -n1 )                                                                             
      tee  : $( tee   --version |& head -n1 )                                                                             
      uniq : $( uniq  --version |& head -n1 )                                                                             
  END_VERSIONS                                                                                                            

Command exit status:                                                                                                      
  1                                                                                                                       

Command output:                                                                                                           
  (empty)                                                                                                                 

Command wrapper:                                                                                                          
  nxf-scratch-dir r49:/scratch/25481205/nxf.wzUqn0kDFO                                                                    

Work dir:                                                                                                                 
  /proj/rosling_storage/AMF/b2017181_nobackup/merce/try_RepeatDefeaters/RepeatDefeaters_ccandik/49/5e98bafe759cc9966cd2a7f2eae1b2                                                                                                                   

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`

Params file

## The absolute path (full path, begins with / ) to the input data
## Repeat modeler library
repeat_modeler_fasta : '/proj/rosling_storage/AMF/comparative_genomics/annotation_v4/repeats/ccandi_k_combined_idrenamed_short/repeatmodeler/RM_21933.ThuApr221305072021/consensi.fa'
## Species short name for renaming sequences
species_short_name : 'ccandi_k'

## Workflow outputs
## The absolute path (full path, begins with / ) to the results folder
results : '/proj/rosling_storage/AMF/b2017181_nobackup/merce/try_RepeatDefeaters/results'

## Optional inputs (Remove # to uncomment)
## protein reference
protein_reference :
    - 'https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz'
#    - '<additional reference1>'
#    - '<additional reference2>'
## Path to key words (Describes PFAM entries with TE domains)
transposon_keywords : "./assets/pfam_te_domain_keywords.txt"
## Path to key words blacklist (Describes PFAM entries with TE domains that should be removed)
transposon_blacklist : "./assets/te_domain_keyword_blacklist.txt"
## Path to PFAM accession list of proteins with TE domains (skips PFAM_TRANSPOSIBLE_ELEMENT_SEARCH process)
#pfam_proteins_with_te_domain_list : '$baseDir/assets/pfam_te_domain_keywords.txt'
#pfam_proteins_with_te_domain_list : '$baseDir/assets/Pfam_R32.Proteins_wTE_Domains.seqid'
## PFAM HMM database path
pfam_hmm_db  : "ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam32.0/Pfam-A.hmm.gz"
pfam_hmm_dat : "ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam32.0/Pfam-A.hmm.dat.gz"
## PFAM-A database path
pfam_a_db : "ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam32.0/Pfam-A.full.uniprot.gz"

## Workflow package manager configuration
## Use conda instead of containers
# enable_conda : false
## When using singularity, construct image from a docker image
# singularity_pull_docker_container : false

## Uppmax cluster configuration
## UPPMAX project - Needed only when running on an UPPMAX cluster
project : 'snic2022-5-42'
## Convenience for adding additional cluster options to UPPMAX
##clusterOptions : ''

Custom.config file

// Nextflow configuration
// The absolute path (full path, begins with / ) to the work directory ( where intermediate results are stored )
// If you have a SNIC Storage allocation, use the nobackup folder in there.
workDir = '/proj/rosling_storage/AMF/b2017181_nobackup/merce/try_RepeatDefeaters/RepeatDefeaters_ccandik'
// Resume analysis from the last complete process executions (not from the beginning).
resume = true

// Uncomment to enable workflow reporting
// Workflow reporting
timeline {
    enabled = true
    file = "/proj/rosling_storage/AMF/b2017181_nobackup/merce/try_RepeatDefeaters/RepeatDefeaters_ccandik/pipeline_info/execution_timeline.html"
}
report {
    enabled = true
    file = "/proj/rosling_storage/AMF/b2017181_nobackup/merce/try_RepeatDefeaters/RepeatDefeaters_ccandik/pipeline_info/execution_report.html"
}
trace {
    enabled = true
    file = "/proj/rosling_storage/AMF/b2017181_nobackup/merce/try_RepeatDefeaters/RepeatDefeaters_ccandik/pipeline_info/execution_trace.txt"
}
dag {
    enabled = true
    file = "/proj/rosling_storage/AMF/b2017181_nobackup/merce/try_RepeatDefeaters/RepeatDefeaters_ccandik/pipeline_info/pipeline_dag.svg"
}

mahesh-panchal commented 2 years ago

I think I found the immediate cause of the error.

grep -i "#unknown" ccandi_k.minus.predicted.pfamtbl ccandi_k.plus.predicted.pfamtbl

This returns an exit status of 1 because grep finds no match for '#unknown' in either table. The question now is why there is no '#unknown' keyword.
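
For anyone wanting to reproduce this outside the workflow, here is a minimal sketch of how a non-matching grep can fail a whole pipeline. It assumes the process shell runs with `pipefail` enabled, which I have not confirmed for this pipeline's configuration. The file `sample.pfamtbl` and its contents are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: a grep that matches nothing exits with status 1.
# sample.pfamtbl is a hypothetical stand-in for the real domain tables.
workdir=$(mktemp -d)
printf 'rnd-1_family-1#LTR/Gypsy_1 PF00078\n' > "$workdir/sample.pfamtbl"

# Without pipefail, the pipeline reports the status of its last command (uniq).
set +o pipefail
grep -i "#unknown" "$workdir/sample.pfamtbl" | cut -f1 -d'#' | uniq > /dev/null
status_default=$?

# With pipefail, the pipeline reports grep's non-zero status instead.
# The "|| status_pipefail=$?" guard keeps this line safe under errexit.
set -o pipefail
status_pipefail=0
grep -i "#unknown" "$workdir/sample.pfamtbl" | cut -f1 -d'#' | uniq > /dev/null || status_pipefail=$?
set +o pipefail

echo "default: $status_default, pipefail: $status_pipefail"
rm -rf "$workdir"
```

If both `errexit` and `pipefail` are active in the process shell, the second form is enough to abort the task with exit status 1 and no output, which is consistent with the report above.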

mahesh-panchal commented 2 years ago

@GroundB Which program adds the '#unknown' tag? The workflow seems to require it in the input.

GroundB commented 2 years ago

It is RepeatModeler2. Its release includes a program called RepeatClassifier that annotates the models: https://github.com/Dfam-consortium/RepeatModeler/blob/30875549276ae88df77ac53d6d2272a0c1b526cc/RepeatClassifier

mahesh-panchal commented 2 years ago

@mmontonerin Is your input from RepeatModeler2 or something else?

mmontonerin commented 2 years ago

The consensi.fa repeat data that I have was generated by RepeatModeler/1.0.8_RM4.0.7
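
A quick way to check whether a library carries the tag the process looks for is to count the FASTA headers that contain it. The sketch below uses a made-up three-sequence library; the variable names and sample headers are assumptions for illustration, and the real `consensi.fa` path would be substituted for `$library`.

```shell
#!/usr/bin/env bash
# Sketch: count FASTA headers that carry a "#Unknown" classification tag.
# The sample library below is made up for illustration.
library=$(mktemp)
cat > "$library" <<'EOF'
>rnd-1_family-1#Unknown
ACGTACGT
>rnd-1_family-2#LTR/Gypsy
TTGACCA
>rnd-1_family-3
GGCATTA
EOF

total=$(grep -c '^>' "$library")
# grep -c prints 0 but exits 1 when nothing matches, so guard with "|| true".
unknown=$(grep -c -i '^>[^ ]*#unknown' "$library" || true)
untagged=$(grep '^>' "$library" | grep -c -v '#' || true)

echo "headers: $total, #Unknown: $unknown, no tag: $untagged"
rm -f "$library"
```

A count of zero for `#Unknown` on the real input would match the grep failure seen in ANNOTATE_REPEATS.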