cgroza / GraffiTE

GraffiTE is a pipeline that finds polymorphic transposable elements in genome assemblies and/or long reads, and genotypes the discovered polymorphisms in read sets using genome-graphs.

GraffiTE stops at the tsd_prep step when bypassing SV discovery #31

Open SarahBailey1998 opened 4 months ago

SarahBailey1998 commented 4 months ago

Hi,

I'm trying to run GraffiTE in the mode that bypasses the SV calling steps but it seems to get stuck at the tsd_prep step. I was wondering if you had any idea why?

Loading nextflow/23.04.4
  Loading requirement: Java/17.0.4
N E X T F L O W  ~  version 23.04.4
Launching `/workspace/Repo/GraffiTE/main.nf` [voluminous_sanger] DSL2 - revision: 20270181eb

▄████  ██▀███   ▄▄▄        █████▒ █████▒██▓▄▄▄█████▓▓█████
██▒ ▀█▒▓██ ▒ ██▒▒████▄    ▓██   ▒▓██           ██▒ ▓▒▓█   ▀
▒██░▄▄▄░▓██ ░▄█ ▒▒██  ▀█▄  ▒████ ░▒████ ░▒██▒▒ ▓██░ ▒░▒███
░▓█  ██▓▒██▀▀█▄  ░██▄▄▄▄██ ░▓█▒  ░░▓█▒  ░░██░░ ▓██▓ ░ ▒▓█  ▄
░▒▓███▀▒░██▓ ▒██▒  █   ▓██▒░▒█░   ░▒█░   ░██░  ▒██▒ ░ ░▒████▒
░▒   ▒ ░ ▒▓ ░▒▓░ ▒▒   ▓▒█░ ▒ ░    ▒ ░   ░▓    ▒ ░░   ░░ ▒░ ░
░   ░   ░▒ ░ ▒░  ▒   ▒▒ ░ ░      ░      ▒ ░    ░     ░ ░  ░
░ ░   ░   ░░   ░   ░   ▒    ░ ░    ░ ░    ▒ ░  ░         ░
░    ░           ░  ░               ░              ░  ░

V . null

Find and Genotype Transposable Elements Insertion Polymorphisms
in Genome Assemblies using a Pangenomic Approach

Authors: Cristian Groza and Clément Goubert
Bug/issues: https://github.com/cgroza/GraffiTE/issues

[-        ] process > repeatmask_VCF -
[-        ] process > tsd_prep       -

[-        ] process > repeatmask_VCF -
[-        ] process > tsd_prep       -
[-        ] process > tsd_search     -
[-        ] process > tsd_report     -

[-        ] process > repeatmask_VCF [  0%] 0 of 1
[-        ] process > tsd_prep       -
[-        ] process > tsd_search     -
[-        ] process > tsd_report     -

executor >  local (1)
[a0/da2e6a] process > repeatmask_VCF (1) [  0%] 0 of 1
[-        ] process > tsd_prep           -
[-        ] process > tsd_search         -
[-        ] process > tsd_report         -

executor >  local (2)
[a0/da2e6a] process > repeatmask_VCF (1) [100%] 1 of 1 ✔
[1c/8d8948] process > tsd_prep (1)       [  0%] 0 of 1
[-        ] process > tsd_search         -
[-        ] process > tsd_report         -

executor >  local (2)
[a0/da2e6a] process > repeatmask_VCF (1) [100%] 1 of 1 ✔
[1c/8d8948] process > tsd_prep (1)       [100%] 1 of 1 ✔
[-        ] process > tsd_search         -
[-        ] process > tsd_report         -
Completed at: 23-May-2024 17:17:41
Duration    : 1m 10s
CPU hours   : 0.1
Succeeded   : 2

I'm using a VCF file from Sniffles2, but we saw the same behavior with a VCF from SVIM-asm.

My reads.csv contains:

path,sample,type
./guppy_v6.4.6_sup.fq.gz,<tag>,ont

The commands I tried:

nextflow run main.nf \
    --vcf $vcfFile --genotype false \
    --reference $referenceGenome \
    --TE_library $TElib \
    --reads ./reads.csv \
    --graph_method graphaligner \
    --cores 4 \
    --repeatmasker_memory 24G \
    --graph_align_memory 24G \
    --vg_call_memory 24G

nextflow run main.nf \
    -profile cluster \
    -resume \
    --TE_library $TElib \
    --reference $referenceGenome \
    --reads ./reads.csv \
    --graph_method graphaligner \
    --vcf $vcfFile \
    --cores 1 \
    --repeatmasker_memory 24G \
    --graph_align_memory 24G \
    --vg_call_memory 24G

clemgoub commented 4 months ago

Hello @SarahBailey1998, I'm sorry about the issue. It looks similar to this (related to your machine's /tmp directory configuration). The first thing to try would be this.

However, if you think this is not the case, please send us the complete log, as well as the Nextflow process logs for the RepeatMasker process. These should be located in work/a0/da2e6a*/.command.out and work/a0/da2e6a*/.command.err.
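A minimal sketch of how those two files could be collected, assuming the pipeline was launched from the directory that contains work/. The helper name show_task_logs is hypothetical and not part of GraffiTE or Nextflow.

```shell
# Hypothetical helper: print .command.out and .command.err for a Nextflow task.
# $1 is the task directory prefix shown in the run log, e.g. work/a0/da2e6a
show_task_logs() {
    local prefix="$1" dir f found=1
    for dir in "$prefix"*/; do
        # If the glob matched nothing it stays literal, so skip non-directories.
        [ -d "$dir" ] || continue
        found=0
        for f in .command.out .command.err; do
            echo "== ${dir}${f} =="
            if [ -f "${dir}${f}" ]; then
                cat "${dir}${f}"
            else
                echo "(missing)"
            fi
        done
    done
    # Returns non-zero when no task directory matched the prefix.
    return "$found"
}

# Example, run from the launch directory:
# show_task_logs work/a0/da2e6a > repeatmask_VCF_logs.txt
```

The redirected file can then be attached to the issue in one piece.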

Let me know!

Clément

cgroza commented 4 months ago

It also happened to me in the past when RepeatMasker ran out of memory and was killed, leaving an empty annotation and therefore no input for the TSD steps.
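One way to check for that situation is to count the annotation lines in the RepeatMasker output inside the task's work directory. This is only a sketch: the helper name count_rm_hits is hypothetical, and it assumes the usual RepeatMasker .out layout with a three-line header before the hits.

```shell
# Hypothetical helper: count hit lines in a RepeatMasker .out file.
# Assumes a 3-line header (two column-title lines and a blank line).
count_rm_hits() {
    file="$1"
    # A missing or zero-byte file means there is no annotation at all.
    if [ ! -s "$file" ]; then
        echo 0
        return 0
    fi
    # Count non-empty lines after the header.
    awk 'NR > 3 && NF > 0 { n++ } END { print n + 0 }' "$file"
}

# Example, run inside the repeatmask_VCF work directory:
# count_rm_hits repeatmasker_dir/indels.fa.out
```

A count of 0 would be consistent with RepeatMasker having been killed before it wrote any annotation, although it does not prove that memory was the cause.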

SarahBailey1998 commented 4 months ago

Hi,

Thanks for your help. I have updated the Singularity options and added more memory to the run, but I still get an error. I tried using 64 GB; would that be enough?

Here are the logs from repeatmask_VCF:

.command.err:

INFO:    Environment variable SINGULARITYENV_TMPDIR is set, but APPTAINERENV_TMPDIR is preferred
--------------------------------------------------------------------------
A call to mkdir was unable to create the desired directory:

  Directory: /workspace
  Error:     Read-only file system

Please check to ensure you have adequate permissions to perform
the desired operation.
--------------------------------------------------------------------------
[<node>] [[0,1],0] ORTE_ERROR_LOG: Error in file util/session_dir.c at line 107
[<node>] [[0,1],0] ORTE_ERROR_LOG: Error in file util/session_dir.c at line 346
[<node>] [[0,1],0] ORTE_ERROR_LOG: Error in file ess_singleton_module.c at line 340
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_session_dir failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_init failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: ompi_rte_init failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[<node>] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
#################################

Searching in repeatmasker_dir/indels.fa.out

#################################
Finding matching elements
#################################

#################################
Phase 1 : exact matches
#################################
0 matches found in non-fuzzy phase

#################################
8 elements found without match
#################################

#################################
Output file should be manually edited to take into account all specificities of the considered organism!
#################################

Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

   *****       ***   vcfR   ***       *****
   This is vcfR 1.13.0 
     browseVignettes('vcfR') # Documentation
     citation('vcfR') # Citation
   *****       *****      *****       *****

Warning message:
The `x` argument of `as_tibble.matrix()` must have unique column names if
`.name_repair` is omitted as of tibble 2.0.0.
i Using compatibility `.name_repair`. 
awk: cmd. line:1: (FILENAME=- FNR=1) fatal: division by zero attempted
sort: cannot create temporary file in '/workspace/hrasrb/Repo/GraffiTE_temp/temp/': No such file or directory
sort: cannot create temporary file in '/workspace/hrasrb/Repo/GraffiTE_temp/temp/': No such file or directory

.command.out:

RepeatMasker version 4.1.4

WARNING: The nolow option should be used with caution.  This option
         doesn't simply filter out simple repeats and low-complexity
         annotations from the output, rather it doesn't run these
         searches at all.  The simple repeats, and low-complexity
         sequences may then be falsely annotated as fragments of
         TE families that contain short stretches of them.

Search Engine: NCBI/RMBLAST [ 2.13.0+ ]
Using Custom Repeat Library: genome_list.txt.panEDTA.TElib.fa

Building general libraries in: /home/hrasrb/.RepeatMaskerCache//general

analyzing file indels.fa
identifying matches to genome_list.txt.panEDTA.TElib.fa sequences in batch 1 of 1
processing output: 
cycle 1 
cycle 2 
cycle 3 
cycle 4 
cycle 5 
cycle 6 
cycle 7 
cycle 8 
cycle 9 
cycle 10 
Generating output... 
masking
done
Building onecode LTR dictionary...
Running onecode...
Concatenate outputs...
Parse outputs...
Cleanup...
Scanning file to determine attributes.
File attributes:
  meta lines: 80
  header_line: 81
  variant count: 106939
  column count: 10
Meta line 80 read in.
All meta lines processed.
gt matrix initialized.
Character matrix gt created.
  Character matrix gt rows: 106939
  Character matrix gt cols: 10
  skip: 0
  nrows: 106939
  row_num: 0
Processed variant: 106939
All variants processed
 [1] "CHROM"            "POS"              "qry_id"           "REF"             
 [5] "ALT"              "n_hits"           "fragmts"          "match_lengths"   
 [9] "repeat_ids"       "matching_classes" "strands"          "RM_id"           
compute repeat proportion for each SVs...
Mammalian filters OFF, writing vcf...

cgroza commented 4 months ago

Yes, in your case Singularity is not configured correctly:

A call to mkdir was unable to create the desired directory:

  Directory: /workspace
  Error:     Read-only file system

Try unsetting SINGULARITYENV_TMPDIR

unset SINGULARITYENV_TMPDIR

Also, please post your GraffiTE/nextflow.config.
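Before rerunning, it may help to confirm that any temporary directory handed to the container actually exists and is writable, since the sort errors in the log point at a path that could not be found. A minimal sketch, with a hypothetical helper name check_tmp_dir. It only checks the host side and says nothing about whether the path is bound inside the container.

```shell
# Hypothetical helper: report whether a directory exists and is writable.
check_tmp_dir() {
    dir="$1"
    if [ ! -d "$dir" ]; then
        echo "missing: $dir"
        return 1
    fi
    if [ ! -w "$dir" ]; then
        echo "not writable: $dir"
        return 1
    fi
    echo "ok: $dir"
}

# Examples (the path is the one reported by sort in .command.err):
# check_tmp_dir /workspace/hrasrb/Repo/GraffiTE_temp/temp/
# env | grep -E '^(SINGULARITYENV|APPTAINERENV)_TMPDIR=' || echo "no container TMPDIR override set"
```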

SarahBailey1998 commented 4 months ago

Thanks! I tried those suggestions and still get an error, so I'm waiting on some help from our HPC team about the configuration problem.

Here's my nextflow.config:

manifest.defaultBranch = 'main'
singularity.enabled = true
singularity.autoMounts = true
singularity.runOptions = '--contain --bind /workspace/$USER/tmp/:/tmp'

profiles {
    standard {
        process.executor = 'local'
        process.container = '/workspace/hrasrb/Repo/graffite_latest.sif'
    }

    cluster {
        process.executor = 'slurm'
        process.container = '/workspace/hrasrb/Repo/graffite_latest.sif'
        process.scratch = '$SLURM_TMPDIR'
    }

    cloud {
        process.executor = 'aws'
        process.container = '/workspace/hrasrb/Repo/graffite_latest.sif'
    }

}

clemgoub commented 3 months ago

Hello @SarahBailey1998,

Was your HPC team able to solve the issue? Thanks!

Clément