DrosophilaGenomeEvolution / TrEMOLO

Transposable Elements MOvement detection using LOng reads
GNU General Public License v3.0
18 stars 5 forks source link

FIND_SV_ON_REF: ValueError: zero-size array to reduction operation minimum which has no identity #8

Closed cgroza closed 10 months ago

cgroza commented 1 year ago

Hi,

I have managed to complete the TSD step. Now I am on the FIND_SV_ON_REF step. I am seeing this error now:

rule FIND_SV_ON_REF:
    input: AKA-018_out/INPUT/GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.fna, AKA-018_out/OUTSIDER/TE_TOWARD_GENOME/PSEUDO_GENOME_TE_DB_ID.fasta, AKA-018_out/INPUT/MCTE.fast
a, AKA-018_out/OUTSIDER/TE_DETECTION/FILTER_BLAST_SEQUENCE_INDEL_vs_DBTE.csv, AKA-018_out/tmp_TrEMOLO_output_rule/rule_tmp_TE_TOWARD_GENOME_AKA-018_out
    output: AKA-018_out/tmp_TrEMOLO_output_rule/rule_tmp_FIND_SV_ON_REF_AKA-018_out, AKA-018_out/OUTSIDER/INSIDER_VR/pm_against_ref.sam, AKA-018_out/OUTSIDER/INSIDER_VR/pm_agai
nst_ref.sam.delta, AKA-018_out/OUTSIDER/INSIDER_VR/assemblytics_out.Assemblytics_structural_variants.bed, AKA-018_out/OUTSIDER/INSIDER_VR/assemblytics_out.Assemblytics_assembly
_stats.txt
    log: AKA-018_out/log/FIND_TE_ON_REF
    jobid: 4

/usr/local/lib/python3.8/dist-packages/numpy/core/fromnumeric.py:3440: RuntimeWarning: Mean of empty slice.
  return _methods._mean(a, axis=axis, dtype=dtype,
/usr/local/lib/python3.8/dist-packages/numpy/core/_methods.py:189: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
Traceback (most recent call last):
  File "/home/cgroza/TrEMOLO/lib/python/Assemblytics_uniq_anchor.py", line 440, in <module>
    main()
  File "/home/cgroza/TrEMOLO/lib/python/Assemblytics_uniq_anchor.py", line 437, in main
    args.func(args)
  File "/home/cgroza/TrEMOLO/lib/python/Assemblytics_uniq_anchor.py", line 227, in run
    f_stats_out.write( "Min: %s\n" % gig_meg(np.min(ref_lengths)))
  File "<__array_function__ internals>", line 5, in amin
  File "/usr/local/lib/python3.8/dist-packages/numpy/core/fromnumeric.py", line 2879, in amin
    return _wrapreduction(a, np.minimum, 'min', axis, None, out,
  File "/usr/local/lib/python3.8/dist-packages/numpy/core/fromnumeric.py", line 86, in _wrapreduction                                                                               return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
ValueError: zero-size array to reduction operation minimum which has no identity
[Fri Mar 24 02:26:20 2023]
Error in rule FIND_SV_ON_REF:
    jobid: 4
    output: AKA-018_out/tmp_TrEMOLO_output_rule/rule_tmp_FIND_SV_ON_REF_AKA-018_out, AKA-018_out/OUTSIDER/INSIDER_VR/pm_against_ref.sam, AKA-018_out/OUTSIDER/INSIDER_VR/pm_agai
nst_ref.sam.delta, AKA-018_out/OUTSIDER/INSIDER_VR/assemblytics_out.Assemblytics_structural_variants.bed, AKA-018_out/OUTSIDER/INSIDER_VR/assemblytics_out.Assemblytics_assembly
_stats.txt
    log: AKA-018_out/log/FIND_TE_ON_REF (check log file(s) for error message)
    shell:

log.tar.gz

It seems I am really close to completing the pipeline, this is the step where the annotation is moved to the reference genome? My thanks, Cristian

M-D75 commented 1 year ago

Hi,

The FIND_SV_ON_REF step (annotation of SV on the reference) does not work because in the previous step TE_TOWARD_GENOME (integration of TE on the assembly) there is a failure of integration on the assembled genome, no TE could be integrated.

In log file Snakefile_outsider.log

•••
 [SNK]--[Fri Mar 24 02:22:10 EDT 2023] PUT TE OUTSIDER ON GENOME... 
•••

[Fri Mar 24 02:22:10 EDT 2023] LOG TASK AKA-018_out/log/TE_TOWARD_GENOME.out, AKA-018_out/log/TE_TOWARD_GENOME.err
INTEGRATE TE DB_ID TO GENOME...
INTEGRATE TE IN READS TO GENOME...
CHECKING TE INTEGRATED...
TOTAL TE: 1933 ; TE INTEGRATE ON GENOME TE_DB_ID : 0 ;
TOTAL TE: 1933 ; TE INTEGRATE ON NEO GENOME: 0 ;
[Fri Mar 24 02:26:14 EDT 2023] LOG TASK AKA-018_out/log/FIND_TE_ON_REF.out, AKA-018_out/log/FIND_TE_ON_REF.err

Could you tell me if the files OUTSIDER/TE_TOWARD_GENOME/TRUE_POSITION_TE.fasta, OUTSIDER/TE_TOWARD_GENOME/TRUE_POSITION_TE_READS.fasta contain several sequences or are they empty?

Indeed it is the step of annotation of the TE on the reference, if that does not have you utility you can desactivate this step by changing putting the variable INTEGRATE_TE_TO_GENOME to False in the config.yaml file

Sorry, Mourdas

M-D75 commented 1 year ago

Hi,

I was able to identify the problem I assume that your assembly (GENOME parameter) fasta is in classic format i.e. :

>header_sequence
80 bp
next 80 bp
next 80 bp
>header_sequence2
....

For integration it was necessary to have this type of format

>header_sequence
ALL bp on one line
>header_sequence2
...

You can either reformat the genome or get the update available which also contains corrections on the TSD part and others.

Mourdas