DrosophilaGenomeEvolution / TrEMOLO

Transposable Elements MOvement detection using LOng reads
GNU General Public License v3.0
18 stars 5 forks source link

extract_region_reads_vcf.py: AttributeError: 'NoneType' object has no attribute 'group' #5

Closed cgroza closed 5 months ago

cgroza commented 1 year ago

Hello,

Great pipeline! I am trying TrEMOLO with drosophila genome assemblies and ONT long reads. The pipeline works fine until this point:

[Thu Mar 16 19:54:30 2023]
rule GET_READS_TE:
    input: COR-014_out/OUTSIDER/TE_DETECTION/FILTER_BLAST_SEQUENCE_INDEL_vs_DBTE.csv, COR-014_out/OUTSIDER/TE_DETECTION/BLAST_SEQUENCE_INDEL_vs_DBTE.bln, COR-014_out/OUTSIDER/VARIANT_CALLING/SV.vcf, COR-014_out/OUTSIDER/TE_DETECTION/MERGE_TE/MERGE_TE_ALL.bed, COR-014_out/INPUT/SRR9951091_1.fastq.gz, COR-014_out/INPUT/MCTE.fasta, COR-014_out/tmp_TrEMOLO_output_rule/rule_tmp_FREQUENCE_COR-014_out
    output: COR-014_out/OUTSIDER/ID_BEST_READ_TE.txt, COR-014_out/tmp_TrEMOLO_output_rule/rule_tmp_GET_READS_TE_COR-014_out
    log: COR-014_out/log/GET_READS_TE
    jobid: 11

Traceback (most recent call last):
  File "/home/cgroza/TrEMOLO/lib/python/extract_region_reads_vcf.py", line 164, in <module>
    type_v  = "<" + re.search("SVTYPE=([A-Z]+)", spl[7]).group(1) + ">"
AttributeError: 'NoneType' object has no attribute 'group'
[Thu Mar 16 19:54:31 2023]
Error in rule GET_READS_TE:
    jobid: 11
    output: COR-014_out/OUTSIDER/ID_BEST_READ_TE.txt, COR-014_out/tmp_TrEMOLO_output_rule/rule_tmp_GET_READS_TE_COR-014_out
    log: COR-014_out/log/GET_READS_TE (check log file(s) for error message)
    shell:

Seems to me the regex search fails but the extract_region_reds_vcf.py script still tries to access the result? Do you know what could be causing this error and how could we fix it?

This is the state of the pipeline working directory when it exists:

-bash-4.2$ tree .
.
├── DELETION_TE.bed -> INSIDER/TE_DETECTION/DELETION_TE.bed
├── DELETION_TE_ON_REF.bed -> INSIDER/TE_DETECTION/DELETION_TE_ON_REF.bed
├── ID_TE_SV_REPORT.txt
├── INPUT
│   ├── GCA_020142085.fa -> /mnt/GCA_020142085.fa
│   ├── GCA_020142085.fa.mmi
│   ├── GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.fna -> /mnt/GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.fna
│   ├── MCTE.fasta -> /mnt/MCTE.fasta
│   ├── MCTE.fasta.nhr
│   ├── MCTE.fasta.nin
│   ├── MCTE.fasta.nsq
│   └── SRR9951091_1.fastq.gz -> /mnt/SRR9951091_1.fastq.gz
├── INSIDER
│   ├── FREQ_INSIDER
│   │   └── DEPTH_TE_INSIDER.csv
│   ├── TE_DETECTION
│   │   ├── DELETION.bed
│   │   ├── DELETION.bln
│   │   ├── DELETION.csv
│   │   ├── DELETION_COMBINE_TE.csv
│   │   ├── DELETION_COUNT_TE.csv
│   │   ├── DELETION_SEQ.fasta
│   │   ├── DELETION_TE.bed
│   │   ├── DELETION_TE_ON_REF.bed
│   │   ├── INSERTION.bed
│   │   ├── INSERTION.bln
│   │   ├── INSERTION.csv
│   │   ├── INSERTION_COMBINE_TE.csv
│   │   ├── INSERTION_COUNT_TE.csv
│   │   ├── INSERTION_SEQ.fasta
│   │   ├── INSERTION_TE.bed
│   │   └── INSERTION_TE_ON_REF.bed
│   ├── VARIANT_CALLING
│   │   ├── assemblytics_out.Assemblytics.unique_length_filtered_l20000.delta
│   │   ├── assemblytics_out.Assemblytics_assembly_stats.txt
│   │   ├── assemblytics_out.Assemblytics_structural_variants.bed
│   │   ├── assemblytics_out.coords.csv
│   │   ├── assemblytics_out.coords.tab
│   │   ├── assemblytics_out.variants_between_alignments.bed
│   │   ├── assemblytics_out.variants_within_alignments.bed
│   │   ├── pm_against_ref.sam
│   │   └── pm_against_ref.sam.delta
│   └── instructions.txt
├── OUTSIDER
│   ├── FREQ_OPTIMIZED
│   │   ├── DEPTH_TE.csv
│   │   ├── MAPPING_POSTION_TE_MD.sorted.bam
│   │   └── MAPPING_POSTION_TE_MD.sorted.bam.bai
│   ├── ID_READS_TE
│   │   ├── reads_CM034921.1:sniffles.INS.1290:1104696-1104696.txt
│   │   ├── reads_CM034921.1:sniffles.INS.186:146442-146442.txt
│   │   ├── reads_CM034921.1:sniffles.INS.230:171201-171201.txt
│   │   ├── reads_CM034921.1:sniffles.INS.35:35121-35121.txt
│   │   └── reads_CM034921.1:sniffles.INS.792:656206-656206.txt
│   ├── MAPPING
│   │   ├── MAPPING_POSTION_TE.bam
│   │   ├── MAPPING_POSTION_TE.bam.bai
│   │   ├── SAMPLE_mapping_GENOME_MD.sorted.bam
│   │   ├── SAMPLE_mapping_GENOME_MD.sorted.bam.bai
│   │   └── stats.txt
│   ├── MAPPING_TO_REF
│   │   └── INSERTION_TE.bed
│   ├── READ_FASTQ_TE
│   ├── TE_DETECTION
│   │   ├── BLAST_SEQUENCE_INDEL_vs_DBTE.bln
│   │   ├── COMBINE_SV_SOFT.csv
│   │   ├── COMBINE_TE.csv
│   │   ├── DEPTH_TE.csv
│   │   ├── FILTER_BLAST_SEQUENCE_INDEL_vs_DBTE.csv
│   │   ├── FILTER_BLAST_SEQUENCE_INDEL_vs_DBTE_COUNT.csv
│   │   ├── MERGE_TE
│   │   │   ├── MERGE_TE_ALL.bed
│   │   │   ├── MERGE_TE_ALL_COUNT.csv
│   │   │   ├── tmp_ID_SOFT_commun.txt
│   │   │   ├── tmp_ID_TrEMOLO.txt
│   │   │   ├── tmp_ID_sniffles.txt
│   │   │   ├── tmp_POSITION_TE_OUTSIDER_CLUSTER_100.bed
│   │   │   ├── tmp_SOFT_CLUSTER_100.bed
│   │   │   ├── tmp_TE_SOFT_NOT_FOUND_IN_sniffles_and_TrEMOLO.bed
│   │   │   ├── tmp_TE_TrEMOLO_NOT_FOUND_IN_sniffles.bed
│   │   │   └── tmp_TrEMOLO_TE_OUTSIDER_CLUSTER_100.bed
│   │   ├── POSITION_START_TE.bed
│   │   └── TE_VR.bed
│   ├── TrEMOLO_SV_TE
│   │   ├── INS
│   │   │   ├── COMBINE_INS_TREMOLO.csv
│   │   │   ├── INS_TREMOLO.bed
│   │   │   ├── INS_TREMOLO.csv
│   │   │   ├── INS_TREMOLO_COUNT.csv
│   │   │   ├── SV_INS.bed
│   │   │   ├── SV_INS_CLUST.bed
│   │   │   ├── SV_INS_CLUST.bln
│   │   │   └── SV_INS_CLUST.fasta
│   │   └── SOFT
│   │       ├── COUNT_SOFT_TE.csv
│   │       ├── ID_ALL_DOUBLON.txt
│   │       ├── ID_NOT_KEEP.txt
│   │       ├── ID_SOFT_EXLUDE.txt
│   │       ├── SOFT_TE.bed
│   │       ├── SOFT_TE.csv
│   │       ├── SV_SOFT.bed
│   │       ├── SV_SOFT.bln
│   │       ├── SV_SOFT.csv
│   │       ├── SV_SOFT.fasta
│   │       ├── SV_SOFT.vcf
│   │       ├── SV_SOFT_COUNT.csv
│   │       ├── SV_SOFT_TE_KEEP.csv
│   │       ├── SV_SOFT_TE_KEEP_V2.csv
│   │       ├── TE_INFO.bed
│   │       └── TE_SOFT.bed
│   ├── VARIANT_CALLING
│   │   ├── SEQUENCE_INDEL.fasta
│   │   └── SV.vcf
│   └── instructions.txt
├── POSITION_TE_OUTSIDER.bed -> OUTSIDER/TE_DETECTION/MERGE_TE/MERGE_TE_ALL.bed
├── POS_TE_INSIDER_ON_REF.bed -> INSIDER/TE_DETECTION/INSERTION_TE_ON_REF.bed
├── SNAKE_USED
│   ├── Snakefile_insider.snk
│   └── Snakefile_outsider.snk
├── SOFT_TE.bed -> OUTSIDER/TrEMOLO_SV_TE/SOFT/SOFT_TE.bed
├── log
│   ├── DETECTION_TE.err
│   ├── DETECTION_TE.log
│   ├── FREQUENCE.err
│   ├── SOFT.err
│   ├── SV_INSIDER.err
│   ├── Snakefile_insider.err
│   ├── Snakefile_insider.log
│   ├── Snakefile_outsider.err
│   ├── Snakefile_outsider.log
│   ├── TE_INSIDER.err
│   ├── TE_INSIDER.out
│   ├── pm_contigs_against_ref.sam.log
│   ├── samtools.err
│   ├── sniffles.err
│   └── sniffles.out
├── params.log
├── params.yaml
├── resume_task.txt
├── tmp_TrEMOLO_output_rule
│   └── rule_tmp_FREQUENCE_COR-014_out
└── tmp_info.txt

20 directories, 122 files

My thanks Cristian

M-D75 commented 1 year ago

Hi,

Sorry for the problem. Indeed, the version of your VCF file seems to be version 4.1 however, it is true that the SVTYPE info may not exist on some lines the script TrEMOLO/lib/python/extract_region_reads_vcf.py did not take this into account unfortunately.

I modified line 164 of the script TrEMOLO/lib/python/extract_region_reads_vcf.py :

type_v  = "<" + re.search("SVTYPE=([A-Z]+)", spl[7]).group(1) + ">"

By this

type_v  = "<" + re.search("SVTYPE=([A-Z]+)", spl[7]).group(1) + ">" if re.search("SVTYPE=([A-Z]+)", spl[7]) else "None"

You can get the update.

This will result in VCF version 4.1 files not processing lines that do not contain the SVTYPE information.

I therefore recommend that you check the OUTSIDER/VARIANT_CALLING/SV.vcf file to see if there are any lines with this information.

For example by running this command

grep -E "SVTYPE=[A-Z]+" OUTSIDER/VARIANT_CALLING/SV.vcf -c

Thanks, Mourdas

cgroza commented 1 year ago

Hi,

Thank you for the fix. The error no longer happens.

However, now I am getting something different a bit later:

/usr/bin/bash: line 83: con8_G2: unbound variable
[Mon Mar 20 09:57:31 2023]
Error in rule FREQUENCE:
    jobid: 13
    output: RAL-091_out/tmp_TrEMOLO_output_rule/rule_tmp_FREQUENCE_RAL-091_out, RAL-091_out/OUTSIDER/FREQ_OPTIMIZED/DEPTH_TE.csv
    log: RAL-091_out/log/FREQUENCE (check log file(s) for error message)
    shell:

I tried to look for con8_G2 in the source code and could not find it. I looked into the TE database fasta I am using and indeed it is one of the repeats:

>con8_G2
AGTTTTTGAACCCTCTGTCGTAGAACACTACTATATCCAGAAATTTTTCGACTTACTTTAGCTCCGTTATTCGCATACCG
TTCACTGCGCCGCGAGACTTCGCCGCGCGCACTGAGCTCAGCCCGCGTGCCTAACGGGCACGCACCAATACACTCGAGCC
GGCCACGTGCAGTGGTTGGTAATACGACCAACTGTACCCAGCTAACCCCCCCCCCCAYWCGAACAATTACCCCTCATCAT
GGATTGGCAGGCCTGCCCCCGCACCAACAGGCCCTGCAAGAAGGCTCTCAGAACAAGGGAATCCAGTCCGAGCAGCGACT
CCAGCACCTCGCATTCAGAGCCCGGAGAGATCAAGCGTAAGCCTGCGCGCAAACCCAAAAAAGACGAGCTAGACGTCACG
CCCAGCACCAGCACAGCCTCGCGACGAAAGTTGACAAACAATCTGTTTGCCATTCTATCGAGCGAAGAGGATGATGATGA
...

What could be causing the BASH script in tremolo to evaluate a TE name as a variable?

My thanks, Cristian

M-D75 commented 1 year ago

There are variables that try to get the name of each TE found.

However, this is rather odd, as the FREQUENCE rule is normally performed before the GET_READS_TE rule which is where you got the first error related to the extract_region_reads_vcf.py script so it's possible that the first time you ran the pipeline this step was successfully passed.

This suggests that the first error you got may have had an impact on the previous rule which seems strange to me. This is also what your directory shows, where the FREQ_OPTIMIZED folder (for rule FREQUENCE) is well filled in contrary to the READ_FASTQ_TE folder (for rule GET_READS_TE)

Perhaps you could restart the whole pipeline from 0? i.e. by emptying your work_directory.

In order to know more I think I would need the log files in the log/ folder at least log/FREQUENCE.err and at best all log files.

my apologies, Mourdas

cgroza commented 1 year ago

I will restart all the pipeline to make sure and report back.

cgroza commented 1 year ago

Hi,

I restarted the pipelines, and some did move forward to TSD detection while some samples still do the same error in the FREQUENCE step.

/usr/bin/bash: line 83: UnFmclCluster039_RLX: unbound variable
[Wed Mar 22 05:11:14 2023]
Error in rule FREQUENCE:
jobid: 13

output: COR-018_out/tmp_TrEMOLO_output_rule/rule_tmp_FREQUENCE_COR-018_out, COR-018_out/OUTSIDER/FREQ_OPTIMIZED/DEPTH_TE.csv

log: COR-018_out/log/FREQUENCE (check log file(s) for error message)

shell:

I would like to share the log/FREQUENCEC.err, but this log file grew to 12GB in size. Not sure if this is normal but sounds excessive. But the tail part of it looks like this:

FREQ=5.084700 : NBD=59 : ID=sniffles.INS.184601
..
....
.
in.. ID_type=sniffles
in...
.....
RS=1 : NBD=71 : ID=sniffles.INS.184659
FREQ=1.408400 : NBD=71 : ID=sniffles.INS.184659
..
....
.
.....
RS=6 : NBD=6 : ID=sniffles.INS.184681
FREQ=100.000000 : NBD=6 : ID=sniffles.INS.184681
..
....
.
.....
RS=4 : NBD=4 : ID=sniffles.INS.184697
FREQ=100.000000 : NBD=4 : ID=sniffles.INS.184697
..
....
.
.....
RS=1 : NBD=6 : ID=sniffles.INS.184862
FREQ=16.666600 : NBD=6 : ID=sniffles.INS.184862
..
....
.
in.. ID_type=sniffles
in...
.....
RS=DBGP_R2-element : NBD=1 : ID=sniffles.INS.184884

Here is the rest of the log folder (attached). log.tar.gz

Cristian

M-D75 commented 1 year ago

OK thanks, I think I know what the problem is, I fixed it in the code, it is due to a problem of frequency calculation of some TE.

I propose you two alternatives, you can make the update git pull, or recovered the clone of the branch v2.2 it is a version or many bug were solved in particular this one on the other hand, there are also other modifications which were applied;

However, all the tests have not yet been carried out so I am not sure that this version is more stable than the one you tested.

for getting v2.2

git clone --branch v2.2 https://github.com/DrosophilaGenomeEvolution/TrEMOLO.git

I hope this helps.

Please feel free to report as many problems as possible.

Thanks, Mourdas