Closed cgroza closed 5 months ago
Hi,
Sorry for the problem. Indeed, your VCF file appears to be version 4.1. However, it is true that the SVTYPE info may not exist on some lines, and the script TrEMOLO/lib/python/extract_region_reads_vcf.py unfortunately did not take this into account.
I modified line 164 of the script TrEMOLO/lib/python/extract_region_reads_vcf.py:
type_v = "<" + re.search("SVTYPE=([A-Z]+)", spl[7]).group(1) + ">"
to this:
type_v = "<" + re.search("SVTYPE=([A-Z]+)", spl[7]).group(1) + ">" if re.search("SVTYPE=([A-Z]+)", spl[7]) else "None"
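The same guard can be written with a single search whose result is checked before use. This is only a minimal sketch: the helper name `sv_type_from_info` is hypothetical and not part of TrEMOLO, and it assumes the INFO column is passed in as a string, as `spl[7]` is in the line above.

```python
import re

# Same pattern as in the line above, compiled once.
SVTYPE_RE = re.compile(r"SVTYPE=([A-Z]+)")


def sv_type_from_info(info):
    """Return "<TYPE>" if the INFO field has an SVTYPE entry, else the string "None"."""
    match = SVTYPE_RE.search(info)
    if match is None:
        # re.search returns None when nothing matches; calling .group() on it
        # would raise AttributeError, so return the fallback value instead.
        return "None"
    return "<" + match.group(1) + ">"


print(sv_type_from_info("SVTYPE=INS;SVLEN=512"))  # <INS>
print(sv_type_from_info("SVLEN=512"))             # None
```

The fallback value is the string "None", as in the modified line, not the Python object `None`.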
You can pull the update.
With this change, lines of a version 4.1 VCF file that do not contain the SVTYPE information will not be processed.
I therefore recommend that you check the OUTSIDER/VARIANT_CALLING/SV.vcf file to see if there are any lines with this information.
For example, by running this command:
grep -E "SVTYPE=[A-Z]+" OUTSIDER/VARIANT_CALLING/SV.vcf -c
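For a count of the records that lack SVTYPE as well, a rough Python equivalent could look like the sketch below. It assumes a tab-separated VCF with INFO in the eighth column, and it skips header lines. The function name `count_svtype` and the two example records are made up for illustration.

```python
import re

SVTYPE_RE = re.compile(r"SVTYPE=([A-Z]+)")


def count_svtype(lines):
    """Count VCF data lines with and without SVTYPE in the INFO column."""
    with_type = 0
    without_type = 0
    for line in lines:
        # Skip header lines ("##..." and "#CHROM...") and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        info = fields[7] if len(fields) > 7 else ""
        if SVTYPE_RE.search(info):
            with_type += 1
        else:
            without_type += 1
    return with_type, without_type


# Made-up records, for illustration only.
example = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "2L\t100\tsniffles.INS.1\tN\t<INS>\t.\tPASS\tSVTYPE=INS;SVLEN=512\n",
    "2L\t200\tsniffles.INS.2\tN\t<INS>\t.\tPASS\tSVLEN=300\n",
]
print(count_svtype(example))  # (1, 1)
```

To run it on the real file, pass an open file handle instead of the list, e.g. `with open("OUTSIDER/VARIANT_CALLING/SV.vcf") as handle: print(count_svtype(handle))`.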
Thanks, Mourdas
Hi,
Thank you for the fix. The error no longer happens.
However, now I am getting something different a bit later:
/usr/bin/bash: line 83: con8_G2: unbound variable
[Mon Mar 20 09:57:31 2023]
Error in rule FREQUENCE:
jobid: 13
output: RAL-091_out/tmp_TrEMOLO_output_rule/rule_tmp_FREQUENCE_RAL-091_out, RAL-091_out/OUTSIDER/FREQ_OPTIMIZED/DEPTH_TE.csv
log: RAL-091_out/log/FREQUENCE (check log file(s) for error message)
shell:
I tried to look for con8_G2 in the source code and could not find it. I looked into the TE database fasta I am using and indeed it is one of the repeats:
>con8_G2
AGTTTTTGAACCCTCTGTCGTAGAACACTACTATATCCAGAAATTTTTCGACTTACTTTAGCTCCGTTATTCGCATACCG
TTCACTGCGCCGCGAGACTTCGCCGCGCGCACTGAGCTCAGCCCGCGTGCCTAACGGGCACGCACCAATACACTCGAGCC
GGCCACGTGCAGTGGTTGGTAATACGACCAACTGTACCCAGCTAACCCCCCCCCCCAYWCGAACAATTACCCCTCATCAT
GGATTGGCAGGCCTGCCCCCGCACCAACAGGCCCTGCAAGAAGGCTCTCAGAACAAGGGAATCCAGTCCGAGCAGCGACT
CCAGCACCTCGCATTCAGAGCCCGGAGAGATCAAGCGTAAGCCTGCGCGCAAACCCAAAAAAGACGAGCTAGACGTCACG
CCCAGCACCAGCACAGCCTCGCGACGAAAGTTGACAAACAATCTGTTTGCCATTCTATCGAGCGAAGAGGATGATGATGA
...
What could be causing the BASH script in tremolo to evaluate a TE name as a variable?
My thanks, Cristian
There are variables that try to get the name of each TE found.
However, this is rather odd, as the FREQUENCE rule normally runs before the GET_READS_TE rule, which is where you got the first error related to the extract_region_reads_vcf.py script. So it is possible that this step completed successfully the first time you ran the pipeline.
This suggests that the first error you got may have affected the earlier rule, which seems strange to me. Your directory shows the same thing: the FREQ_OPTIMIZED folder (for rule FREQUENCE) is populated, unlike the READ_FASTQ_TE folder (for rule GET_READS_TE).
Perhaps you could restart the whole pipeline from scratch, i.e. by emptying your work_directory?
To find out more, I think I would need the log files in the log/ folder: at least log/FREQUENCE.err, and ideally all log files.
My apologies, Mourdas
I will restart the whole pipeline to make sure and report back.
Hi,
I restarted the pipelines. Some samples moved forward to TSD detection, while others still hit the same error in the FREQUENCE step:
/usr/bin/bash: line 83: UnFmclCluster039_RLX: unbound variable
[Wed Mar 22 05:11:14 2023]
Error in rule FREQUENCE:
jobid: 13
output: COR-018_out/tmp_TrEMOLO_output_rule/rule_tmp_FREQUENCE_COR-018_out, COR-018_out/OUTSIDER/FREQ_OPTIMIZED/DEPTH_TE.csv
log: COR-018_out/log/FREQUENCE (check log file(s) for error message)
shell:
I would like to share log/FREQUENCE.err, but this log file grew to 12 GB in size. I am not sure if this is normal, but it sounds excessive.
The tail of it looks like this:
FREQ=5.084700 : NBD=59 : ID=sniffles.INS.184601
..
....
.
in.. ID_type=sniffles
in...
.....
RS=1 : NBD=71 : ID=sniffles.INS.184659
FREQ=1.408400 : NBD=71 : ID=sniffles.INS.184659
..
....
.
.....
RS=6 : NBD=6 : ID=sniffles.INS.184681
FREQ=100.000000 : NBD=6 : ID=sniffles.INS.184681
..
....
.
.....
RS=4 : NBD=4 : ID=sniffles.INS.184697
FREQ=100.000000 : NBD=4 : ID=sniffles.INS.184697
..
....
.
.....
RS=1 : NBD=6 : ID=sniffles.INS.184862
FREQ=16.666600 : NBD=6 : ID=sniffles.INS.184862
..
....
.
in.. ID_type=sniffles
in...
.....
RS=DBGP_R2-element : NBD=1 : ID=sniffles.INS.184884
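In the excerpt above, RS is a number on every line except the last, where it holds a TE name. A minimal sketch for locating such lines in a very large log, reading it line by line rather than loading it into memory, could look like this. It assumes that RS is expected to be an integer, which is inferred only from the lines shown here, and `non_numeric_rs` is a hypothetical helper, not part of TrEMOLO.

```python
import re

# Matches lines of the form "RS=<value> : NBD=<int> : ID=<id>" seen in the log.
LOG_RE = re.compile(r"^RS=(\S+) : NBD=(\d+) : ID=(\S+)")


def non_numeric_rs(lines):
    """Yield (rs_value, insertion_id) for RS= lines whose value is not an integer."""
    for line in lines:
        match = LOG_RE.match(line.strip())
        if match and not match.group(1).isdigit():
            yield match.group(1), match.group(3)


# Lines copied from the log tail above.
excerpt = [
    "RS=1 : NBD=6 : ID=sniffles.INS.184862",
    "FREQ=16.666600 : NBD=6 : ID=sniffles.INS.184862",
    "RS=DBGP_R2-element : NBD=1 : ID=sniffles.INS.184884",
]
print(list(non_numeric_rs(excerpt)))
# [('DBGP_R2-element', 'sniffles.INS.184884')]
```

Passing an open file handle instead of the list streams the log one line at a time, which matters for a file of this size.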
Here is the rest of the log folder (attached).
log.tar.gz
Cristian
OK, thanks. I think I know what the problem is, and I have fixed it in the code: it is due to a problem in the frequency calculation for some TEs.
I propose two alternatives: you can get the update with git pull, or you can clone the v2.2 branch. That version fixes many bugs, including this one. On the other hand, other modifications were also applied to it.
However, not all the tests have been carried out yet, so I am not sure that this version is more stable than the one you tested.
To get v2.2:
git clone --branch v2.2 https://github.com/DrosophilaGenomeEvolution/TrEMOLO.git
I hope this helps.
Please feel free to report as many problems as possible.
Thanks, Mourdas
Hello,
Great pipeline! I am trying TrEMOLO with Drosophila genome assemblies and ONT long reads. The pipeline works fine until this point:
It seems to me the regex search fails but the extract_region_reads_vcf.py script still tries to access the result? Do you know what could be causing this error and how we could fix it?
This is the state of the pipeline working directory when it exits:
My thanks, Cristian