Open MrBleem opened 1 month ago
Hi,
Yes, indeed, TrEMOLO is supposed to work on the human genome.
Thank you for reporting this issue. An update resolving this problem is now available. You can retrieve it by running git pull or by recloning the repository.
Please don't hesitate to let us know if it worked or if you encounter any other issues.
Best, M-D
Hmm, I have recloned the repository, but the issue is still there. It has been running for 80 minutes.
If I run this command on its own, python3 ./TrEMOLO/lib/python/format_files/fasta_to_fasta.py GRCh38_no_alt.fa > test.fa, it takes just a few minutes.
And this is my config.yaml
DATA:
    REFERENCE: "./GRCh38_no_alt.fa" #reference genome (fasta file)
    GENOME: "./GRCh38_no_alt.fa" #genome (fasta file)
    SAMPLE: "./test_ont.50X.fa" #long reads (a fastq file)
    WORK_DIRECTORY: "./test_ont" #name of output directory
    TE_DB: "./GRCh38.transposon.fa" #Database of TE (a fasta file)
CHOICE:
    PIPELINE:
        OUTSIDER_VARIANT: True
        INSIDER_VARIANT: False
        REPORT: False
    OUTSIDER_VARIANT:
        CALL_SV: "sniffles" #Possible values (sniffles, svim)
        INTEGRATE_TE_TO_GENOME: True
        CLIPPED_READS: False
    INSIDER_VARIANT:
        DETECT_ALL_TE: True
    INTERMEDIATE_FILE: True
PARAMS:
    THREADS: 20
    OUTSIDER_VARIANT:
        MINIMAP2:
            PRESET_OPTION: 'map-ont' # -x minimap2 preset option is map-pb by default (map-pb, map-ont etc)
            OPTION: ''
        SAMTOOLS_VIEW:
            PRESET_OPTION: ''
        SAMTOOLS_SORT:
            PRESET_OPTION: ''
        SAMTOOLS_CALLMD:
            PRESET_OPTION: ''
        TSD:
            SIZE_FLANK: 10
        TE_DETECTION:
            CHROM_KEEP: "." #regular expression for chromosomes; example for Drosophila "2L,2R,3R,3L,X" ; put "." to keep all chromosomes
            GET_SEQ_REPORT_OPTION: "-m 30" #options for get_seq_vcf.py, the script that retrieves sequences from the vcf
        PARS_BLN_OPTION: "--min-size-percent 80 --min-pident 80 -k 'INS|DEL' "
    INSIDER_VARIANT:
        PARS_BLN_OPTION: "--min-size-percent 80 --min-pident 80"
        MINIMAP2:
            PRESET_OPTION: 'asm5' # minimap2 preset option is asm5 by default (asm5, asm10, asm20 etc)
            OPTION: '--cs'
Is there any problem?
Hmm 🤔
Do you have the same issue with the test dataset?
Are you sure the pipeline is blocking at that point?
Could I see the log files (log/Snakefile_outsider.err, log/Snakefile_outsider.log)?
Also, I noticed that in the SAMPLE variable, you've used a FASTA file (test_ont.50X.fa) instead of a FASTQ file. This might cause an issue, though probably not at this step (I think).
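A quick way to confirm which format the SAMPLE file really has is to look at its first non-empty line: FASTA records start with ">" and FASTQ records start with "@". The snippet below is only a minimal sketch, not part of TrEMOLO; the function name sniff_read_format is hypothetical, and it assumes an uncompressed text file.

```python
def sniff_read_format(path):
    """Guess FASTA vs FASTQ from the first non-empty line of an uncompressed file."""
    with open(path, "r") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip leading blank lines
            if line.startswith(">"):
                return "fasta"
            if line.startswith("@"):
                return "fastq"
            return "unknown"
    return "unknown"  # empty file
```

Calling sniff_read_format("./test_ont.50X.fa") on the reads file from the config would tell you whether the extension matches the content.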
I also see that you are not running the INSIDER part. Therefore, you may not need the two options INTEGRATE_TE_TO_GENOME and DETECT_ALL_TE, which you can set to False. By doing this, the part of the pipeline that is blocking will not be executed. This might solve your issue.
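For reference, the change suggested above amounts to flipping two keys in config.yaml. The sketch below uses a hypothetical helper (disable_options is not part of TrEMOLO) that does this with a plain text substitution, so the comments in the file are preserved. It assumes each key sits on its own line in the form KEY: True.

```python
import re

def disable_options(config_text, keys):
    """Return config_text with `KEY: True` rewritten to `KEY: False` for each key."""
    for key in keys:
        # Match the key at the start of a line (after indentation), followed by True.
        pattern = re.compile(
            r"^([ \t]*" + re.escape(key) + r":[ \t]*)True\b", re.MULTILINE
        )
        config_text = pattern.sub(r"\g<1>False", config_text)
    return config_text

# Small illustrative fragment, not the full configuration file.
example = """CHOICE:
    PIPELINE:
        OUTSIDER_VARIANT: True
    OUTSIDER_VARIANT:
        INTEGRATE_TE_TO_GENOME: True
    INSIDER_VARIANT:
        DETECT_ALL_TE: True
"""
print(disable_options(example, ["INTEGRATE_TE_TO_GENOME", "DETECT_ALL_TE"]))
```

Only the two named keys are changed; other entries that happen to be True, such as the PIPELINE switches, are left alone.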
It works on the fly genome and on the test dataset. I only get this issue with the human genome.
•••
[SNK]--[Wed Sep 25 09:30:40 AM EDT 2024] PUT TE OUTSIDER ON GENOME...
•••
[Wed Sep 25 09:30:40 AM EDT 2024] LOG TASK test_ont/log/TE_TOWARD_GENOME.out, test_ont/log/TE_TOWARD_GENOME.err REFORMAT FASTA GENOME FOR TE INTEGRATION...
- log/Snakefile_outsider.err:
/data/tusers/boxu/Z_Z/TrEMOLO/Snakefile:190: SyntaxWarning: invalid escape sequence '\/'
/data/tusers/boxu/Z_Z/TrEMOLO/Snakefile:1119: SyntaxWarning: invalid escape sequence '\/'
/data/tusers/boxu/Z_Z/TrEMOLO/Snakefile:2090: SyntaxWarning: invalid escape sequence '-'
  log:
/data/tusers/boxu/Z_Z/TrEMOLO/Snakefile:2391: SyntaxWarning: invalid escape sequence '-'
  awk 'NR>1 {{print $2":"$1":"$5}}' {output.all_te} | awk -F ":" 'OFS="\t"{{print $1, $3, ($3>=$4 ? $3+1 : $4), $10" | "$5, $11, $9}}' | bedtools sort > {params.work_directory}/OUTSIDER/TE_DETECTION/POSITION_START_TE.bed
/data/tusers/boxu/Z_Z/TrEMOLO/Snakefile:2824: SyntaxWarning: invalid escape sequence '.'
/data/tusers/boxu/Z_Z/TrEMOLO/Snakefile:3059: SyntaxWarning: invalid escape sequence '.'
/data/tusers/boxu/Z_Z/TrEMOLO/Snakefile:5072: SyntaxWarning: invalid escape sequence '-'
/data/tusers/boxu/Z_Z/TrEMOLO/Snakefile:5117: SyntaxWarning: invalid escape sequence '-'
Assuming unrestricted shared filesystem usage.
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 32
Rules claiming more threads will be scaled down.
Job stats:
job                 count
----------------  -------
REPORT                  1
TE_TOWARD_GENOME        1
total                   2
Select jobs to execute...
Execute 1 jobs...
[Wed Sep 25 09:30:40 2024]
localrule TE_TOWARD_GENOME:
    input: test_ont/tmp_TrEMOLO_output_rule/rule_tmp_TSD_OUTSIDER_simulation_germ_ont, test_ont/INPUT/GRCh38.transposon.fa, test_ont/INPUT/GRCh38_no_alt.fa, test_ont/OUTSIDER/TE_DETECTION/FILTER_BLAST_SEQUENCE_INDEL_vs_DBTE.csv, test_ont/OUTSIDER/VARIANT_CALLING/SEQUENCE_INDEL.fasta
    output: test_ont/OUTSIDER/TE_TOWARD_GENOME/PSEUDO_GENOME_TE_DB_ID.fasta, test_ont/OUTSIDER/TE_TOWARD_GENOME/TRUE_POSITION_TE_PSEUDO.bed, test_ont/tmp_TrEMOLO_output_rule/rule_tmp_TE_TOWARD_GENOME_simulation_germ_ont
    log: test_ont/log/TE_TOWARD_GENOME
    jobid: 1
    reason: Missing output files: test_ont/tmp_TrEMOLO_output_rule/rule_tmp_TE_TOWARD_GENOME_simulation_germ_ont
    resources: tmpdir=/tmp
I have now set INTEGRATE_TE_TO_GENOME and DETECT_ALL_TE to False and tested it.
It works! Thank you very much! 😄
Best, Bo
😁 Great ! Thanks for reporting the problem, we'll try to find the source of the bug.
Best, M-D
Hi! Thanks for your work.
Is TrEMOLO suitable for application to the human genome?
I'm trying to detect insertions in the human genome with TrEMOLO, but the program always gets stuck while processing the genome FASTA file, at this step:
python3 ./TrEMOLO/lib/python/format_files/fasta_to_fasta.py $out_path/TrEMOLO/test/INPUT/GRCh38_no_alt.fa