DrosophilaGenomeEvolution / TrEMOLO

Transposable Elements MOvement detection using LOng reads
GNU General Public License v3.0

Apply TrEMOLO to human #24

Open MrBleem opened 1 month ago

MrBleem commented 1 month ago

Hi! Thanks for your work.

Is TrEMOLO suitable for application to the human genome?

I'm trying to detect insertions in the human genome with TrEMOLO, but the program always stalls while processing the genome.fasta:

python3 ./TrEMOLO/lib/python/format_files/fasta_to_fasta.py $out_path/TrEMOLO/test/INPUT/GRCh38_no_alt.fa

M-D75 commented 1 month ago

Hi,

Yes, indeed, TrEMOLO is supposed to work on the human genome. Thank you for reporting this issue. An update resolving this problem is now available. You can retrieve it by running git pull or by recloning the repository.

Please don't hesitate to let us know if it worked or if you encounter any other issues.

Best, M-D

MrBleem commented 1 month ago

Hmm, I have recloned the repository, but the issue is still there. It has been running for 80 minutes. If I run this command on its own, python3 ./TrEMOLO/lib/python/format_files/fasta_to_fasta.py GRCh38_no_alt.fa > test.fa, it takes just a few minutes.

And this is my config.yaml:

DATA: 
    REFERENCE:       "./GRCh38_no_alt.fa"   #reference genome (fasta file)
    GENOME:          "./GRCh38_no_alt.fa"   #genome (fasta file)
    SAMPLE:          "./test_ont.50X.fa"    #long reads (a fastq file)
    WORK_DIRECTORY:  "./test_ont"      #name of output directory
    TE_DB:           "./GRCh38.transposon.fa"   #Database of TE (a fasta file)

CHOICE:
    PIPELINE:
        OUTSIDER_VARIANT: True
        INSIDER_VARIANT: False
        REPORT: False
    OUTSIDER_VARIANT:
        CALL_SV: "sniffles" #Possible values (sniffles, svim)
        INTEGRATE_TE_TO_GENOME: True
        CLIPPED_READS: False
    INSIDER_VARIANT:
        DETECT_ALL_TE: True
    INTERMEDIATE_FILE: True

PARAMS:
    THREADS: 20
    OUTSIDER_VARIANT:
        MINIMAP2:
            PRESET_OPTION: 'map-ont' # -x minimap2 preset option is map-pb by default (map-pb, map-ont etc)
            OPTION: ''
        SAMTOOLS_VIEW:
            PRESET_OPTION: ''
        SAMTOOLS_SORT:
            PRESET_OPTION: ''
        SAMTOOLS_CALLMD:
            PRESET_OPTION: ''
        TSD:
            SIZE_FLANK: 10
        TE_DETECTION:
            CHROM_KEEP: "."  #regular expression of chromosomes; example for Drosophila "2L,2R,3R,3L,X"; put "." to keep all chromosomes
            GET_SEQ_REPORT_OPTION: "-m 30" #options for get_seq_vcf.py, the script that retrieves the sequences from the vcf
        PARS_BLN_OPTION: "--min-size-percent 80 --min-pident 80 -k 'INS|DEL' "
    INSIDER_VARIANT:
        PARS_BLN_OPTION: "--min-size-percent 80 --min-pident 80"
        MINIMAP2:
            PRESET_OPTION: 'asm5' # minimap2 preset option is asm5 by default (asm5, asm10, asm20 etc)
            OPTION: '--cs'

Is there any problem?

M-D75 commented 1 month ago

Hmm 🤔

Do you have the same issue with the test dataset? Are you sure the pipeline is blocking at that point? Could I see the log files (log/Snakefile_outsider.err, log/Snakefile_outsider.log)? Also, I noticed that for the SAMPLE variable you've used a FASTA file (test_ont.50X.fa) instead of a FASTQ file. This might cause an issue, though probably not at this step (I think).
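A quick way to double-check the FASTA/FASTQ point before launching the pipeline is to look at the first record of the reads file. The sketch below is not part of TrEMOLO; `sniff_read_format` is a hypothetical helper, and it assumes that FASTQ records start with `@`, FASTA records start with `>`, and gzip-compressed files carry a `.gz` suffix.

```python
import gzip
from pathlib import Path


def sniff_read_format(path):
    """Return 'fastq', 'fasta', or 'unknown' based on the first non-empty line."""
    path = Path(path)
    # Assumption: compressed input is recognised by its .gz suffix only.
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt") as handle:
        for line in handle:
            if not line.strip():
                continue  # skip leading blank lines
            if line.startswith("@"):
                return "fastq"
            if line.startswith(">"):
                return "fasta"
            return "unknown"
    return "unknown"  # empty file
```

Calling `sniff_read_format("./test_ont.50X.fa")` on the SAMPLE file from the config above would show whether it really holds FASTA records, whatever its extension says.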

I also see that you are not running the INSIDER part. Therefore, you may not need the two options: INTEGRATE_TE_TO_GENOME and DETECT_ALL_TE, which you can set to False. By doing this, the part of the pipeline that is blocking will not be executed. This might solve your issue.
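The suggestion above can be expressed as a small check on the parsed config. This is only a sketch: `unneeded_insider_options` is a hypothetical helper, not a TrEMOLO function, and the rule it encodes (these two options are only needed when the INSIDER part runs) is taken from the comment above, not from the pipeline's source. The config is given as a plain dict mirroring config.yaml; loading the YAML file is left out.

```python
def unneeded_insider_options(config):
    """List CHOICE options left at True although the INSIDER part is disabled."""
    choice = config.get("CHOICE", {})
    if choice.get("PIPELINE", {}).get("INSIDER_VARIANT", False):
        return []  # INSIDER part is enabled, nothing to flag
    flagged = []
    if choice.get("OUTSIDER_VARIANT", {}).get("INTEGRATE_TE_TO_GENOME", False):
        flagged.append("CHOICE.OUTSIDER_VARIANT.INTEGRATE_TE_TO_GENOME")
    if choice.get("INSIDER_VARIANT", {}).get("DETECT_ALL_TE", False):
        flagged.append("CHOICE.INSIDER_VARIANT.DETECT_ALL_TE")
    return flagged


# The CHOICE section of the config.yaml posted earlier in this thread.
config = {
    "CHOICE": {
        "PIPELINE": {"OUTSIDER_VARIANT": True, "INSIDER_VARIANT": False, "REPORT": False},
        "OUTSIDER_VARIANT": {
            "CALL_SV": "sniffles",
            "INTEGRATE_TE_TO_GENOME": True,
            "CLIPPED_READS": False,
        },
        "INSIDER_VARIANT": {"DETECT_ALL_TE": True},
        "INTERMEDIATE_FILE": True,
    }
}

# Prints the options that could be set to False for an OUTSIDER-only run.
print(unneeded_insider_options(config))
```

Setting every option the helper lists to False in config.yaml corresponds to the change proposed in this comment.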

MrBleem commented 1 month ago

It runs fine on the fly genome and on the test dataset. I only get this issue with the human genome.

[Wed Sep 25 09:30:40 AM EDT 2024] LOG TASK test_ont/log/TE_TOWARD_GENOME.out, test_ont/log/TE_TOWARD_GENOME.err REFORMAT FASTA GENOME FOR TE INTEGRATION...


- log/Snakefile_outsider.err:

/data/tusers/boxu/Z_Z/TrEMOLO/Snakefile:190: SyntaxWarning: invalid escape sequence '\/'
/data/tusers/boxu/Z_Z/TrEMOLO/Snakefile:1119: SyntaxWarning: invalid escape sequence '\/'
FOR DETECTION ON ASM
/data/tusers/boxu/Z_Z/TrEMOLO/Snakefile:2090: SyntaxWarning: invalid escape sequence '-'
log:
/data/tusers/boxu/Z_Z/TrEMOLO/Snakefile:2391: SyntaxWarning: invalid escape sequence '-'
awk 'NR>1 {{print $2":"$1":"$5}}' {output.all_te} awk -F ":" 'OFS="\t"{{print $1, $3, ($3>=$4 ? $3+1 : $4), $10" "$5, $11, $9}}' bedtools sort > {params.work_directory}/OUTSIDER/TE_DETECTION/POSITION_START_TE.bed
/data/tusers/boxu/Z_Z/TrEMOLO/Snakefile:2824: SyntaxWarning: invalid escape sequence '.'
/data/tusers/boxu/Z_Z/TrEMOLO/Snakefile:3059: SyntaxWarning: invalid escape sequence '.'
/data/tusers/boxu/Z_Z/TrEMOLO/Snakefile:5072: SyntaxWarning: invalid escape sequence '-'
/data/tusers/boxu/Z_Z/TrEMOLO/Snakefile:5117: SyntaxWarning: invalid escape sequence '-'
Assuming unrestricted shared filesystem usage.
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 32
Rules claiming more threads will be scaled down.
Job stats:
job                 count
REPORT              1
TE_TOWARD_GENOME    1
total               2

Select jobs to execute...
Execute 1 jobs...

[Wed Sep 25 09:30:40 2024]
localrule TE_TOWARD_GENOME:
    input: test_ont/tmp_TrEMOLO_output_rule/rule_tmp_TSD_OUTSIDER_simulation_germ_ont, test_ont/INPUT/GRCh38.transposon.fa, test_ont/INPUT/GRCh38_no_alt.fa, test_ont/OUTSIDER/TE_DETECTION/FILTER_BLAST_SEQUENCE_INDEL_vs_DBTE.csv, test_ont/OUTSIDER/VARIANT_CALLING/SEQUENCE_INDEL.fasta
    output: test_ont/OUTSIDER/TE_TOWARD_GENOME/PSEUDO_GENOME_TE_DB_ID.fasta, test_ont/OUTSIDER/TE_TOWARD_GENOME/TRUE_POSITION_TE_PSEUDO.bed, test_ont/tmp_TrEMOLO_output_rule/rule_tmp_TE_TOWARD_GENOME_simulation_germ_ont
    log: test_ont/log/TE_TOWARD_GENOME
    jobid: 1
    reason: Missing output files: test_ont/tmp_TrEMOLO_output_rule/rule_tmp_TE_TOWARD_GENOME_simulation_germ_ont
    resources: tmpdir=/tmp



I have now set INTEGRATE_TE_TO_GENOME and DETECT_ALL_TE to False and am testing it.

MrBleem commented 1 month ago

It works! Thank you very much! 😄

Best, Bo

M-D75 commented 1 month ago

😁 Great! Thanks for reporting the problem, we'll try to find the source of the bug.

Best, M-D