DrosophilaGenomeEvolution / TrEMOLO

Transposable Elements MOvement detection using LOng reads
GNU General Public License v3.0

Tremolo not finishing in 48+ hours #20

Open lilypeck opened 6 months ago

lilypeck commented 6 months ago

Hello

Thank you for this great tool.

I am trying to call TE insertions on a tree genome which is ~ 830 Mb.

My config.yaml file is as follows:

# all paths can be relative or absolute
DATA:
    REFERENCE:       "/u/home/l/ldpeck/genome_resources/GCF_001633185.2_ValleyOak3.2_genomic.fna"   #reference genome (fasta file) only if INSIDER_VARIANT = True [optional]
    GENOME:          "/u/project/vlsork/ldpeck/longreads/flye/ragtag/medaka_ass/barcode03.ragtag_out/ragtag.scaffold.fasta"  #genome (fasta file) [required]
    SAMPLE:          "/u/project/vlsork/ldpeck/longreads/fastq/barcode03_ALLpass.fastq"       #long reads (a fastq[.gz] file) only if OUTSIDER_VARIANT = True [optional]
    WORK_DIRECTORY:  "TrEMOLO_OUTPUT_barcode03"                  #name of output directory [required or empty]
    TE_DB:           "/u/home/l/ldpeck/genome_resources/Qlobata.v3.0.RepeatModeler-open-1.0.8.consensi.fa.classified"      #Database of TE (a fasta file) [required]

CHOICE:
    PIPELINE:
        OUTSIDER_VARIANT: True  # TE not assembled (out of genome)
        INSIDER_VARIANT: True   # TE assembled (in genome)
        REPORT: True            # for getting report.html with graph
    OUTSIDER_VARIANT:
        CALL_SV: "sniffles" # possibility (sniffles, svim)
        INTEGRATE_TE_TO_GENOME: True # (True, False) Re-build the assembly with outsiders integrated in
        CLIPPED_READS: False # (True, False) Processing of clipped reads (SOFT, HARD)
    INSIDER_VARIANT:
        DETECT_ALL_TE: False    # detect ALL TEs on the genome assembly (parameter GENOME), not only new insertions. Warning! it may take several hours on big genomes
    INTERMEDIATE_FILE: True     # to keep the intermediate analysis files to process them.

PARAMS:
    THREADS: 8 #number of threads for some task
    OUTSIDER_VARIANT:
        MINIMAP2:
            PRESET_OPTION: 'map-ont' # minimap2 preset option is map-ont by default (map-pb, map-ont etc)
            OPTION: '' # more option of minimap2
        SAMTOOLS_VIEW:
            PRESET_OPTION: ''
        SAMTOOLS_SORT:
            PRESET_OPTION: ''
        SAMTOOLS_CALLMD:
            PRESET_OPTION: ''
        TSD:
            SIZE_FLANK: 20  # flanking sequence size to calculate TSD put value >= 4
        TE_DETECTION:
            CHROM_KEEP: "." # regular expression of chromosomes; example for Drosophila "[23][RL],4,X"; put "." to keep all chromosomes
            GET_SEQ_REPORT_OPTION: "-m 30" # options for get_seq_vcf.py, the script that retrieves sequences from the VCF
        PARS_BLN_OPTION: "--min-size-percent 80 --min-pident 80 -k 'INS|DEL'" # options for TrEMOLO/pipeline/lib/python/parse_blast_main.py. Warning: do not use the -c option
    INSIDER_VARIANT:
        PARS_BLN_OPTION: "--min-size-percent 80 --min-pident 80"
        MINIMAP2:
            PRESET_OPTION: 'asm5' # minimap2 preset option is asm5 by default (asm5, asm10, asm20 etc)
            OPTION: '--cs'

I run this file with a job script as follows:

apptainer exec TrEMOLO.simg snakemake --snakefile TrEMOLO/run.snk --configfile scripts/barcode03_ragtag.yaml

TrEMOLO seems to be running okay, but it doesn't finish after 48 hours. When I re-run the script, first with --unlock and then with --rerun-incomplete, it says that TrEMOLO_OUTPUT_barcode03/OUTSIDER/VARIANT_CALLING/SV.vcf seems to be incomplete (see tremolo-run.sh.o3303866). How do I check whether this file is complete? It has 171991 lines. The only way I can restart the script is to delete this file and start again, in which case it again doesn't finish within 48 hours (see tremolo-run.sh.o3313962).
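On the question of whether SV.vcf is complete: Snakemake's "incomplete" flag reflects that the job writing the file did not finish, not an inspection of the file itself, so a line count alone cannot settle it. A minimal sketch of a sanity check is shown below, assuming a plain-text, tab-separated VCF. The function name `vcf_looks_complete` is hypothetical and not part of TrEMOLO; the check can only detect an obviously truncated file, it cannot prove that the SV caller processed all reads.

```python
from pathlib import Path


def vcf_looks_complete(path):
    """Heuristic check that a VCF file is not obviously truncated.

    Returns (ok, reason). A True result only means that no sign of
    truncation was found, not that the variant caller finished.
    """
    data = Path(path).read_bytes()
    if not data:
        return False, "file is empty"
    # A writer that was killed mid-line usually leaves no final newline.
    if not data.endswith(b"\n"):
        return False, "last line has no trailing newline, so it was probably cut off"

    lines = data.decode("utf-8", errors="replace").splitlines()
    header = next((line for line in lines if line.startswith("#CHROM")), None)
    if header is None:
        return False, "no #CHROM header line found"
    n_cols = len(header.split("\t"))

    records = [line for line in lines if line and not line.startswith("#")]
    if not records:
        return False, "header only, no variant records"
    # Every record should have as many tab-separated columns as the header.
    for number, record in enumerate(records, start=1):
        if len(record.split("\t")) != n_cols:
            return False, f"record {number} has an unexpected number of columns"
    return True, f"{len(records)} records, all with {n_cols} columns"
```

Calling `vcf_looks_complete("TrEMOLO_OUTPUT_barcode03/OUTSIDER/VARIANT_CALLING/SV.vcf")` returns a flag and a short reason. A file that passes may still be missing variants from chromosomes that were never reached.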

Is it normal for it to take so long to run? I am mostly interested in TE insertions, so alternatively, is it possible to switch off SV calling in case this speeds it up?

Thank you!

Lily

tremolo-run.sh.o3278150.txt tremolo-run.sh.o3313962.txt

M-D75 commented 6 months ago

Hello,

Thank you for your feedback.

48 hours is indeed abnormally long, especially for Sniffles. However, I just realized that Sniffles was always running with 3 threads. An update is available that allows you to choose the number of threads by increasing the THREADS parameter in the config.yaml file.

To get the new update:

git pull
#or
git clone https://github.com/DrosophilaGenomeEvolution/TrEMOLO.git

However, "171991 lines" is very few. I think I have encountered this problem before, and I believe it was due to one of the versions of Singularity I had used. Changing the version solved the problem, but I am not too sure. I will try to identify the issue.

Best regards. M-D

M-D75 commented 6 months ago

Did you encounter the same problem with the test datasets?

M-D

lilypeck commented 5 months ago

Hi M-D

Sorry for my slow reply. The test datasets ran fine, I did not encounter the same problem. See attached .log file.

I am running tremolo through singularity, is it possible to update the .simg image?

Thank you

Lily

tremolo-test.sh.o3424412.txt

lilypeck commented 5 months ago

Hi @M-D75 Please do let me know if you identified the issue, or if the .simg image has also been updated? Thanks Lily

M-D75 commented 5 months ago

I am really really sorry for the delayed response.

On my end, the issue seems to be potentially hardware-related, but I am not sure. I ran the same analysis on different compute nodes of a cluster, and the problem occurs on a few nodes generally identified as having old hardware. Do you run your analyses on a compute cluster? Have you tried running them on another node?

Again, I apologize for the delayed response. Here is what I can do: I can modify the Singularity container to include a new version of Sniffles, or we could provide an option to skip Sniffles and only retain the extraction of INDELs indicated by the CIGAR in the alignment file. However, it will take a few days to update the pipeline and perform tests.

Another question: did you build the container (the .simg image) yourself, or did you get it from this link?

"... is it possible to update the .simg image?"

What kind of update did you have in mind? A Sniffles update?

Sorry again, M-D

lilypeck commented 5 months ago

Hi @M-D75

No problem, thank you for your reply.

Yes, I run my analyses on a computing cluster. Each time I run it, the system automatically assigns it to a node, which 99% of the time is a different node from the previous run.

I downloaded the .simg image from the link. The update I referenced was the sniffles one you suggested in your original response. Happy with whichever option you think is best! I can have a go and let you know if it has worked or not?

Thank you for your help!

Lily

M-D75 commented 5 months ago

Hi,

An update is available. Simply replace CALL_SV: "sniffles" with CALL_SV: "no_sniffles" in your .yaml configuration file. This will run the pipeline without the sniffles part. Risk: Lower TE detection.
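The configuration change described above can also be applied programmatically, for instance when several sample configs need the same edit. The sketch below is an illustration only: `set_call_sv` is a hypothetical helper, not part of TrEMOLO, and it assumes the config contains exactly one `CALL_SV:` line in the form shown in the first post. It edits the text directly so that the comments in the file are preserved.

```python
import re


def set_call_sv(config_text, caller):
    """Return config_text with the value of CALL_SV replaced by caller.

    Raises ValueError unless exactly one CALL_SV line is found, so an
    unexpected config layout is not modified silently.
    """
    # Match the key, then an optionally quoted value; the rest of the
    # line (for example a trailing comment) is left untouched.
    pattern = re.compile(r'^([ \t]*CALL_SV:[ \t]*)(["\']?)[\w-]+\2', re.MULTILINE)
    new_text, count = pattern.subn(
        lambda match: f'{match.group(1)}"{caller}"', config_text
    )
    if count != 1:
        raise ValueError(f"expected exactly one CALL_SV line, found {count}")
    return new_text


example = '        CALL_SV: "sniffles" # possibility (sniffles, svim)\n'
print(set_call_sv(example, "no_sniffles"), end="")
```

The updated text would then be written back to the .yaml file passed to `--configfile`.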

I hope this resolves your issue.

Do not hesitate to report any other issues. There are still other updates to come.

Best, M-D

lilypeck commented 5 months ago

Hi @M-D75

Thank you very much for your help.

I have re-run my script, but I am getting the following error message RE SVs, could this be caused by running without sniffles? (see output file for full details)

MissingInputException in line 2923 of /u/project/vlsork/ldpeck/tremolo/TrEMOLO/Snakefile:
Missing input files for rule TrEMOLO_SV_TE:
TrEMOLO_OUTPUT_barcode03/OUTSIDER/VARIANT_CALLING/SV.vcf

[SNK INFO] DRY RUN ERROR PIPELINE : please check your config file

Thank you!

Lily

barcode03_ragtag.yaml.txt tremolo-run.sh.o3877363.txt

M-D75 commented 5 months ago

Hi,

Sorry, first you need to download the update:

git pull
#or
git clone https://github.com/DrosophilaGenomeEvolution/TrEMOLO.git

Then you can restart the pipeline from the new update. Let me know if it works.

Best, M-D

lilypeck commented 4 months ago

Hi @M-D75

Thank you, I have now updated both the .simg and the .git

However it is still getting stuck on calling SVs, it started on 2nd July and is still running today, but the .log hasn't updated since 3rd July.

My runscript was apptainer exec TrEMOLO.simg snakemake --snakefile TrEMOLO/run.snk --configfile scripts/barcode03_ragtag.yaml

See attached files. Any help would be much appreciated.

Thank you

Lily

barcode03_ragtag.yaml.txt tremolo-run.sh.o3879674.txt

M-D75 commented 4 months ago

Hi,

This is strange. I will consider other alternatives, as I am currently having difficulty identifying the problem. I will contact you again if any changes are made.

Sorry, M-D

M-D75 commented 4 months ago

Hi,

Sorry, I haven’t found a solution for now. It is difficult when we cannot reproduce the bug in question. I would like to have more information. Could you please send me the OUTSIDER/MAPPING/stats.txt file from your analysis?

Best, M-D

lilypeck commented 4 months ago

Hi M-D

Thank you very much. I have attached the stats.txt file.

Let me know.

Thanks

Lily

raw total sequences: 4280038
filtered sequences: 0
sequences: 4280038
is sorted: 0
1st fragments: 4280038
last fragments: 0
reads mapped: 4279191
reads mapped and paired: 0 # paired-end technology bit set + both mates mapped
reads unmapped: 847
reads properly paired: 0 # proper-pair bit set
reads paired: 0 # paired-end technology bit set
reads duplicated: 0 # PCR or optical duplicate bit set
reads MQ0: 48492 # mapped and MQ=0
reads QC failed: 0
non-primary alignments: 4994972
total length: 38779490476 # ignores clipping
total first fragment length: 38779490476 # ignores clipping
total last fragment length: 0 # ignores clipping
bases mapped: 38778690649 # ignores clipping
bases mapped (cigar): 38511917654 # more accurate
bases trimmed: 0
bases duplicated: 0
mismatches: 3155994396 # from NM fields
error rate: 8.194851e-02 # mismatches / bases mapped (cigar)
average length: 9060
average first fragment length: 9061
average last fragment length: 0
maximum length: 221426
maximum first fragment length: 0
maximum last fragment length: 0
average quality: 33.5
insert size average: 0.0
insert size standard deviation: 0.0
inward oriented pairs: 0
outward oriented pairs: 0
pairs with other orientation: 0
pairs on different chromosomes: 0
percentage of properly paired reads (%): 0.0
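A few summary figures can be derived from the stats above. The sketch below parses `key: value` lines and computes the approximate sequencing depth, the fraction of mapped reads, and the error rate. The genome size of 830 Mb is taken from the first post and is an approximation, so the depth is indicative only; `parse_stats` is a hypothetical helper and only a handful of the reported lines are embedded for illustration.

```python
def parse_stats(text):
    """Parse 'key: value # comment' lines into a dict of floats."""
    stats = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if ":" not in line:
            continue
        key, value = line.rsplit(":", 1)
        try:
            stats[key.strip()] = float(value)
        except ValueError:
            continue  # skip non-numeric values
    return stats


# Values copied from the stats posted in this thread.
SAMPLE = """\
raw total sequences: 4280038
reads mapped: 4279191
total length: 38779490476 # ignores clipping
bases mapped (cigar): 38511917654 # more accurate
mismatches: 3155994396 # from NM fields
"""

GENOME_SIZE = 830e6  # assumption: "~ 830 Mb" as stated in the first post

stats = parse_stats(SAMPLE)
depth = stats["total length"] / GENOME_SIZE
mapped_fraction = stats["reads mapped"] / stats["raw total sequences"]
error_rate = stats["mismatches"] / stats["bases mapped (cigar)"]

print(f"approximate depth: {depth:.1f}x")
print(f"mapped fraction:   {mapped_fraction:.4f}")
print(f"error rate:        {error_rate:.4f}")
```

Under these assumptions the data set is roughly 47x coverage with almost all reads mapped, which suggests the input itself is unremarkable in size and mapping quality.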

M-D75 commented 4 months ago

Hi,

38 Gb of data is not much; it should not take 48 hours.

To avoid restarting everything, can you test this:

apptainer exec TrEMOLO.simg sniffles -t 15 --report-seq -s 1 -m /path/to/your/work_directory/OUTSIDER/MAPPING/SAMPLE_mapping_GENOME_MD.sorted.bam -v /path/to/your/OUTSIDER/VARIANT_CALLING/SV.vcf -n -1

-t 15 sets 15 threads; adjust it to your capacity, and replace /path/to/your/work_directory/ accordingly.

I would like to know whether the 48-hour issue comes directly from extracting SVs on your data, or from something else such as an outdated version of Snakemake. If it is the latter, I can update as many programs as necessary.

If it takes more than 24 hours, there is no point in continuing: given the amount of data, it should take less than 24 hours.
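The 24-hour cut-off suggested above can be enforced automatically instead of watching the job by hand. A minimal sketch using Python's standard `subprocess` module follows; `run_with_time_limit` is a hypothetical helper, and the command list simply repeats the Sniffles call from this comment with its placeholder paths left as they are.

```python
import subprocess


def run_with_time_limit(cmd, hours):
    """Run cmd and report 'finished', 'failed' or 'timed out'.

    subprocess.run kills the child process when the timeout expires.
    """
    try:
        completed = subprocess.run(cmd, timeout=hours * 3600, check=False)
    except subprocess.TimeoutExpired:
        return "timed out"
    return "finished" if completed.returncode == 0 else "failed"


# The command from this thread; the /path/to/your/... parts are
# placeholders that must be replaced before running.
SNIFFLES_CMD = [
    "apptainer", "exec", "TrEMOLO.simg",
    "sniffles", "-t", "15", "--report-seq", "-s", "1",
    "-m", "/path/to/your/work_directory/OUTSIDER/MAPPING/SAMPLE_mapping_GENOME_MD.sorted.bam",
    "-v", "/path/to/your/OUTSIDER/VARIANT_CALLING/SV.vcf",
    "-n", "-1",
]

# Example call (not executed here):
# status = run_with_time_limit(SNIFFLES_CMD, hours=24)
```

On a scheduler-managed cluster the same effect is usually achieved by setting the job's wall-time limit; the sketch is mainly useful for interactive tests.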

thanks, M-D

lilypeck commented 4 months ago

Hi M-D

Thank you very much for your help.

I ran the below script and it runs for over 24 hours, see attached job log.

Let me know if there is something else I can try.

Thanks

Lily

M-D75 commented 4 months ago

Hi,

There will be an update this Tuesday, with a few small changes to package versions that I hope will resolve the issue.

Best, M-D

M-D75 commented 3 months ago

Hi,

Sorry, I forgot to tell you.

Updated on a new branch.

Command to fetch the update:

git clone -b fix_issue_23 https://github.com/DrosophilaGenomeEvolution/TrEMOLO.git

However, you should rebuild the Singularity (Apptainer) container with this update.

sudo singularity build TrEMOLO.simg TrEMOLO/Singularity

Some packages have been updated, in particular the package responsible for parsing the alignment file.

You can then just try running the sniffles command as before:

apptainer exec TrEMOLO.simg sniffles -t 15 --report-seq -s 1 -m /path/to/your/work_directory/OUTSIDER/MAPPING/SAMPLE_mapping_GENOME_MD.sorted.bam -v /path/to/your/OUTSIDER/VARIANT_CALLING/SV.vcf -n -1

Best, M-D

lilypeck commented 3 months ago

Hi M-D

Thank you very much for your help. Could I check whether you updated the pre-compiled Singularity container? I don’t have sudo rights as I’m using a server, so previously I downloaded the pre-compiled container. I have tried this update, but it still takes >24 hours to complete.

Thanks

Lily

M-D75 commented 3 months ago

Hi,

Sorry, I'll send you a download link.

Sorry again, M-D

M-D75 commented 3 months ago

Here is the link.

Best, M-D

lilypeck commented 2 months ago

Hi @M-D75 I'm sorry for the delay, could you please re-send as the link has expired? Thanks Lily

M-D75 commented 2 months ago

Hi,

No problem, new link

Best, M-D

lilypeck commented 2 months ago

Hi @M-D75

Unfortunately it is still not finishing in 24 hours, see attached files. Would it help if I shared a google drive folder containing my input files?

thanks

Lily

stats.txt

tremolo-test.sh.o4828185.txt

tremolo-test.sh.txt

M-D75 commented 2 months ago

Hi,

Yes, it would help a lot if you could share the input files. I have a small idea of the potential problem. Having the input data would allow me to verify my hypothesis.

Thanks, M-D

lilypeck commented 2 months ago

Great thank you, are you able to share an email address please and I will send you a link to the folder?

Thanks

Lily

M-D75 commented 2 months ago

Yes, of course.

mourdas.mohamed[]ird.fr

lilypeck commented 2 months ago

Great, I have shared the folder with your email

https://drive.google.com/drive/folders/14GMXEKuCr2k6BN4Pdrbl9O4NGYmFNlKE?usp=sharing

M-D75 commented 2 months ago

Thank you, I got it. I will keep you informed whether I have found the solution or not.

M-D75 commented 1 month ago

Hi,

I was able to check a few things. Sniffles takes about 5 days with 20 threads on your data. There is no blocking or bug as I initially thought; it's just very slow. So, in the end, it's mostly an optimization issue. I have a few ideas that I will test to improve the speed, but I am still unsure if this will be at the risk of losing some information and, if so, to what extent.

I haven't run the entire pipeline on your data yet, but I imagine it would take more than a week, which is far too long. However, by skipping some lengthy steps, except for Sniffles, I believe we could reduce the processing time to 5 days, albeit at the cost of a less comprehensive TE detection.

I will test different options, and if I get complete data, I will send it to you along with the fix so you can apply it to other datasets.

Thank you for your patience, M-D

lilypeck commented 1 month ago

Hi @M-D75

Thank you very much for your help! If possible, complete data + fix would be great. I will wait to hear from you.

Thanks

Lily

M-D75 commented 1 month ago

Hi,

Just to let you know that I finally found the problem and was able to improve the speed on certain points. I'll let you know more once all the checks are complete.

Thanks for your patience, M-D

M-D75 commented 1 month ago

Hi,

I wanted to inform you that I was able to execute the pipeline on your data after modifying the code. The entire execution took about 14 hours using 20 CPUs, and the part that was lengthy and problematic took 4 hours.

I will send you a link to retrieve the output. A version of the tool will be provided with specific instructions, as it is not yet fully stable for all cases. I will continue to conduct tests and improve performance in certain areas.

Best, M-D

lilypeck commented 1 month ago

Hi,

That is great news, thank you very much

Thanks

Lily

M-D75 commented 4 weeks ago

Hi,

Could you give me an e-mail address so that I can share the link? I'm having trouble uploading some of the data.

Best, M-D

lilypeck commented 4 weeks ago

Hi,

Thank you! It is ldpeck[at]ucla.edu

Thanks

Lily

M-D75 commented 2 weeks ago

Hi,

The main tests are complete. You can clone the fix_issue_23 branch as follows:

Using the command:

git clone https://github.com/DrosophilaGenomeEvolution/TrEMOLO.git -b fix_issue_23

Or from your existing repository:

git pull
git checkout fix_issue_23

For the Singularity container, please use the previous version available on the Git repository: https://github.com/DrosophilaGenomeEvolution/TrEMOLO/releases/download/v2.5.4b/TrEMOLO.simg

Important notes:

A new parameter, TIME_LIMIT, has been added to the configuration file. It specifies the maximum number of hours you are willing to allocate to the task of retrieving potential TE insertions. If the value is set to 0, there is no time limit. With the modifications made to the pipeline, 4 hours were sufficient on your data using 20 threads.

You can keep the value CALL_SV: no_sniffles.

Do not enable CLIPPED_READS (that is, do not set CLIPPED_READS: True), as it would result in excessively long processing times for your data.

I hope this will work for you. I would be interested to know if it does. Thanks again for your help.

Thanks, M-D

lilypeck commented 2 weeks ago

Hi M-D

Thank you very much for your time with this fix.

I haven’t been able to download the files yet as I am travelling, once I’m back in the office I can download them.

I will let you know how I get on.

Thanks

Lily

M-D75 commented 2 weeks ago

Hi,

Noted. I think the link has expired; I'll generate another.

M-D

lilypeck commented 3 days ago

Hi M-D

I am now back in the office and able to download.

Thanks

Lily

M-D75 commented 2 days ago

Hi,

OK I'll prepare the links.

M-D