DrosophilaGenomeEvolution / TrEMOLO

Transposable Elements MOvement detection using LOng reads
GNU General Public License v3.0

Questions about usage of TrEMOLO #11

Open HLHsieh opened 10 months ago

HLHsieh commented 10 months ago

Hello,

Thank you for providing this remarkable tool. I'm curious about the possibility of utilizing it for the detection of other interspersed repeats beyond TEs. Specifically, I have a collection of motifs that are distributed throughout the genome. My intention is to employ this tool to identify the occurrences of these motifs in various genomic regions.

Additionally, I'd like to inquire if an alternative version of the tool is available that doesn't necessitate compilation. I'm currently working within the supercomputer infrastructure at my university, and unfortunately, I lack the necessary permissions to execute sudo commands. Having a version that doesn't require compilation would greatly facilitate my research.

Thank you for your assistance and consideration.

Best regards, Hsin

M-D75 commented 10 months ago

Hi,

Thank you for your interest in our tool.

Indeed, our tool can be utilized to detect motifs other than TEs throughout the genome. However, it's essential to highlight that the motif size plays a crucial role in the detection process. Depending on the size, the detection accuracy and speed might vary.

Motifs below 30bp won't be detected. Between 30bp and 50bp, if the motifs in the database (DB) diverge significantly from sequences on the reads, there's a risk of missing some detections.

If your motifs are shorter than 500 bp, you'll need to modify one of the default parameters in the .yaml file: change GET_SEQ_REPORT_OPTION: "-m 500" to GET_SEQ_REPORT_OPTION: "-m 30".

Also, note that the parameters PARS_BLN_OPTION: "--min-size-percent 80 --min-pident 80" will only output motifs within the reads and/or assembly whose size and identity are >= 80% compared to the motifs in the DB. Having smaller motifs thus reduces the possible diversity margin.
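To make that margin concrete, here is a small shell sketch of the size threshold alone. It assumes that --min-size-percent 80 means a hit must span at least 80% of the DB motif's length, as described above; the helper name min_hit_len is hypothetical and not part of TrEMOLO.

```shell
# min_hit_len MOTIF_LEN PERCENT: shortest hit length (bp) that still reaches
# PERCENT of the motif length (integer ceiling). Hypothetical helper, not a TrEMOLO command.
min_hit_len() {
    echo $(( ($1 * $2 + 99) / 100 ))
}

for motif_len in 30 50 100 500; do
    min_len=$(min_hit_len "$motif_len" 80)
    echo "motif ${motif_len} bp: hit must be >= ${min_len} bp (margin $(( motif_len - min_len )) bp)"
done
```

Under this assumption a 30 bp motif leaves only 6 bp of slack, whereas a 500 bp motif leaves 100 bp.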

You can also adjust the parameter SIZE_FLANK: 30 to SIZE_FLANK: 5. If you're not targeting Transposable Elements (TEs), the detection of TSDs (Target Site Duplications) becomes less relevant. Reducing this parameter will enhance the processing speed.
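The parameter changes described above can also be applied from the command line. This is only a sketch: the stand-in file and its flat layout are assumptions (the real .yaml file may nest these keys and contains many more), so the example works on a throwaway copy holding just the three defaults quoted in this thread.

```shell
# Stand-in config holding only the defaults quoted above
# (hypothetical file, not the real TrEMOLO config).
config=$(mktemp)
cat > "$config" <<'EOF'
GET_SEQ_REPORT_OPTION: "-m 500"
PARS_BLN_OPTION: "--min-size-percent 80 --min-pident 80"
SIZE_FLANK: 30
EOF

# Rewrite the value after each key, whatever its indentation;
# the .bak copy keeps the original values.
sed -i.bak \
    -e 's/^\([[:space:]]*GET_SEQ_REPORT_OPTION:\).*/\1 "-m 30"/' \
    -e 's/^\([[:space:]]*SIZE_FLANK:\).*/\1 5/' \
    "$config"

cat "$config"
```

PARS_BLN_OPTION is left untouched here; whether to relax it depends on how much divergence you expect between your motifs and the reads.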

Some limitations to be aware of:

Regarding a version that doesn't require compilation, I understand the challenges you face within the supercomputer environment. While our primary version does require compilation, we are considering offering a pre-compiled version or an alternative in the future. For now, here's a pre-compiled version available for download at the following link: TrEMOLO.simg, md5Sum.txt.
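Before using the image, it is worth checking the download against the published checksum. The sketch below assumes that GNU md5sum is available and that md5Sum.txt starts with the hash of TrEMOLO.simg; both are assumptions, and verify_md5 is a hypothetical helper.

```shell
# verify_md5 FILE EXPECTED: succeeds only if the MD5 digest of FILE equals EXPECTED.
verify_md5() {
    actual=$(md5sum "$1" | awk '{ print $1 }')
    [ "$actual" = "$2" ]
}

# Intended use with the files linked above
# (assumes the hash is the first field of md5Sum.txt):
#   verify_md5 TrEMOLO.simg "$(awk '{ print $1; exit }' md5Sum.txt)" && echo "image OK"
```

Running an existing image with singularity exec normally does not require root privileges, which is what makes this route practical on a shared cluster.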

Thank you for your feedback, and please don't hesitate to reach out if you have further questions or concerns.

Best regards, M-D

HLHsieh commented 9 months ago

Hi M-D,

I appreciated your clear explanation, and I find this pre-compiled version incredibly beneficial for my research. I have a couple of follow-up questions, and I hope these questions aren't too basic:

1) I've noticed that I need to provide at least two files: GENOME and TE_DB, while the REFERENCE and SAMPLE files are optional. I'm curious about whether the GENOME file is derived from the SAMPLE file, as I might be confusing it with the genome reference. If they are the same, do you have any recommendations for assembling the sample?

2) How crucial is the quality of the reads for the analysis? I only have a set of FASTA files without quality scores. I'm contemplating whether I can still use these FASTA files with this algorithm or if it's feasible to simulate quality scores for the FASTA files to convert them into FASTQ files.

Thank you for your assistance.

Best, Hsin

M-D75 commented 9 months ago

Hello,

Thank you for your interest in our tool.

The GENOME file is not necessarily derived from the SAMPLE. This depends on your objective, what you are looking for, as well as the context of your analyses.

To put it simply, the pipeline's goal is to identify differences between a SAMPLE and a GENOME, or between a GENOME and a REFERENCE, or between a SAMPLE+GENOME and a REFERENCE.

If you choose to use a GENOME that's not derived from the SAMPLE along with a SAMPLE, this corresponds to a standard analysis where the GENOME will act as a reference. However, if your SAMPLE significantly diverges from your GENOME, you risk losing information (for instance, some reads might not align). In this specific case, it might be relevant to have a GENOME derived from or closely related to the SAMPLE, and to add a REFERENCE. Context is crucial. For instance, if you've sequenced a single individual whose entire genome is homozygous, there's no need to provide a GENOME derived from the SAMPLE. In the case of a SAMPLE assembly, all the information should be contained within the assembled GENOME.

Regarding the assembly of the sample, it depends on many factors, but you can use the CulebrONT tool which will help you select the best tool for your assemblies.

The quality of the data has only a minor impact. We've tested samples with error rates ranging from 3% to 11%, and it hasn't made much difference. You won't be able to supply a .fasta file directly, as some steps won't work. However, you can convert your .fasta files into .fastq by simulating the quality.

For example, with this command:


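awk '
# Emit the buffered sequence, a "+" separator, and one "?" per base as placeholder quality.
function flush_record(    i) {
    if (seq == "") return
    print seq
    print "+"
    for (i = 1; i <= length(seq); i++) printf "?"
    print ""
    seq = ""
}
/^>/ {                  # FASTA header: finish the previous record, then print "@name"
    flush_record()
    sub(/^>/, "@")
    print
    next
}
{ seq = seq $0 }        # sequence line: join wrapped lines into one
END { flush_record() }  # last record (nothing is printed for empty input)
' /path/to/your_fasta_file.fasta > /path/to/output_fastq_file.fastq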

This command applies the same quality (?) to all reads.
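In the common Phred+33 encoding, the character ? corresponds to a quality of 30 (ASCII 63 minus 33). As a quick sanity check on the converted file, the following sketch verifies that every record has four lines and that each quality string is as long as its sequence. check_fastq is a hypothetical helper and assumes unwrapped, four-line FASTQ records such as those the conversion command writes.

```shell
# check_fastq FILE: succeeds if FILE is made of 4-line records
# (@header, sequence, +, quality) with equal sequence and quality lengths.
check_fastq() {
    awk '
        NR % 4 == 1 && !/^@/                 { bad = 1 }
        NR % 4 == 2                          { seq_len = length($0) }
        NR % 4 == 3 && !/^\+/                { bad = 1 }
        NR % 4 == 0 && length($0) != seq_len { bad = 1 }
        END { if (bad || NR == 0 || NR % 4 != 0) exit 1 }
    ' "$1"
}

# Example: check_fastq /path/to/output_fastq_file.fastq && echo "FASTQ looks consistent"
```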

I hope these answers are clear and that they will be useful to you.

Don't hesitate to ask if you have any other questions.

Best, M-D

HLHsieh commented 7 months ago

Hi M-D,

Thank you for providing the script for converting .fasta files to .fastq; it resolved the issue with my simulated data. As mentioned earlier, I aim to detect interspersed repeats in my simulated data, each approximately 100 bp in length. Although my dataset contains simulated interspersed repeats, I am unable to detect them using the tool (please see the errors below). I'm curious if there are limitations beyond repeat length or if I may have executed the tool incorrectly.

•••
 [SNK]--[Thu Oct 19 22:27:50 EDT 2023] GET INSERTION TrEMOLO [^-^] 
•••
[Thu Oct 19 22:27:50 EDT 2023] LOG TASK /scratch/stimulated_NanoSim_2x/TrEMOLO_test_2/log/sniffles.out, /scratch/stimulated_NanoSim_2x/TrEMOLO_test_2/log/sniffles.err
GET INSERTION...
 [TrEMOLO_SV_TE] ERROR : NO INSERTION FOUND
AN ERROR OCCURRED

[SNK INFO] ERROR PIPELINE; snakefile used : /scratch/stimulated_NanoSim_2x/TrEMOLO_test_2/SNAKE_USED/Snakefile_outsider.snk
    Check LOG   : /scratch/stimulated_NanoSim_2x/TrEMOLO_test_2/log/Snakefile_outsider.log
    Check ERROR : /scratch/stimulated_NanoSim_2x/TrEMOLO_test_2/log/Snakefile_outsider.err
/usr/bin/bash: line 176: kill: (544323) - No such process
Removing temporary output file /scratch/stimulated_NanoSim_2x/TrEMOLO_test_2/rep_tmp_snk.
[Thu Oct 19 22:28:09 2023]
Finished job 0.
1 of 1 steps (100%) done
Complete log: /hsinlun/bin/.snakemake/log/2023-10-19T222211.940054.snakemake.log
CHECKING OF DB TE...

Thank you for your assistance and consideration.

Best regards, Hsin

M-D75 commented 7 months ago

Hi,

Thank you for reporting this issue. Indeed, the default minimum size for insertion detection was set to 200 bp for the TrEMOLO_SV_TE step. An update has been made, and the minimum size is now set to 30 bp. We apologize for this inconvenience.

After retrieving the update (with git pull, or by re-cloning the TrEMOLO pipeline), you should be able to identify the patterns in question, provided that they are not also present in your reference genome (the GENOME parameter).
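One way to check that last condition is to look for each motif directly in the GENOME file. The sketch below is a rough pre-check under stated assumptions: it only finds exact, forward-strand matches (not reverse complements or diverged copies), it holds each sequence in memory, and motif_in_genome is a hypothetical helper rather than part of TrEMOLO.

```shell
# motif_in_genome MOTIF FASTA: succeeds if MOTIF occurs exactly (case-insensitive,
# forward strand only) in any sequence of FASTA, even across wrapped lines.
motif_in_genome() {
    awk -v motif="$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')" '
        function check() { if (index(toupper(seq), motif) > 0) found = 1 }
        /^>/ { check(); seq = ""; next }   # new record: test the previous sequence
        { seq = seq $0 }                   # join wrapped sequence lines
        END { check(); exit (found ? 0 : 1) }
    ' "$2"
}

# Example: motif_in_genome ACGTTGG /path/to/genome.fasta && echo "motif already in GENOME"
```

A match here suggests the motif copy is part of the genome itself and so would not be reported as a new insertion; a miss does not rule out diverged or reverse-strand copies.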

Please note that the new update requires a new Singularity container, which you can download using this link: TrEMOLO.simg.

Thank you again, and please do not hesitate to report back if the problem continues.

Best regards, M-D