bergmanlab / mcclintock

Meta-pipeline to identify transposable element insertions using next generation sequencing data

Question #81

Closed ohan-Bioinfo closed 3 years ago

ohan-Bioinfo commented 3 years ago

Greetings,

I would like to use the pipeline to identify TEs in a plant genome, using raw data downloaded from SRA. My questions concern the input files:

Reference genome: can the genome be at scaffold level? Is this what has been assembled from the raw reads?

Consensus sequences: are these transposable element sequences from databases (such as Repbase)? Can I use sequences obtained from structure-based tools run on the same genome as the consensus here?

cbergman commented 3 years ago

Hi @ohan-Bioinfo

> Reference genome: can the genome be at scaffold level? Is this what has been assembled from the raw reads?

The reference genome can be a whole-genome assembly at any level: contigs, scaffolds or super-scaffolds. Raw reads are used to assemble a reference genome, but the reads used to generate the reference assembly are different from the raw reads for the sample in which you are trying to find TEs (supplied as arguments to the -1 and -2 options).

> Consensus sequences: are these transposable element sequences from databases (such as Repbase)? Can I use sequences obtained from structure-based tools run on the same genome as the consensus here?

Yes, the file supplied to the -c option is a FASTA file of TE sequences. These can be obtained from RepBase, from de novo or structure-based TE discovery tools (e.g. RepeatModeler or LTRharvest), or curated from other sources. Typically you want only one representative TE sequence per family (either a consensus sequence derived from multiple instances or a canonical sequence that is representative of the family).
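As a rough illustration of the "one representative per family" point, here is a minimal Python sketch that lists names occurring more than once in a consensus FASTA file. It is not part of McClintock. It assumes that the first word of each FASTA header is the family name, which may not hold for every library, and the sequence names in the example are made up.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            parts = line[1:].split()
            # Assumption: the first word of the header is the family name.
            name = parts[0] if parts else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def duplicated_names(lines: Iterable[str]) -> List[str]:
    """Return the sorted names that appear more than once."""
    counts = Counter(name for name, _ in read_fasta(lines))
    return sorted(n for n, c in counts.items() if c > 1)


# Hypothetical records, for illustration only.
example = [">Copia_1", "ACGT", ">Gypsy_1", "GGCC", ">Copia_1 copy2", "ACGA"]
print(duplicated_names(example))
```

To check a real file, pass an open file handle instead of the list, for example `duplicated_names(open("ConsensusUniq.fa"))`. A name that is reported twice suggests the family has more than one entry and may need to be reduced to a single representative.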

Please let us know if this answers your questions.

ohan-Bioinfo commented 3 years ago

An error occurred on the second day of the run. Command:

python3 mcclintock.py --reference Genome.fna --consensus ConsensusUniq.fa --first SRR414989.fastq --out Output/

MissingOutputException in line 28 of /home/ohan/Documents/Tools/mcclintock/snakefiles/ngs_te_mapper2.snakefile:
Job completed successfully, but some output files are missing. Missing files after 5 seconds:
/home/mohan/Documents/Tools/mcclintock/Output/SRR414989/results/ngs_te_mapper2/SRR414989_ngs_te_mapper2_nonredundant.bed
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
  File "/home/ohan/miniconda/envs/mcclintock/lib/python3.7/site-packages/snakemake/executors/__init__.py", line 544, in handle_job_success
  File "/home/ohan/miniconda/envs/mcclintock/lib/python3.7/site-packages

ohan-Bioinfo commented 3 years ago

> The reference genome can be a whole-genome assembly at any level: contigs, scaffolds or super-scaffolds. Raw reads are used to assemble a reference genome, but the reads used to generate the reference assembly are different from the raw reads for the sample in which you are trying to find TEs (supplied as arguments to the -1 and -2 options).

May I ask for more clarification here? Does the tool require the reads that were used earlier to create the assembly (which is used here as the reference)? What do you mean by the difference? Are the raw reads for -1 and -2 the paired reads deposited in SRA at NCBI?

cbergman commented 3 years ago

> May I ask for more clarification here? Does the tool require the reads that were used earlier to create the assembly (which is used here as the reference)? What do you mean by the difference? Are the raw reads for -1 and -2 the paired reads deposited in SRA at NCBI?

The typical process is that a reference genome will be assembled from a set of reads from a sample (call this sample1). The assembly process is done outside of McClintock and is usually a fixed resource used by the community. You will then use reads (paired, or unpaired) from a different sample (e.g. sample2) as input to McClintock to call TE insertions that are present in sample2 but not in the reference genome (so-called "non-reference" insertions).
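To make the sample1/sample2 distinction concrete, here is a small Python sketch that only assembles the argument list for a run; it does not launch anything. The `--reference`, `--consensus`, `--first` and `--out` flags are the ones used in the command posted earlier in this thread, and `-2` is the second-mate option mentioned above. The function name and all file names are hypothetical.

```python
from typing import List, Optional


def build_mcclintock_command(
    reference: str,
    consensus: str,
    first: str,
    out: str,
    second: Optional[str] = None,
) -> List[str]:
    """Assemble (but do not run) a McClintock command line.

    reference: assembly built elsewhere from sample1 reads (a fixed resource)
    consensus: FASTA file of TE sequences
    first/second: reads from the sample being analysed (sample2)
    """
    cmd = [
        "python3", "mcclintock.py",
        "--reference", reference,
        "--consensus", consensus,
        "--first", first,
    ]
    if second is not None:
        # Paired-end data: the second mate file goes to -2.
        cmd += ["-2", second]
    cmd += ["--out", out]
    return cmd


# Hypothetical file names, for illustration only.
paired = build_mcclintock_command(
    "reference.fna", "te_consensus.fa",
    "sample2_1.fastq", "Output/", second="sample2_2.fastq",
)
print(" ".join(paired))
```

Note that the reads from sample1 never appear in the command: only the finished assembly does. Leaving out `second` gives the unpaired form, which matches the single-file command shown earlier in the thread.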

pbasting commented 3 years ago