Question - Githubissues

ohan-Bioinfo commented 3 years ago

Dear Greetings,

I would like to use the pipeline to identify TEs in a plant genome, using raw data downloaded from SRA. My questions concern the inputs:

Reference genome: can the genome be at scaffold level? Is this what is assembled from the raw reads?

Consensus sequences: are these transposable element sequences from databases (such as Repbase)? Can I use sequences obtained from structure-based tools run on the same genome as the consensus here?

cbergman commented 3 years ago

Hi @ohan-Bioinfo

> Reference genome: can the genome be at scaffold level? Is this what is assembled from the raw reads?

The reference genome can be at any whole-genome assembly level: contigs, scaffolds, or super-scaffolds. Raw reads are used to assemble a reference genome, but the reads used to generate the reference assembly are different from the raw reads for the sample you are trying to find TEs in (the latter are supplied as arguments to the -1 and -2 options).

> Consensus sequences: are these transposable element sequences from databases (such as Repbase)? Can I use sequences obtained from structure-based tools run on the same genome as the consensus here?

Yes, the file supplied to the -c option is a FASTA file of TE sequences. These can be obtained from RepBase, from de novo or structure-based TE discovery tools (e.g. RepeatModeler or LTRharvest), or curated from other sources. Typically you want only one representative TE sequence per family (either a consensus sequence derived from multiple instances or a canonical sequence that is representative of the family).
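As a rough illustration of the "one representative sequence per family" advice, here is a minimal Python sketch that reduces a FASTA file to a single entry per family. It rests on assumptions that are not part of McClintock: the header convention (family name before the first `#` or whitespace) and the function names are hypothetical, and keeping the longest copy is only a crude stand-in for building a proper consensus or choosing a canonical element.

```python
import re


def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def family_of(header):
    # ASSUMPTION (hypothetical header convention): the family name is the
    # part of the header before the first '#' or whitespace character.
    return re.split(r"[#\s]", header, maxsplit=1)[0]


def one_per_family(records):
    """Keep the longest sequence seen for each family.

    This is a simplification: the longest copy is not a consensus sequence.
    """
    best = {}
    for header, seq in records:
        fam = family_of(header)
        if fam not in best or len(seq) > len(best[fam][1]):
            best[fam] = (header, seq)
    return list(best.values())


def write_fasta(records):
    """Format (header, sequence) tuples back into FASTA text."""
    return "".join(">{}\n{}\n".format(h, s) for h, s in records)
```

If the TE library comes from a tool that reports individual elements rather than families, the elements would first need to be clustered into families by some other means before a step like this makes sense.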

Please let us know if this answers your questions.

ohan-Bioinfo commented 3 years ago

An error occurred on the second day of the run. Command:

python3 mcclintock.py --reference Genome.fna --consensus ConsensusUniq.fa --first SRR414989.fastq --out Output/

MissingOutputException in line 28 of /home/ohan/Documents/Tools/mcclintock/snakefiles/ngs_te_mapper2.snakefile:
Job completed successfully, but some output files are missing. Missing files after 5 seconds:
/home/mohan/Documents/Tools/mcclintock/Output/SRR414989/results/ngs_te_mapper2/SRR414989_ngs_te_mapper2_nonredundant.bed
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
  File "/home/ohan/miniconda/envs/mcclintock/lib/python3.7/site-packages/snakemake/executors/__init__.py", line 544, in handle_job_success
  File "/home/ohan/miniconda/envs/mcclintock/lib/python3.7/site-packages

ohan-Bioinfo commented 3 years ago

> The reference genome can be at any whole-genome assembly level: contigs, scaffolds, or super-scaffolds. Raw reads are used to assemble a reference genome, but the reads used to generate the reference assembly are different from the raw reads for the sample you are trying to find TEs in (the latter are supplied as arguments to the -1 and -2 options).

May I ask for more clarification here? Does the tool require the reads that were used earlier to create the assembly (which is used here as the reference)? What do you mean by the difference? Are the raw reads for -1 and -2 the paired reads deposited in SRA at NCBI?

cbergman commented 3 years ago

> May I ask for more clarification here? Does the tool require the reads that were used earlier to create the assembly (which is used here as the reference)? What do you mean by the difference? Are the raw reads for -1 and -2 the paired reads deposited in SRA at NCBI?

The typical process is that a reference genome will be assembled from a set of reads from a sample (call this sample1). The assembly process is done outside of McClintock and is usually a fixed resource used by the community. You will then use reads (paired, or unpaired) from a different sample (e.g. sample2) as input to McClintock to call TE insertions that are present in sample2 but not in the reference genome (so-called "non-reference" insertions).
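To make the sample1/sample2 distinction concrete, the sketch below assembles a McClintock command line in Python. The `--reference`, `--consensus`, `--first` and `--out` flags are taken from the command posted earlier in this thread, and `-2` from the reply above; all file names and the helper function are hypothetical placeholders.

```python
import shlex


def mcclintock_command(reference, consensus, first, out, second=None):
    """Build the argument list for a McClintock run.

    reference: assembly built beforehand from sample1 (outside McClintock)
    first/second: DNA-seq reads from sample2, the sample being screened
    """
    cmd = [
        "python3", "mcclintock.py",
        "--reference", reference,
        "--consensus", consensus,
        "--first", first,
    ]
    if second is not None:
        # Second read file of a paired-end run; omitted for unpaired data.
        cmd += ["-2", second]
    cmd += ["--out", out]
    return cmd


# Hypothetical file names, for illustration only.
cmd = mcclintock_command(
    reference="sample1_assembly.fna",
    consensus="te_consensus.fa",
    first="sample2_R1.fastq",
    second="sample2_R2.fastq",
    out="Output/",
)
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The point of the sketch is only that the reference and the read files come from different sources: the assembly is a fixed input, while the reads belong to the sample in which non-reference insertions are being called.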

cbergman commented 3 years ago

Regarding the error in https://github.com/bergmanlab/mcclintock/issues/81#issuecomment-770633267, @pbasting can suggest the next steps. However, I noticed that the sample used in your run (SRR414989) is RNA-Seq data. McClintock expects DNA-seq data as raw read input.
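One way to catch this kind of mismatch before a multi-day run is to check the library strategy recorded for each accession. The sketch below assumes you have exported a RunInfo CSV from the SRA Run Selector or Entrez and that it contains `Run` and `LibraryStrategy` columns. Treating only `WGS` as acceptable DNA-seq input is my own simplifying assumption, and the second accession in the example is made up.

```python
import csv
import io

# ASSUMPTION: only whole-genome shotgun libraries count as suitable input.
DNA_STRATEGIES = {"WGS"}


def non_dna_runs(runinfo_csv_text):
    """Return (run, strategy) pairs whose LibraryStrategy is not DNA-seq."""
    reader = csv.DictReader(io.StringIO(runinfo_csv_text))
    flagged = []
    for row in reader:
        strategy = (row.get("LibraryStrategy") or "").strip()
        if strategy not in DNA_STRATEGIES:
            flagged.append((row.get("Run", ""), strategy))
    return flagged


# Illustrative input: SRR414989 is RNA-Seq according to this thread;
# SRR0000001 is a made-up accession used as a contrasting example.
example = (
    "Run,LibraryStrategy\n"
    "SRR414989,RNA-Seq\n"
    "SRR0000001,WGS\n"
)
print(non_dna_runs(example))
```

Any run reported by this check would need to be replaced with a genomic DNA library from the same sample before being passed to the read options.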

pbasting commented 3 years ago

For the error in https://github.com/bergmanlab/mcclintock/issues/81#issuecomment-770633267, I believe I know the cause and it should be resolved with my most recent mcclintock commit: https://github.com/bergmanlab/mcclintock/commit/043bffde879077d61057cb6b59866f6b4e8d8a99
Updating your mcclintock installation to the newest version should solve this issue.
```
cd <path_to_mcclintock_repo>
git pull
```

bergmanlab / mcclintock

Question #81