Adamtaranto / teloclip

A tool for the recovery of unassembled telomeres from soft-clipped read alignments.
Other
35 stars 4 forks source link

Execution of Teloclip and input output streams #26

Open ufaroooq opened 3 months ago

ufaroooq commented 3 months ago

Dear Adam,

I hope you are doing well. I just got to know about this tool to find telomers in assembled genomes. I Am trying to run this on one of my assembled genomes but the documentation in readme file is confusing me. I also checked a previously opened issue #20 by @cyycyj but stat also increased some confusions so I will try to put all confusions in one go so yu can help. As from the Documentation I see 3 major steps to follow

Step1: First index the reference assembly

samtools faidx ref.fa

Step2: Streaming SAM records from aligner

minimap2 -ax map-ont ref.fa nanopore.fq | samtools view -h -F 0x100 | teloclip --ref ref.fa.fai | samtools sort > teloclip_overhangs.bam

Step by step as mentioned by @cyycyj in #20

1. minimap2 -ax map-ont ref.fa pacbio.fq.gz > mapped.sam
2. samtools view -h -F 0x100 mapped.sam > FILTERED.SAM
3. teloclip --refIdx ref.fa.fai filtered.sam > teloclip_filtered.sam
5. samtools sort teloclip_filtered.sam -o TELOCLIP_FILTERED.BAM

Step 3: Report clipped alignments containing target motifs

samtools view -h in.bam | teloclip --ref ref.fa.fai --motifs TTAGGG | samtools sort > teloclip_TTAGGG_overhangs.bam

Step by step as mentioned by @cyycyj in #20

1. samtools view -h in.bam > header.sam    #convert bam to sam including header information
2. teloclip --ref ref.fa.fai --motifs TTAGGG header.sam > teloclip_TTAGGG_overhangs.sam
3. samtools sort teloclip_TTAGGG_overhangs.sam > teloclip_TTAGGG_overhangs.bam

QUESTION: This in.bam is causig confusion. as @cyycyj had asked already in #20

QUESTION regarding **Matching noisy target motifs*** when using the --fuzzy option with teloclip, i get error teloclip: error: unrecognized arguments: --fuzzy

Step 4: Extract clipped reads

samtools view -h in.bam | teloclip --ref ref.fa.fai --motifs TTAGGG | teloclip-extract --refIdx ref.fa.fai --extractReads --extractDir SplitOverhangs

QUESTION: here the in.bam will be which of the following ?

Major Confusions 1 in #20 you answered as

" the bam or sam that gets passed to teloclip should always be either raw aligments from an alignment tool like minimap2 OR alignments that have been filtered with samtools view to remove low quality alignments."

This is point of confusion, So in both step 2,3,4 the in.bam file should be the either the raw alignment file mapped.sam or filteres alignment file FILTERED.SAM which are generated in step2. can you please shed some light on this.

Major Confusions 2

all these steps should be executed to run teloclip ??

Step1: samtools faidx ref.fa
Step2: minimap2 -ax map-ont ref.fa nanopore.fq | samtools view -h -F 0x100 | teloclip --ref ref.fa.fai | samtools sort > teloclip_overhangs.bam
Step3: samtools view -h in.bam | teloclip --ref ref.fa.fai --motifs TTAGGG | samtools sort > teloclip_TTAGGG_overhangs.bam
Step4: samtools view -h in.bam | teloclip --ref ref.fa.fai --motifs TTAGGG | teloclip-extract --refIdx ref.fa.fai --extractReads --extractDir SplitOverhangs

OR step 2 3 4 are just different ways to run teloclip and can be combined into one step as below ?

Step1: samtools faidx ref.fa
Step2: minimap2 -ax map-ont ref.fa nanopore.fq | samtools view -h -F 0x100 | teloclip --ref ref.fa.fai --motifs TTAGGG | teloclip-extract --refIdx ref.fa.fai --extractReads --extractDir SplitOverhangs

If you are able to help with the mwntioned questions it will be great help. Best regards

Adamtaranto commented 3 months ago

In response to your last question, yes these are all just different ways to chain the steps together.

In your example:

# Index the draft genome assembly
samtools faidx ref.fa

# Map ONT reads to reference with minimap2
# Filter out secondary alignments with samtools
# Find reads that overhang contig ends AND contain TTAGGG motif with teloclip
# Write the overhang reads to fasta files with teloclip-extract
minimap2 -ax map-ont ref.fa nanopore.fq | samtools view -h -F 0x100 | teloclip --ref ref.fa.fai --motifs TTAGGG | teloclip-extract --refIdx ref.fa.fai --extractReads --extractDir SplitOverhangs

This will create a dir called SplitOverhangs that has one fasta file of overhang reads for each of your contigs/chromosomes. The overhanging section of each read will be in lower case. You can manually select the best alignment (long high identity anchor on the contig end + longest run of telomeric repeats in the overhang) for each contig and append the overhang.

Chaining all the steps together like this means that you do not write any of the intermediate files to disk.

However, I generally recommend that you run the initial teloclip step with and without the --motifs filter and manually inspect the overhanging alignments in IGV.

You will see that there are different kinds of overhangs depending on how complete your assembly is.

Re the --fuzzy option this is still in development and will replace the --noPoly option. Ignore it for now.

Please use the stable release that is on PyPi via pip install teloclip

ufaroooq commented 3 months ago

Dear Adam,

Thank you for your detailed responce. It clerifies alot of things.