Adamtaranto / teloclip

A tool for the recovery of unassembled telomeres from soft-clipped read alignments.
Other
35 stars 4 forks source link

Confusing about input and output file #20

Open cyycyj opened 10 months ago

cyycyj commented 10 months ago

Dear Adam,

It is amazing to find such a great tool for telomere anchoring! But I would like to say some expression you mentioned in the tutorial is blurred, especially the input and output file (like in.sam, out.sam and out.bam...). If you can clarify them, I would appreciate it. By the way, do you have any recommendations about citing your work, just github link or you have published it?

Thanks

Adamtaranto commented 10 months ago

Hi Andrew,

There is some minimal test data in tests/data that you can run through the example code in the readme.

In the most basic use case teloclip only requires two inputs: 1) A fasta file index .fai that tells us how long each contig is, and; 2) Some reads that have been aligned to the assembly as a .sam file.

teloclip will search the input samfile for alignments which extend past the ends of your assembly contigs and then filter these into a new output samfile.

In some of the examples I show the sam output being piped to samtools sort and stored as a bamfile. A bam file is just a binary samfile, it generally takes up less space on disk. You can easily convert between bam and sam using samtools view.

This tool has not been formally published yet. If you find it useful you can cite the git repo directly and note the release version used in your analysis.

User feedback is helpful for improving the tool, so please do let me know if you have and difficulties / successes using it.

cyycyj commented 10 months ago

Dear Adam,

Thank you for your warm help! And I will try it. By the way, I have tried to break down the command into separate steps, please have a look and check it:

Streaming SAM records from aligner

This part based on the command minimap2 -ax map-pb ref.fa pacbio.fq.gz | samtools view -h -F 0x2308 | teloclip --ref ref.fa.fai | samtools sort > out.bam

Mapping pacbio reads to ref genome by using minimap2

minimap2 -ax map-pb ref.fa pacbio.fq.gz > mapped.sam

Filtering the SAM file using samtools view to exclude specific flags

samtools view -h -F 0x2308 mapped.sam > filtered.sam

Identifying unassembled telomeric repeats in the alignments

teloclip --refIdx ref.fa.fai filtered.sam > teloclip_filtered.sam Is filtered.sam refer to in.sam in teloclip --ref ref.fa.fai in.sam ?

Identifying unassembled telomeric repeats in the alignments, and then filter it and output to a bam file

samtools sort teloclip_filtered.sam -o out.bam

Report clipped alignments containing target motifs

This part based on the command samtools view -h in.bam | teloclip --ref ref.fa.fai --motifs TTAGGG | samtools sort > out.bam

extract the header information from the BAM file

samtools view -h in.bam > header.sam confusing about in.bam. Where it is come from? Is in.bam actually the out.bam, which served as the output of command 'samtools sort teloclip_filtered.sam -o out.bam'?

identify unassembled telomeric repeats(TTAGGG), and save results in a new SAM file

teloclip --ref ref.fa.fai --motifs TTAGGG header.sam > teloclip_filtered.sam

sort the alignments by reference position, and saves the sorted data as a BAM file

samtools sort teloclip_filtered.sam > out.bam

samtools view -h in.bam | teloclip --ref ref.fa.fai --motifs TTAGGGTTAGGGTTAGGGTTAGGGTTAGGG | samtools sort > out.bam Is this an optional step?

samtools view -h in.bam | teloclip --ref ref.fa.fai --motifs TTAGGG | teloclip-extract --refIdx ref.fa.fai --extractReads --extractDir SplitOverhangs A little bit confusing about in.bam. Where it is come from? Maybe the output of samtools view -h in.bam | teloclip --ref ref.fa.fai --motifs TTAGGG | samtools sort > out.bam' ?

It is currently midnight in Melbourne. Please take a rest first. Have a good night, and thank you very much. :)

Adamtaranto commented 10 months ago

That's pretty much correct.

Streaming SAM records from aligner

Filtering the SAM file using samtools view to exclude specific flags

samtools view -h -F 0x2308 mapped.sam > filtered.sam

Note: Don't use flag 0x2308 to filter, I've updated the docs to recommend 0x100 which just excludes secondary alignments.

Filtering is optional.

Identifying unassembled telomeric repeats in the alignments

teloclip --refIdx ref.fa.fai filtered.sam > teloclip_filtered.sam Is filtered.sam refer to in.sam in teloclip --ref ref.fa.fai in.sam ?

Yes, in this example the filtered.sam is what you would use in place of in.sam

Identifying unassembled telomeric repeats in the alignments, and then filter it and output to a bam file

samtools sort teloclip_filtered.sam -o out.bam

samtools sort just sorts the overhang-read alignments (sam format) detected by teloclip and converts them to bam format.

Report clipped alignments containing target motifs

extract the header information from the BAM file

samtools view -h in.bam > header.sam confusing about in.bam. Where it is come from? Is in.bam actually the out.bam, which served as the output of command 'samtools sort teloclip_filtered.sam -o out.bam'?

The bam file here is the same as the raw sam alignments you would get from minimap2. The sam file has just been converted to bam format.

samtools view converts it back to sam format that can be read by teloclip

The -h flag tells samtools view to include the file heard lines in the output along with the alignments.

identify unassembled telomeric repeats(TTAGGG), and save results in a new SAM file

teloclip --ref ref.fa.fai --motifs TTAGGG header.sam > teloclip_filtered.sam

This is the same as before but the --motifs option tells teloclip to only report overhang-reads with the motif TTAGGG in the clipped (overhanging) part of the alignment.

Note: TTAGGG is a fungal telomeric motif. You should choose a motif specific to your species.

sort the alignments by reference position, and saves the sorted data as a BAM file

samtools sort teloclip_filtered.sam > out.bam

Yep.

samtools view -h in.bam | teloclip --ref ref.fa.fai --motifs TTAGGGTTAGGGTTAGGGTTAGGGTTAGGG | samtools sort > out.bam Is this an optional step?

This is an alternative to the previous example, the only difference is that it requires a minimum of 5 repeats of the TTAGGG motif in the clipped region to return a match.

Note: I've corrected the readme example to describe this use as well as the --nopoly option which will ignore homopolymer tracks.

samtools view -h in.bam | teloclip --ref ref.fa.fai --motifs TTAGGG | teloclip-extract --refIdx ref.fa.fai --extractReads --extractDir SplitOverhangs A little bit confusing about in.bam. Where it is come from? Maybe the output of samtools view -h in.bam | teloclip --ref ref.fa.fai --motifs TTAGGG | samtools sort > out.bam' ?

No, the bam or sam that gets passed to teloclip should always be either raw aligments from an alignment tool like minimap2 OR alignments that have been filtered with samtools view to remove low quality alignments.

cyycyj commented 10 months ago

Dear Adam,

Thanks a lot! And I have successfully get the overhangs. And I would like to ask if you have some recommendation on extending contig? Maybe I need to map the overhangs to its corresponding chromosome, and generate the bam file for IGV?

If you are interesting on it, I can provide the files

cyycyj commented 10 months ago

Oops, I have checked the overhangs, it seems that the extracted overhang reads' structure is as below:

sequence of chromosome end - telomere motif repeat - unknown sequence

I have upload all the overhang reads in google drive, how can I share it with you if you need them?

Adamtaranto commented 10 months ago

For extending the contigs/scaffolds I would do something along these lines:

1) Align the long reads to your assembly and filter for clipped-overhangs. This will give you all possible reads that might be useful for extending contigs. (Note: Use minimap setting appropriate to your data type).

minimap2 -ax map-pb ref.fa pacbio.fq.gz | teloclip --ref ref.fa.fai | samtools sort > all_overhangs.bam

2) Inspect the clipped alignments in IGV. (You might need to index the sorted bamfile first.)

In this step you are specifically checking to see:
- How well the reads are anchored on the ends of your contigs. You want to see a decent (subjective, maybe >500bp?) amount of the read aligned to the contig end.
- Few secondary alignments. If many contigs terminate with the same repeat or low-complexity sequence then reads may map non-specifically to many contigs ends.
- How well the aligned reads agree with each other. Both in the anchored segment and the overhang. If the overhangs all agree then you can be confident that the alignments are correct.

3) Optional: Quality filter the alignments if required.

After viewing the alignments you may decide to: 
- Filter out secondary alignments
- Re-run `teloclip` using the `--motifs` search option
- Manually exclude certain low quality or poorly anchored alignments.
# Filter previous alignments to keep only primary alignments that have at least 1 run of TTAGGGx3 in the clipped region.
samtools view -h -F 0x100 all_overhangs.bam | teloclip --ref ref.fa.fai --motifs TTAGGGTTAGGGTTAGGG | samtools sort > TTAGGGx3_overhang_primary_alignments.bam

4) Run the alignments (sam/bam) from teloclip through teloclip-extract with the --extractReads option

samtools view -h TTAGGGx3_overhang_primary_alignments.bam | teloclip-extract --refIdx ref.fa.fai --extractReads --extractDir SplitOverhangs
This will give you a fasta file for each end of each contig that has at least one clipped-overhang. 
These fasta files will contain the extracted reads with the "clipped" segment masked in lowercase characters.

5) Manually extend the contigs.

If the overhangs agree with each other you can just copy the longest clipped segment and append it to the end of the relevant contig. Be sure to check that you are pasting the sequence in the correct orientation depending on the end of the contig you are extending.

6) Optional: Polish the Assembly.

If you have short read data available you may want to polish the final assembly to correct any errors in the extensions.

Note: This is not required if you are using HiFi data as it is just as good as Illumina data.

Your strategy might be slightly different depending on the quality of your starting assembly (how was it created? Have you already done scaffolding with Hi-C or long-read data?) as fragmented assemblies will have many clipped-overhangs that are NOT telomeres.

You should also consider the expected telomere structure/motifs for your species. Some species do not have standard telomeric motifs.

Adamtaranto commented 10 months ago

Can you share some of the extracted reads from teloclip-extract containing the telomeric motif here?

You can also paste screenshots from IGV here.

cyycyj commented 10 months ago

Dear Adam,

No problem, please wait a moment, I have a meeting today. By the way I have made a mistake-overhanging region of each read is in lowercase

cyycyj commented 10 months ago

Dear Adam,

Please check script and overhang sequence: https://drive.google.com/drive/folders/14PMFaGdDTcj6mwMILGV7JlLVT_x8c6ap?usp=drive_link

cyycyj commented 10 months ago

Another question is about the contig end. For example, if we conduct juicebox pipeline, once we modify the end of chromosome (or "superscaffold") wrongly, there will be a situation as below:

                           Left end of reads-----telomeric motif (right end of reads)
Current chromosome end------------------------------------------Actual chromosome end

Do you have any recoomendation on it? Actually I think teloclip can be quite flexible - we can use it for contig extend or even gap closing. I believe teloclip will be a widely used tools!

Adamtaranto commented 10 months ago

Scaffolds can be validated with Hi-C data, and/or by aligning your long read data back to the scaffolded assembly (after gap-filling). In the later case, you would see a bunch of reads soft-clipped at the mis-assembly site.

In your example above, teloclip will not report the read as it does not overhang the end of a scaffold.

I expect that teloclip will work best on assemblies that have already been (correctly) scaffolded with Hi-C.

btw you need to approve access to your gdrive folder.

cyycyj commented 10 months ago

Thank you, and I apologize for forgetting to grant you access. I have updated the overhangs sequence. By the way, I noticed that you mentioned using miniasm to generate a contig for manually extending (https://github.com/Adamtaranto/teloclip/issues/10#issuecomment-1790298067). Do you have any hints on how to do that?

Adamtaranto commented 10 months ago

In this case I think your overhangs look pretty consistent with each other.

I would try this:

  1. For each scaffold select the longest overhang (lowercase segment) from the teloclip-extract reads for each scaffold end (L/R).
  2. Manually append the longest overhang segment to the relevant scaffold end in your draft assembly (check correct orientation). Keep note of how much sequence you have added to each scaffold.
  3. Re-map all the reads to this updated assembly.
  4. View alignments in IGV. Inspect points where new sequence was appended on each scaffold - do multiple reads span these sites?

It looks like there is some variation in your telomeric motifs. I see CCCTGAA as well as CCCTAAA. I guess this is a plant?

It might be worthwhile running teloclip without motifs as well to see if you pick up extra reads.

Once you have your final assembly you could run one of the tools for de novo telomeric repeat prediction.

cyycyj commented 10 months ago

Dear Adam,

You are right! I am working on a plant genome(~450Mb, het=1.29, 2n=30), and I would follow your advice. But I would like to say some longest sequence seem to have limited number of telomeric motifs(such as Hic_scaffold_10_R). Maybe we can check & polish it with IGV and racon

Adamtaranto commented 10 months ago

Most of your scaffolds with at least 1 overhang detected by teloclip have a read depth of between 1-17X at the end of the scaffold.

10R and 11L have a depth of >450X. I think the ends of these scaffolds are probably repeats and you are getting non-specific alignment. You should inspect the coverage (of all reads mapped to the whole assembly) in IGV to see if these locations just have lots of alignments. At least for 10R I don't think it is a chromosome end.

14R has >90X depth and should probably be checked too.

You could also try aligning your reads with Winnowmap which might handle repeats better than Minimap2. I haven't tested it with teloclip yet so let me know how it changes your results.

Adamtaranto commented 10 months ago

The high coverage scaffold ends might also be mitochondrial and chloroplast genomes come to think of it. You should identify and exclude those scaffolds.

cyycyj commented 8 months ago

Dear Adam,

Apologies for the late reply. I've been occupied with a complex work issue and my wedding ceremony in recent days. I will continue to try it and provide you with feedback.