EMBL-Hentze-group / htseq-clip

a toolset designed for the processing and analysis of eCLIP/iCLIP dataset
MIT License
3 stars 1 forks source link

Does *htseq-clip count* respect splicing pattern? #8

Closed piechottam closed 5 months ago

piechottam commented 1 year ago

Thank you for the tool!

I was wondering if htseq-clip createMatrix takes into account the splicing pattern of reads when counting crosslink sites?

How does the flag (0, 1, 2, 3) in htseq-clip annotation influence downstream processing in: annotation -> createSlidingWindows -> mapToId -> extract -> count? It this used for filtering at any step?

Distue commented 1 year ago

Hi Michael,

Thank you for writing and your interest in htseq-clip.

When extracting the truncation sites (crosslink sites), single-nucleotide positions are extracted and counted independently of the rest of the read. Therefore, we do focus on single-nucleotide resolution analysis, but the read name is still retained so that it can be looked up where this truncation came from.

In general, eCLIP reads are rather short, often between 20-50nt, which is not giving much information for splicing. CLIP data can be sparse and noisy and the detection of binding sites is rather tricky, we rather focus on more robust detection than splicing.

However, technically, one can run htseq-clip with a custom gff3 on a transcriptome alignment, then you could detect isoform specific binding sites. However, we have not tested it and unless you have excellent eCLIP data, this might be too difficult.

In case you are interested in splicing, I would do following: First, take SMI and RNA-seq data of the cell line you are working with, analyse it with a salmon/kallisto or any other pseudoaligner and create a custom gff3 file with only the selected transcripts present in the cell line. Alvis Brazma studies showed that often there are only very few isoforms present. Then run the htseq-clip/DEWSeq workflow to detect robust binding sites. The bed output of DEWSeq can be viewed in IGV along with bam files containing the read information. When viewing target binding regions you can first check if your binding sites have enough information to detect splicing events, and if so, you might write a small script quantifying those. Again, I personally would check your binding sites with IGV if those events even occur and this is worth the effort.

The flag is a bit of a legacy product. Since we flatten gene annotation, some of the exon boundaries are shared by all available isoforms, some are not. htseq-clip used to have a function which calculates the distances to each boundary and outputs for each crosslink site the distance to the 5' and 3' boundry. With the flag one could filter for distances with only "save" exon boundries and then plot a distribution of crosslink sites to exon (or intron) boundries. If the community is interested, we might enable that feature again, but currently, it is not used in the standard workflow.

What we use for filtering sometimes is the max crosslink sites on one single-nucleotide position (height) along with the crosslink site count parameters. This can be effective for filtering out windows with just random background, when your protein displays a binding behaviour with crosslink sites being more stacked (enriched over a small area vs enriched over a larger area).

If you have any questions, please let us know, we are happy to help.

Best, To

On Fri, Dec 9, 2022 at 2:22 PM Michael Piechotta @.***> wrote:

Thank you for the tool!

I was wondering if htseq-clip createMatrix takes into account the splicing pattern of reads when counting crosslink sites?

How does the flag (0, 1, 2, 3) in htseq-clip annotation influence downstream processing in: annotation -> createSlidingWindows -> mapToId -> extract -> count? It this used for filtering at any step?

— Reply to this email directly, view it on GitHub https://github.com/EMBL-Hentze-group/htseq-clip/issues/8, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAG2FZG74NXYCVDXBMR6JJDWMMXCTANCNFSM6AAAAAASZK642Q . You are receiving this because you are subscribed to this thread.Message ID: @.***>