Mesh89 / TranSurVeyor

A transposition caller.

About output interpretation #3

kaanokay closed 3 years ago

kaanokay commented 3 years ago

Hi,

First, I would like to thank you for this nice tool.

I am wondering about the meaning of a sentence on the GitHub page.

You wrote about a transposition event "the example transposition is predicted to be inserted into chromosome 5, between 135114483 and 135114528, and the predicted inserted sequence is chr12:59936709-59937336."

Does this sentence mean that the transposable insertion in chromosome 5 (chr5:135114483-135114528) originated from chromosome 12 (chr12:59936709-59937336) in the same genome, or vice versa? Do both coordinate ranges describe the same transposition event? And if they do, why do the two ranges have different lengths?

I am confused. If possible, could you explain the output in detail?

Another question is about filtering by discordant reads. Lines with the same ID can report different numbers of discordant reads for the same transposition. For example, the first line of ID1 has 66 discordant reads, but the second line has 55. If I apply a threshold on discordant reads (e.g., >60), the ID1 line with 55 discordant reads is removed, but the ID1 line with 66 is kept. Is this a correct way to filter? How should I filter transposition events by discordant reads? Should the filter be applied to each line separately? In other words, if the second line (55 discordant reads) is removed, should the first line of the same ID (66 discordant reads) still be kept?

I look forward to your responses.

Best wishes.

Mesh89 commented 3 years ago

Hi,

It means that (according to TranSurVeyor), sequence chr12:59936709-59937336 was inserted in chr5, in a position between 135114483 and 135114528. So there is an insertion in chr5.
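As a rough illustration of why the two ranges differ in length, the sketch below works directly from the coordinates quoted above. The function name and the returned field names are hypothetical, and the spans are computed as end minus start because the thread does not say whether the coordinates are inclusive.

```python
def describe_transposition(site_chrom, site_start, site_end,
                           src_chrom, src_start, src_end):
    """Summarise one predicted transposition (hypothetical helper).

    The site range is the window in which the insertion is predicted to lie;
    the source range is the sequence predicted to have been inserted.
    """
    return {
        "insertion_site": f"{site_chrom}:{site_start}-{site_end}",
        "insertion_window_bp": site_end - site_start,
        "inserted_sequence": f"{src_chrom}:{src_start}-{src_end}",
        "inserted_span_bp": src_end - src_start,
    }


example = describe_transposition("chr5", 135114483, 135114528,
                                 "chr12", 59936709, 59937336)
print(example["insertion_window_bp"])  # 45
print(example["inserted_span_bp"])     # 627
```

The first number is the width of the window that contains the insertion position in chr5, while the second is the span of the chr12 sequence that was inserted, so there is no reason for them to match.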

For the second question, I don't think there is a definitive answer. It depends on what you want to accomplish. I recommend using the output from ./filter. Because the number of discordant pairs depends on the coverage of the region, it uses the ratio between the number of discordant pairs and the number of concordant pairs. If the number of concordant pairs is much higher, the call will be filtered out. If for some reason you only want to use a hard threshold on the number of discordant pairs (e.g. 60 in your example), I would keep ID=1 if either of the breakpoints has at least 60. So in the example, I would accept both breakpoints for ID=1.
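The hard-threshold rule described above (keep an ID if either of its lines reaches the threshold) could be sketched as follows. This is only an illustration: it assumes the discordant counts have already been extracted into (ID, count) pairs, one per output line, and the function name is hypothetical. It does not reproduce what ./filter does.

```python
from collections import defaultdict


def ids_passing_threshold(junctions, threshold):
    """Return the IDs for which at least one line reaches the threshold.

    junctions: iterable of (event_id, discordant_count) pairs, one per line.
    """
    best = defaultdict(int)
    for event_id, discordant in junctions:
        # Track the highest discordant count seen for each ID.
        best[event_id] = max(best[event_id], discordant)
    return {event_id for event_id, count in best.items() if count >= threshold}


# ID 1 uses the counts from the question; ID 2 is made up for contrast.
junctions = [("1", 66), ("1", 55), ("2", 40), ("2", 58)]
kept = ids_passing_threshold(junctions, 60)
print(sorted(kept))  # ['1']
```

Every line whose ID is in the returned set would then be kept, so both lines of ID 1 survive even though one of them has only 55 discordant reads.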

kaanokay commented 3 years ago

Dear @Mesh89,

Thanks for your detailed response.

I have another question about the origin of transpositions. Can the output contain BP3 breakpoints for a transposition? If BP3 breakpoints exist, then BP2 breakpoints originated from BP3, and BP1 breakpoints originated from BP2. In other words, my assumption is that BP1 and BP2 breakpoints are remnants of BP3 breakpoints. Is this assumption theoretically correct? If the output doesn't contain BP3 information, can I find it from other lines?

My last question is about transposition events whose ID appears only once. For example, ID3 has only a single line, containing BP1=F:chr15:85140909 BP2=F:chr6:133347858, and no second line with the other BP1 and BP2 breakpoints. Should such events be filtered out (ignored) in downstream analyses because the information on the transposition is incomplete? Should I focus only on complete transposition events, where two lines share the same ID?

Thanks for your interest.

Mesh89 commented 3 years ago

Sorry for the delayed response.

Every line reported represents a "junction", i.e. two locations (BP1 and BP2) that are far away on the reference but adjacent in the sample. You are right about your assumption: a transposition is described by two junctions with the same ID. When predicting a transposition, TranSurVeyor requires that two compatible junctions are found. However, some of them are removed in the filtering step. I guess these singletons are more likely to be false positives, and I would ignore them. If you think you have a single junction that is correct, you can try getting the unfiltered output with

./filter $workdir no-filter

And look for the paired junction by ID. I plan to investigate why there are many singleton junctions when I make a new version of the software.
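A minimal sketch of pairing junctions by ID is shown below. It assumes each output line is a whitespace-separated list of KEY=VALUE tokens that includes an ID token (for example "ID=3 BP1=F:chr15:85140909 BP2=F:chr6:133347858"); that layout, the helper names, and the ID=1 coordinates are assumptions made for illustration, so check them against the actual output.

```python
from collections import defaultdict


def parse_junction(line):
    """Split a line of KEY=VALUE tokens into a dict (assumed line layout)."""
    return dict(token.split("=", 1) for token in line.split() if "=" in token)


def group_by_id(lines):
    """Group parsed junction records by their ID value."""
    groups = defaultdict(list)
    for line in lines:
        record = parse_junction(line)
        if "ID" in record:
            groups[record["ID"]].append(record)
    return groups


def split_paired_and_singleton(groups):
    """Separate IDs with two or more junctions from IDs with exactly one."""
    paired = {i: g for i, g in groups.items() if len(g) >= 2}
    singleton = {i: g for i, g in groups.items() if len(g) == 1}
    return paired, singleton


lines = [
    "ID=1 BP1=F:chr1:1000 BP2=F:chr2:5000",  # hypothetical coordinates
    "ID=1 BP1=F:chr1:1040 BP2=F:chr2:5600",  # hypothetical coordinates
    "ID=3 BP1=F:chr15:85140909 BP2=F:chr6:133347858",
]
paired, singleton = split_paired_and_singleton(group_by_id(lines))
print(sorted(paired), sorted(singleton))  # ['1'] ['3']
```

Running the same grouping on the unfiltered output would show whether a singleton ID from the filtered output has a partner junction that was removed by the filter.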

I am not sure I understand your example about BP3. Do you have a figure that illustrates the example?

kaanokay commented 3 years ago

[Image: Screenshot from 2020-10-20 10-55-24]

Tubio, Jose MC, et al. "Extensive transduction of nonrepetitive DNA mediated by L1 retrotransposition in cancer genomes." Science 345.6196 (2014).

These figures show transposable element insertions from one genomic region into others. For example, a transposable element located at 22q12 jumped to other chromosomes. TranSurVeyor reports such events, but I am wondering whether a transposable element that originated from the 22q12 region and was inserted elsewhere can then jump again to a third genomic region. Is such a third jumping event available in the output of TranSurVeyor? In that scenario, BP3 breakpoints would represent the 22q12 transposable element, BP2 breakpoints the second inserted copy (as in the figure above), and BP1 breakpoints the last inserted copy.

I hope that this was more understandable.

Thanks for your interest.

Mesh89 commented 3 years ago

I am not sure I understand correctly, but essentially your scenario is that a single transposable element jumped to multiple locations?

In that case, they are considered different transpositions. Each transposition, for TranSurVeyor, is a separate insertion. If 22q12 inserts into 3 locations, they will be 3 different transpositions, each (hopefully) with two junction sequences describing it.

Sorry if I misunderstood your example.

kaanokay commented 3 years ago

You correctly understood my question.

Thanks for your interest and all answers.

I have finished my analysis of whole-genome data using TranSurVeyor. I hope to publish research that includes results from your tool.

Thanks again. Best wishes.

kaanokay commented 3 years ago

Hi again,

I came across a weird circumstance.

When I analyzed my whole-genome data, I noticed that chromosome Y coordinates were reported in female samples. This is biologically impossible. What is the reason for this? How should I interpret it?

Best.

Mesh89 commented 3 years ago

Hi,

It depends. If the insertion site is in chromosome Y, then it's a false positive. If the inserted sequence was identified in chromosome Y, then the call may be correct.
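The rule of thumb above could be written down as a small check for samples that should not carry a Y chromosome. This is only a sketch: the function name is hypothetical, and it assumes you have already worked out which chromosome is the insertion site and which one holds the inserted sequence for each call.

```python
def assess_chry_call(insertion_chrom, source_chrom, y_name="chrY"):
    """Label a call from a sample without a Y chromosome (hypothetical helper)."""
    if insertion_chrom == y_name:
        # An insertion site on chrY cannot exist in this sample.
        return "false positive"
    if source_chrom == y_name:
        # Only the inserted sequence was identified on chrY.
        return "may be correct"
    return "not affected"


print(assess_chry_call("chrY", "chr3"))  # false positive
print(assess_chry_call("chr3", "chrY"))  # may be correct
```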

kaanokay commented 3 years ago

Hi @Mesh89 ,

Thanks for all your answers.

Best.