fritzsedlazeck / Sniffles

Structural variation caller using third generation sequencing
Other
545 stars 91 forks source link

Help to understand results. #87

Closed ptranvan closed 6 years ago

ptranvan commented 6 years ago

Hi,

I am interested more specifically in inversion. After using ngmlr + Siffles (with default options), I got some result that I am not sure how to interpret it :

Scaffold03      370304  28      N       <INV>   .       PASS    PRECISE;SVMETHOD=Snifflesv1.0.8;CHR2=Scaffold03;END=10595729;ZMW=26;STD_quant_start=0.462910;STD_quant_stop=0.000000;Kurtosis_quant_start=5.688923;Kurtosis_quant_stop=2.928172;SVTYPE=INV;SUPTYPE=SR;SVLEN=10225425;STRANDS=++;RE=29   GT:DR:DV        ./.:.:29

Scaffold03      592017  41      N       <INV>   .       PASS    PRECISE;SVMETHOD=Snifflesv1.0.8;CHR2=Scaffold03;END=8483936;ZMW=25;STD_quant_start=7.527727;STD_quant_stop=5.582711;Kurtosis_quant_start=-0.900672;Kurtosis_quant_stop=-0.888777;SVTYPE=INV;SUPTYPE=SR;SVLEN=7891919;STRANDS=++;RE=25   GT:DR:DV        ./.:.:25

If I get the results:

Now I don't understand why the 2nd inversion is reported since it is comprise in the first one ? (looking at coordinates) or I missed something ?.

fritzsedlazeck commented 6 years ago

Hi, Apparently from the alignments there are two INV supported. However, this can indicate a more complex event or just an artifact. E.g. it could be that a segment between these two e.g. start breakpoints jumped and was inverted to the end points.

More likely its an artifact. We are constantly working to improve things. What polyploidy does that organism has and what coverage do you have. Plus are these ONT or Pacbio reads?

From the algorithm point its two distinct events as they may overlap but dont share the same breakpoints. This can be for example in cancer or other higher polyploid regions.

I hope that helps. Fritz

ptranvan commented 6 years ago

So the organism is haploid but there are 2 main haplotypes.

So basically I have the genome of the haplotype 1 and I mapped my 100X PacBio reads (from haplotype 2) against the haplotype 1 genome.

(Note that these reads come from a pool of multiple individuals (for amount of DNA).)

Considering these facts, do you have other recommendation for analysis ?

And how can I detect artefact ?

Thanks a lot for your support.

fritzsedlazeck commented 6 years ago

So the variants are supported by 29 and 25 reads. If you say you have put together multiple animals and have 100x pacbio sequencing that actually could make sense. Maybe its a (highly speculative) polymorphic region meaning that the INV is different between samples. I published something like that for duplications on yeast: https://www.nature.com/articles/ncomms14061

The ZMW calls are also looking positive. Usually an artifact would have worse STD_quant_start or STD_quant_stop. These indicate the variability of the breakpoints supported by each read. Another indication is if ZMW is low and RE is larger.

I hope that helps Fritz

ptranvan commented 6 years ago

1) So the arrtefact would be the first one (with STD_quant_start=0.462910;STD_quant_stop=0.000000;) or both ?

2) In that case, do you think it's better to take the corrected reads (by CANU) as input ?

fritzsedlazeck commented 6 years ago

If the second one. Sorry a higher SD (standard deviation) means that the reads are less agreeing. However, I am actually wondering if that is not maybe a signal... Its really hard to say.

I would check the sequencing complexity or similarity between these regions. If they are not similar this also increases the likelihood that those are true.

Ad2. Its worth a try if you want to . This would give you another confidence.

You have a very interesting signal here if it is not produced by a mapping result. Cheers Fritz

ptranvan commented 6 years ago

Hi Fritz,

1) Something I don't understand about the 1st result:

Scaffold03 370304 28 N <INV> . PASS PRECISE;SVMETHOD=Snifflesv1.0.8 CHR2=Scaffold03;END=10595729;ZMW=26;STD_quant_start=0.462910;STD_quant_stop=0.000000 Kurtosis_quant_start=5.688923;Kurtosis_quant_stop=2.928172;SVTYPE=INV;SUPTYPE=SR SVLEN=10225425;STRANDS=++;RE=29 GT:DR:DV ./.:.:29

How is it possible to detect a large variant like this (10 Mbp) with only 29 reads ? I mean, with an average size of 20Kb let's say, even 29 reads aligned next to each other cannot form a sequence of > 1Mb. Or I missed the way how an inversion is computed ?

2) So I have used the corrected reads after CANU (that should be a consensus), and it seems that I get less "weird" results. For instance:

Scaffold03      1933768 93      N       <INV>   .       PASS    PRECISE;SVMETHOD=Snifflesv1.0.8;CHR2=Scaffold03;END=3732394;ZMW=98;STD_quant_start=0.905539;STD_quant_stop=1.503330;Kurtosis_quant_start=0.326961;Kurtosis_quant_stop=0.369925;SVTYPE=INV;SUPTYPE=SR;SVLEN=1798626;STRANDS=++;RE=100    GT:DR:DV        ./.:.:100

Scaffold03      1933996 98      N       <INV>   .       PASS    PRECISE;SVMETHOD=Snifflesv1.0.8;CHR2=Scaffold03;END=3732163;ZMW=93;STD_quant_start=0.000000;STD_quant_stop=0.000000;Kurtosis_quant_start=4.538854;Kurtosis_quant_stop=1.488539;SVTYPE=INV;SUPTYPE=SR;SVLEN=1798167;STRANDS=++;RE=93     GT:DR:DV        ./.:.:93

a) Could it be possible that those 2 are the same inversion at the end but the first one is an artefact (because of the STD) ??

b) The second one have a STD start and stop = 0, which is a confident result no ?

c) Is there any other filters I can do ? for instance discarding the "IMPRECISE" and increasing the read support treshold (since I have a high coverage) ?

3) I got also some inversions with "STRANDS=--", what does it means?

Thanks a lot.

fritzsedlazeck commented 6 years ago

Hi, at 1: Ah I see. so for SV detection we are looking at split reads (ie. reads were parts are mapped to different locations). In this case you have 29 reads that are split and mapped to different location and in an inverted direction. Does that makes sense? Supplement Figure 2.4 illustrates it.

At 2: a) Yes that can be. With the cleaner reads the alignment seems to concile. Maybe these regions are repeats of some form and without the heavy error rate look more alike.

b)Yes this indicates more agreement on the breakpoints from the individual reads. Just to mention these SD numbers can be inflated in repetitive regions so dont go for a hard threshold to kick them all out.

c) Yes you can do that. Another way is to run things with -genotype that gives you frequencies of the reads supporting the alleles. IMPRECISE reflects the SD.

  1. The STRANDS are helpful for validating things. -- means that you should design a primer left of the first breakpoint going to the left and another on the second breakpoint going to the left. This way if the INV is true you should get a product.

I hope that helps Cheers Fritz

fritzsedlazeck commented 6 years ago

Sorry, still a bit early here. At 3 : primers right of the breakpoints pointing to the left. For both. I sketched it out (attached) sv_eval

fritzsedlazeck commented 6 years ago

Did that help? All questions answered? I will close this issue if that is ok. Thanks Fritz

ptranvan commented 6 years ago

Thanks Fritz, I will dig more about the results.

For now, I don't have more questions. ( I will probably have later ).

fritzsedlazeck commented 6 years ago

Ok thanks. Just let me know. Cheers Fritz

ptranvan commented 6 years ago

Hi Fritz,

I have re-read your paper and I got some questions:

1) Why there are space between supplementary alignments ? (figure 2)

http://image.noelshack.com/fichiers/2018/37/4/1536859376-capture-d-ecran-2018-09-13-a-19-23-17.png

2) Figure 5: I don't understand how you see an INVDUP. Which one is it and can you explain please ?

http://image.noelshack.com/fichiers/2018/37/4/1536859273-capture-d-ecran-2018-09-13-a-18-54-08.png

3) Do you have the same kind of figure for simple inversion ?

Thanks a lot !

fritzsedlazeck commented 6 years ago

Hi, ad 1. There are actually some smaller deletions. We did not want to discuss that in full since it just complicates the story behind the figure. As you can see these are just <10bp...

ad 2. Its figure 5c. We are showing one side of it the triplication/ invudp extends towards the right.

ad 3. I dont have one at hand. This one is from twitter: image

Cheers Fritz

ptranvan commented 6 years ago

Thanks.

I have found something a bit weird: Two inversions that start at the same location but end differntly.

They have both a STD start and stop to 0. Only the Kurtosis score is different and one is called with more read (not a lot: 13 vs 12)

Scaffold03      3575954 2       N       <INV>   .       PASS    PRECISE;SVMETHOD=Snifflesv1.0.8;CHR2=Scaffold03;END=7234322;ZMW=12;STD_quant_start=0.0
00000;STD_quant_stop=0.000000;Kurtosis_quant_start=-nan;Kurtosis_quant_stop=4.904758;SVTYPE=INV;SUPTYPE=SR;SVLEN=3658368;STRANDS=--;RE=12;AF=1    GT:D
R:DV        1/1:0:12

Scaffold03      3575954 3       N       <INV>   .       PASS    PRECISE;SVMETHOD=Snifflesv1.0.8;CHR2=Scaffold03;END=6077948;ZMW=13;STD_quant_start=0.0
00000;STD_quant_stop=0.000000;Kurtosis_quant_start=4.196109;Kurtosis_quant_stop=-nan;SVTYPE=INV;SUPTYPE=SR;SVLEN=2501994;STRANDS=--;RE=13;AF=1    GT:D
R:DV        1/1:0:13

1) What is your interpretation about this ?

2) I am not familiar with the Kurtosis metric, how can it help me to distinguish potential candidate and what does mean a score at -nan (I found this also on some other inversions)

Cheers,

Patrick

fritzsedlazeck commented 6 years ago

Mh. It is either a biased (eg sequence complexity) or is your organism diploid?

Kurtosis is a measurement if the shape of the distribution is normal or uniform distributed. However, it would require many more reads to be robust for that measurement. See the comparison in the supplement of the nmethod paper.

The genotypes look homozygous but this can be wrong. Sniffles only store the reference supporting reads to measure the gebotype and not the overall coverage. Thus overlapping events can easily look like homozygous.

The difference in size is quite big, maybe select the region around the end breakpoints (eg. Using bedtools) and align them to each other.

I hope this helps. Cheers Fritz