J35P312 / TIDDIT

TIDDIT - structural variant calling
Other
10 stars 0 forks source link

TIDDIT in single-cell DNA-seq #14

Open FrAoJm opened 2 years ago

FrAoJm commented 2 years ago

Hi @J35P312 I am trying to identify SV (translocations mainly) in patients with know translocations. I am setting up the pipeline with a patient with t(14:18) and I have a single-cell bam file (2000 cells) which I split by read-groups (cell ident), getting 2000 individual bam files.

I took one individual bam (corresponding to an individual cell seq) to test your tool and even though I got the expected results derived for the low coverage and probably other calculations of your algorithm, I have been able to detect the translocation!!

CHROM=chr14 | POS=106329985 | ID= SV_2_1 | REF=  N | ALT= N[chr18:60793513[ | Qual= . | Filter=Ploidy | INFO= SVTYPE=BND;CIPOS=0,0;CIEND=0,0;COVA=0.0;COVB=39.7400016784668;LFA=13;LFB=13;LTE=13;OR=0,0,0,13;ORSR=0,0;QUALA=0;QUALB=60 | FORMAT= GT:CN:DV:RV:DR:RR | AACAACCTAGTGATGTGC-1=1/1:.:13:0:0,75:0,0

I was thinking in to use your tool to loop across all the 2000 bams, merge them, and extra info from it, at least, to know which cells carry each translocation (mainly the t(14:18) ).

Also, How can I interpret the translocation breakpoints and the ALT, because probably is not getting the mate break pair in the chr18.

What do you think about these results and the best way to tune your tool to squeeze my data in optimal conditions.

In addition, I got more output files than the described in your user guide; output.ploidy.tab, output.sample.bam(3.5GB), output.signals.tab, output.vcf, and output.wig

Thank you so much!!!

J35P312 commented 2 years ago

Hello! That sounds like a cool project!

Nice to see that you found the translocation, I see that the coverage is rather deep "COVB=39.7400016784668;", and the breakpoint is well supported by 13 fragments "LFA=13;LFB=13;LTE=13"; there are no split reads, maybe the breakpoint is a bit repetitive, or are the reads shorter than 151 bp?

By default, TIDDIT is tuned for high sensitivity; if you want higher precision, you may set the minimum number of pairs to a higher value (I usually use 6). Otherwise, I recomend you to use the default settings.

I use IGV to interpret the translocation breakpoint; there I look if the breakpoint is clean (i.e if the discordant pairs map to only one unique region, or are spread across the genome), if there are any split reads, and if there are small CNV events at the breakpoints. Feel free to post a picture if you want any advice =P.

output.sample.bam(3.5GB) and output.signals.tab are temporary files, I would delete those. output.signals.tab might be interesting, it contains the raw, unclustered signals (split reads and fragments).

Good luck!

FrAoJm commented 2 years ago

Hi! It seems the loop worked nicely! so I have 2000 vcf's :=). I removed in the loop the temporary files to save space (before reading your comment regarding the .signals.tab, I think for now I will work only with the vcf, but I will keep those also to further understand the files and the outputs.

I really appreciate your explanations since I am completely new to DNA-seq in general (I always have worked with transcripts), and I get a bit slow and clumsy regarding the genomic argot and the meaning of each field.

I took one cell-bam file and its correspondent vcf to check the t(14:18) with ALT field= N[chr18:60793513[ into IGV to check, so, of course, I will be happy about any comment you can provide to me. I guess is not feasible to check all the bams, cell by cell, on IGV: N chr18:60793513

In my case, single-cells, do you think it will be useful to merge all the vcf and do it after the annotation or annotate individually?, also, you recommended using vep or snpeff (also NIRVANA), but, which one do you think will suit better for my type of data?

Sorry for bothering you with so many questions (maybe is not the right place to either...)...

Best regards!

FrAoJm commented 2 years ago

Hi again! I am parsing the vcf's (TIDDIT output) in R to make some kind of matrices to analyse cell by cell the genotypes and I found the following problem, some files have different tag in 10th column and the RG group of the bam file. Not completely sure if the problem is at the GATK bam split per RG, or if is in the TIDDIT. This is the vcf of one of these confictive files:

##ALT=<ID=DEL,Description="Deletion">         
##ALT=<ID=DUP,Description="Duplication">         
##ALT=<ID=TDUP,Description="Tandem duplication">         
##ALT=<ID=INV,Description="Inversion">         
##ALT=<ID=INS,Description="Insertion">         
##ALT=<ID=BND,Description="Break end">         
##contig=<ID=chr1,length=249250621>         
##contig=<ID=chr2,length=243199373>         
##contig=<ID=chr3,length=198022430>         
##contig=<ID=chr4,length=191154276>         
##contig=<ID=chr5,length=180915260>         
##contig=<ID=chr6,length=171115067>         
##contig=<ID=chr7,length=159138663>         
##contig=<ID=chr8,length=146364022>         
##contig=<ID=chr9,length=141213431>         
##contig=<ID=chr10,length=135534747>         
##contig=<ID=chr11,length=135006516>         
##contig=<ID=chr12,length=133851895>         
##contig=<ID=chr13,length=115169878>         
##contig=<ID=chr14,length=107349540>         
##contig=<ID=chr15,length=102531392>         
##contig=<ID=chr16,length=90354753>         
##contig=<ID=chr17,length=81195210>         
##contig=<ID=chr18,length=78077248>         
##contig=<ID=chr19,length=59128983>         
##contig=<ID=chr20,length=63025520>         
##contig=<ID=chr21,length=48129895>         
##contig=<ID=chr22,length=51304566>         
##contig=<ID=chrM,length=16571>         
##contig=<ID=chrX,length=155270560>         
##contig=<ID=chrY,length=59373566>         
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">         
##INFO=<ID=END,Number=1,Type=Integer,Description="End of an intra-chromosomal variant">         
##INFO=<ID=SVLEN,Number=.,Type=Integer,Description="Difference in length between REF and ALT alleles">         
##INFO=<ID=LFA,Number=1,Type=Integer,Description="Links from window A">         
##INFO=<ID=LFB,Number=1,Type=Integer,Description="Links from window B">         
##INFO=<ID=LTE,Number=1,Type=Integer,Description="Links to event">         
##INFO=<ID=COVA,Number=1,Type=Float,Description="Coverage on window A">         
##INFO=<ID=COVM,Number=1,Type=Float,Description="The coverage between A and B">         
##INFO=<ID=COVB,Number=1,Type=Float,Description="Coverage on window B">         
##INFO=<ID=OR,Number=4,Type=Integer,Description="Orientation of the pairs (FF,RR,RF,FR)">         
##INFO=<ID=ORSR,Number=2,Type=Integer,Description="Orientation of the split reads (inverted,normal)">         
##INFO=<ID=CIPOS,Number=2,Type=Integer,Description="Confidence interval around POS for imprecise variants">         
##INFO=<ID=CIEND,Number=2,Type=Integer,Description="Confidence interval around END for imprecise variants">         
##INFO=<ID=QUALA,Number=1,Type=Float,Description="The average mapping quality of the reads in window A">         
##INFO=<ID=QUALB,Number=1,Type=Float,Description="The average mapping quality of the reads in window B">         
##FILTER=<ID=BelowExpectedLinks,Description="The number of links or reads between A and B is too small">         
##FILTER=<ID=FewLinks,Description="Unexpectedly low fraction of discordant reads betwen A and B">         
##FILTER=<ID=UnexpectedCoverage,Description="The coverage of the window on chromosome B or A is higher than 4*average coverage">         
##FILTER=<ID=Smear,Description="Window A and Window B overlap">         
##FILTER=<ID=RegionalQ,Description="The mapping quality of the region is lower than the user set limit">         
##FILTER=<ID=MinSize,Description="The variant is smaller than the user set limit">         
##FILTER=<ID=Ploidy,Description="Intrachromosomal variant on a chromosome having 0 ploidy">         
##FILTER=<ID=SplitsVSDiscs,Description="large variant supported mainly by split reads (and not discorant pairs) ">         
##FILTER=<ID=Density,Description="The discordant reads cluster too tightly">         
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">         
##FORMAT=<ID=CN,Number=1,Type=Integer,Description="Copy number genotype for imprecise events">         
##FORMAT=<ID=DV,Number=1,Type=Integer,Description="Number of paired-ends that support the event">         
##FORMAT=<ID=RV,Number=1,Type=Integer,Description="Number of split reads that support the event">         
##FORMAT=<ID=DR,Number=2,Type=Integer,Description="Number of paired-ends that supporting the reference allele (breakpoint A, and B)">         
##FORMAT=<ID=RR,Number=2,Type=Integer,Description="Number of reads supporting the reference allele (breakpoint A, and B)">         
##LibraryStats=TIDDIT-2.12.1 Coverage=0 ReadLength=97 MeanInsertSize=239 STDInsertSize=101 Orientation=innie         
##TIDDITcmd="/home/usr/anaconda3/envs/tiddit/bin/tiddit --sv -o AGTAACGTAGTATTACTG-1 --bam ./splits/202105271_Dx_DNA_PROT.tube1.cells.AGTAACGTAGTATTACTG-1.bam"         
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT AACAACCTAGTGATGTGC-1

As you can see, the bam is AGTAACGTAGTATTACTG and the colname in the vcf is AACAACCTAGTGATGTGC.

Any thoughts? Thank you in advance!!!

FrAoJm commented 2 years ago

Hi! I came back to this project...:=) I was thinking that maybe the problem is that each single-cell bam has in the header all the barcodes. Or how is Tiddit estimating the last vcf column? is getting it from the reads or reading it from @RG?

I would appreciate any help on this issue...

Best regards!