WGLab / LinkedSV

MIT License
21 stars 8 forks source link

No translocations called #20

Open seismicon opened 3 years ago

seismicon commented 3 years ago

Hello,

I am running LinkedSV on a 10x linked-read WGS dataset of a highly rearranged human cell line. The alignment was done previously with the 10x Longranger pipeline. LinkedSV ran through, the results seemed fine (I got all the bedpe files, 1401 large SVs, and the images), but for some reason no translocations were called. I know for sure from a standard WGS analysis and other sources, that there are interchromosomal translocations in that sample.

I used the most recent LinkedSV version with python 2.7, the same hg19 reference genome file that was used for the Longranger alignment, and the following parameters (I substituted all the long directory names with "dir"):

linkedsv.py -i dir/phased_possorted_bam.bam -d dir -r dir/genome.fa -v hg19 -t 8 --somatic_mode

There were some errors concerning reference sequences in standard output, but I don't know if they are related to the problem:

dir/LinkedSV/scripts/../fermikit/fermi.kit/fermi2 simplify -CSo 76 -m 114 -T 71 dir/region_000001/region_000001.all_hap.pre.gz 2> dir/region_000001/region_000001.all_hap.mag.gz.log | gzip -1 > dir/region_000001/region_000001.all_hap.mag.gz [04/29/2021 18:00:25 (6.820 GB)] cd dir/region_000001 && perl dir/LinkedSV/scripts/../fermikit/fermi.kit/run-calling -t 1 dir/region_000001/region_000001.fasta dir/region_000001/region_000001.all_hap.mag.gz | sh

ERROR: can't find the reference sequences gzip: dir/region_000002/region_000002.all_hap.flt.vcf.gz: No such file or directory ERROR: can't find the reference sequences gzip: dir/region_000002/region_000002.all_hap.sv.vcf.gz: No such file or directory

and so on ... the same error comes for a few regions.

But still, the program went on and called deletions, inversion and duplications (they seem fine and match previous WGS analyses), only translocations were missing. I have no idea where to start and look for the problem, so I would be really grateful for your help!

Many thanks, Christoph

fangli80 commented 3 years ago

Hi Christoph, Could you please check the breakpoints using the Loupe software? It would be helpful if you have a Loupe screenshot. Thanks, Li

seismicon commented 3 years ago

Hi Li,

thank you for your reply and help!

In the meantime I reran LinkedSV with temporary files (--save_temp_files) and it just finished. Interestingly, in the raw sv calls in file phased_possorted_bam.bam.raw_large_svcalls.bedpe, there are lots of translocations listed. For example:

chr1 121136682 121136683 chr2 90473950 90473951 TRA N.A. 13 0 R_end R_end 52.794027 0.000000 0.000000 CACAAACCAGACACTT-1,AAGCAGGCAGTCCGGT-1,CTCCACACAGAGGACT-1,CTCGTACGTCAACCGC-1,TCCCTTTCAACGTTTG- chr1 121136826 121136827 chr2 91668571 91668572 TRA N.A. 27 1 R_end L_end 132.651926 0.000000 0.000000 GCCATGGGTGGAACCA-1,GTTACTTCAAGAAAGG-1,GCAACCGCACGTCTCT-1,GACAACTGTCTTGCGG-1,GGGCATCGTAGGACTG- chr1 148954314 148954315 chr2 91822197 91822198 TRA N.A. 73 3 R_end R_end 408.858051 0.000000 0.000000 AGCTGGCGTTTACGAC-1,CATATTCCAGGCTTAT-1,ACTGAACGTTGTGTTG-1,GGGCACTGTGTCCACG-1,GCGCGATTCCGATCTC- chr1 149034198 149034199 chr2 91694644 91694645 TRA N.A. 16 0 R_end R_end 68.880765 0.000000 0.000000 CCTGTACAGCATCCGC-1,TCGTAGAGTCTGCAAT-1,ATCACCCGTGGATAAT-1,GAACCCGGTGAGCTCC-1,GTAGCTAAGTGGTACG- chr1 178249795 178249796 chr2 158704130 158704131 TRA N.A. 579 0 L_end L_end 3762.643289 0.000000 0.000000 GGCGGTTTCTCAGCGG-1,ACAATGCAGGTCTCCG-1,AGCTGATAGAAACGCC-1,CACGACGTCCCAAGTA-1,TGGACGCTCGAGTGGA- chr1 178297772 178297773 chr2 158670958 158670959 TRA N.A. 931 103 R_end R_end 6557.129526 0.000000 0.000000 AGGTGCCTCTATCGCC-1,GGCGGTTTCTCAGCGG-1,TAAGCGTAGCCGACCT-1,CACGACGTCCCAAGTA-1,GTGCAGCTCCTGACGG- chr1 186089231 186089232 chr2 158645391 158645392 TRA N.A. 922 0 R_end L_end 6183.744625 0.000000 0.000000 GGCGGTTTCTCAGCGG-1,TAAGCGTAGCCGACCT-1,GTGCAGCTCCTGACGG-1,TTGTCTATCTTCGGTC-1,TTTACTGAGACTTCCA- chr1 186089189 186089190 chr2 158704130 158704131 TRA N.A. 139 0 R_end L_end 809.519499 0.000000 0.000000 GGCGGTTTCTCAGCGG-1,TCGCTTGCAAGTGAGC-1,ATATCGGTCGACAGCC-1,CCTGGTTCATGTTGCA-1,GCTATAGAGACACGAC- chr1 142701193 142701194 chr4 49249836 49249837 TRA N.A. 53 0 R_end L_end 280.102661 0.000000 0.000000 TAGAAGAAGAGGGCGA-1,CAACGTACAAAGGCTG-1,TGTCTCGCAAGGACTG-1,GTAGATCGTAGGTCGA-1,CCATCCAGTATGCGGA- chr1 142701207 142701208 chr4 49177201 49177202 TRA N.A. 19 0 R_end R_end 85.205640 0.000000 0.000000 ACTGGATTCTAGCAGT-1,TTCGAAGGTCTCTTTA-1,TAAGCGTAGAGCGCTA-1,TGTCTCGCAAGGACTG-1,ACGAGCCAGAGAGCTC- chr1 142730953 142730954 chr4 49248107 49248108 TRA N.A. 50 0 R_end L_end 262.394413 0.000000 0.000000 ATCGAGTGTAGTAGGC-1,GTTCGGGTCGTTCGGG-1,ATGGGAGAGTGGCGAT-1,AGAATCCCATGTGCCG-1,GACGTTACACGAAAGC- chr1 170849709 170849710 chr4 11755137 11755138 TRA N.A. 610 81 L_end L_end 4221.442586 0.000000 0.000000 CACCAAAAGGAGTGTC-1,CACGTAATCACTTTAC-1,GTGTCTCTCCCTGAGG-1,CATCAGAGTGGTAAGC-1,CTCGGGATCTGGTGGC-

.... and so on .... this is just a random set of the raw translocations calls (I also cut the listed barcodes, which would be too much), but I also saw a lot of translocations in there that made sense and match the other WGS data. Maybe it's a problem with filtering of sv calls? Unfortunately, I have no idea what exactly the columns 13-15 mean. Are they relevant for filtering? Is it somewhat possible to easily retrieve a good set of translocations from the temporary file?

I haven't used Loupe in a while and it currently gets stuck opening my files on my mac (maybe the old Loupe from 2016 has problem with newer mac os version?). Anyway, I'll try to find a workaround, but maybe in the meantime, I hope the results from the temporary files also help to clear things up.

I would be really grateful for any hint on how I could proceed from here or how I can interpret the temporary data.

Best, Christoph

seismicon commented 3 years ago

Hi Li,

I kept looking into the temporary files and the source code and I have a question / wild guess:

Are columns 14 and 15 of the raw sv call file "phased_possorted_bam.bam.raw_large_svcalls.bedpe" the "dbo-scores" and does this represent the significance of the barcode overlaps between breakpoints? I saw, that this is used to filter translocations. Sorry, I did not mention that a lot my breakpoints are lying in highly amplified regions and can have a relatively low allelic frequency. I wasn't sure this might be relevant for the problem. But now I'm thinking, could this mess up the "dbo-scores"? I noticed, that a lot of my translocations in phased_possorted_bam.bam.raw_large_svcalls.bedpe are highly covered, with a lot of barcodes in column 16, but the columns 14 and 15 (the "dbo-scores"?) are 0. Can I still trust these calls, if they have a lot of these barcodes in column 16?

Not sure if anything of that is correct, but maybe it can help a little bit to clear things up?

Best, Christoph

fangli80 commented 3 years ago

You're right. columns 14 and 15 are the dbo-scores. LinkedSV has different filters. A call may be filtered out if it is in the black list, or lies in a extremely high coverage region, or low dbo scores. Would you mind sharing the bam file with me and I can figure out why the dbo scores are 0? If your data is confidential, you can email to fangli2718@gmail.com

Thanks, Li

fangli80 commented 3 years ago

@seismicon Hi Christoph, I didn't hear from you. Did you solve the problem or do you still need help?

Best, Li

seismicon commented 3 years ago

Hi Li,

I am very sorry for my late response. It is still an issue I could not really solve. I noticed, however, that the rearrangements from the temporary file are already quite good when filtering manually by the barcode support only and thereby "skipping" the dbo-score. Is that reasonable? Anyway, since it's cell line data, I see no problem in sharing the alignment. However, it is over 30x coverage, so it's quite large. I plan to upload it to ENA soon anyway. So, would it be ok if I come back to you once the data is available online?

Thanks a lot for your help and patience! Best, Christoph

fangli80 commented 3 years ago

Sure. Thanks for letting me know the bug!