Closed: KamilSJaron closed this issue 4 years ago
Related: Manta calls inversions using two BND entries. Will Paragraph genotype them as inversions?
I renamed the question to fit both queries.
Hi there, sorry for the long thread, I got two more errors.
One is about unresolved insertions reported by Manta (the ALT values have a placeholder <INS> and the left and right insertion sequences are stored in INFO tags). I temporarily resolved this by filtering out the unresolved insertions.
The second is Illegal character in ALT allele: [scaf000202:631916[N
in a breakend corresponding to an inversion, so I think it answers my question about compatibility with Manta's unusual way of reporting inversions: no, they are not going to be genotyped. Manta provides a script to transform the .vcf file to the more common <INV> notation, so I should be able to resolve this one on my own (though, again, direct, non-tweaked compatibility of the two tools would be really nice).
Hi Kamil,
You have quite a few questions, and here I will try to answer them all in one comment. Let me know if I missed anything!
Why is it not possible to genotype SVs close to chromosome borders? Paragraph works by aligning reads to a local SV graph (see main figure 1 of our paper). That graph needs two flanking nodes for reads to start and end in. At the chromosome borders, the flanking sequence is not long enough to effectively align a read across the breakpoint, hence the error message "Distance between vcf position and chrom start is smaller than read length". Manta uses an assembly-based method on a linear reference, so it is not limited by whether the variant is near the border or not. But I personally would treat border variants cautiously.
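That distance check can be reproduced up front with a small pre-filter, so the run does not abort. A minimal sketch assuming a 150 bp read length (plain string parsing; the records and names are illustrative, nothing here is Paragraph's API):

```python
# Sketch: drop VCF records whose start is closer to the chromosome start
# than one read length, mirroring the "Distance between vcf position and
# chrom start is smaller than read length" error. READ_LEN is assumed.
READ_LEN = 150

def filter_border_variants(vcf_lines, read_len=READ_LEN):
    """Keep header lines and records with POS > read_len."""
    kept = []
    for line in vcf_lines:
        if line.startswith("#"):
            kept.append(line)              # headers pass through
            continue
        pos = int(line.split("\t")[1])     # POS is the 2nd VCF column
        if pos > read_len:
            kept.append(line)
    return kept

records = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "scaf1\t42\tnear_start\tA\t<DEL>\t.\tPASS\tSVTYPE=DEL;END=90",
    "scaf1\t5000\tfar_enough\tA\t<DEL>\t.\tPASS\tSVTYPE=DEL;END=5400",
]
kept = filter_border_variants(records)     # drops only "near_start"
```

A stricter version would also check the distance to the chromosome end, but the error above only mentions the chromosome start.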
Genotyping approach for your data (1 ref, 5 reseq) I have very limited experience with non-model species. It may be worth trying the genotyping approach and checking the result, if that doesn't take weeks. One thing to note is that as the reported breakpoint deviates further from the true breakpoint, Paragraph's recall will drop (see main figure 3 of our paper).
BND entries BND is not supported for now. In human data, a significant number of BNDs are imprecisely called other variants (e.g. large SVs, CNVs). For now, it can be dangerous to simply graph-genotype a single breakpoint. The best approach is to call these variants in their correct form and then genotype them as a whole.
Unresolved insertions Yes, as you have seen, Manta (and other SV callers) may report SVs with only the sequences around breakpoints. It can be very difficult to assemble the full sequence of a long novel insertion. Paragraph genotypes SVs using reads around breakpoints, so I can add a feature in v2.5 to enable genotyping of these insertions (with LEFTINSSEQ and RIGHTINSSEQ both present), if that is helpful.
Illegal character in ALT allele: [scaf000202:631916[N
This is a BND. As mentioned in Q2, I would suggest skipping them for now.
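For anyone following along, breakends are easy to detect and drop before genotyping. A minimal sketch (the bracketed ALT notation and the MantaBND ID prefix come from this thread; the helper function itself is hypothetical):

```python
# Sketch: detect breakend (BND) records so they can be skipped before
# genotyping. A BND carries SVTYPE=BND in INFO and/or a bracketed ALT
# such as [scaf000202:631916[N (VCF breakend notation).
def is_bnd(record_line):
    fields = record_line.split("\t")
    alt, info = fields[4], fields[7]
    return "SVTYPE=BND" in info or "[" in alt or "]" in alt

bnd = "scaf1\t100\tMantaBND:0:0:1\tN\t[scaf000202:631916[N\t.\tPASS\tSVTYPE=BND"
dele = "scaf1\t500\tMantaDEL:1:0:0\tA\t<DEL>\t.\tPASS\tSVTYPE=DEL;END=900"
```

Checking the ALT brackets as well as SVTYPE catches records where the INFO field has been rewritten but the breakend allele remains.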
Cool, thanks a lot for a quick and clear response.
One thing to note is that as the reported breakpoint deviates further from the true breakpoint, Paragraph's recall will drop (see main figure 3 of our paper).
I had this worry before, but manually inspecting the calls showed that many of the SV calls in different individuals actually have exactly the same borders, suggesting that at least some of the SVs have well-resolved breakpoints.
Anyway, it will take a bit of exploring on my side now. I will let you know how it went.
Hello again,
I am looking at this manta conversion script and I started turning it into convertManta2Paragraph_compatible_vcf.py, which would have the following modifications.
I am actually not sure whether all of this will succeed; I am testing § in parallel.
So, my question is, is this script of any interest to § or manta developers? If yes, I am happy to do a PR.
More generally, are you interested in tuning the tools for non-model datasets? I wonder how much effort I should put into sharing my progress/findings...
Hello again,
I really hope this spamming is not too annoying. I am just actively trying to make Paragraph work on my data, and every time I find something relevant to you, I come back here.
This time I got an error I don't actually understand:
Traceback (most recent call last):
File "/ceph/users/kjaron/bin/multigrmpy.py", line 353, in <module>
main()
File "/ceph/users/kjaron/bin/multigrmpy.py", line 349, in main
run(args)
File "/ceph/users/kjaron/bin/multigrmpy.py", line 315, in run
subprocess.check_call(commandline, shell=True, stderr=subprocess.STDOUT)
File "/ceph/users/kjaron/.conda/envs/default_genomics/lib/python3.7/subprocess.py", line 347, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '/ceph/users/kjaron/bin/grmpy --response-file=/scratch/kjaron/3_T
Not even sure where to start to look for the cause...
The command:
multigrmpy.py -i $VARIANTS -m $SAMPLES -r $REF -o data/genotyping/3_Tce_ind00_genotyping --scratch-dir /scratch/kjaron/3_Tce_ind00_genotyping --threads 32
The error:
2020-03-13 14:04:26,641 ERROR Traceback (most recent call last):
2020-03-13 14:04:26,720 ERROR File "/ceph/users/kjaron/bin/multigrmpy.py", line 315, in run subprocess.check_call(commandline, shell=True, stderr=subprocess.STDOUT)
2020-03-13 14:04:26,720 ERROR File "/ceph/users/kjaron/.conda/envs/default_genomics/lib/python3.7/subprocess.py", line 347, in check_call raise CalledProcessError(retcode, cmd)
2020-03-13 14:04:26,720 ERROR subprocess.CalledProcessError: Command '/ceph/users/kjaron/bin/grmpy --response-file=/scratch/kjaron/3_Tce_ind00_genotyping/tmpe_7jmue4.txt' returned non-zero exit status 139.
Traceback (most recent call last):
File "/ceph/users/kjaron/bin/multigrmpy.py", line 353, in <module>
main()
File "/ceph/users/kjaron/bin/multigrmpy.py", line 349, in main
run(args)
File "/ceph/users/kjaron/bin/multigrmpy.py", line 315, in run
subprocess.check_call(commandline, shell=True, stderr=subprocess.STDOUT)
File "/ceph/users/kjaron/.conda/envs/default_genomics/lib/python3.7/subprocess.py", line 347, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '/ceph/users/kjaron/bin/grmpy --response-file=/scratch/kjaron/3_Tce_ind00_genotyping/tmpe_7jmue4.txt' returned non-zero exit status 139.
Interesting, it's very rare to see an error message from the genotyper without any further details. Would it be OK to share your input VCF?
The .vcf file is the _diploidSV.vcf output from manta with the following modifications:
- inversions converted with the convertInversion.py script
- variants within 150 bases of scaffold starts removed
- unresolved insertions removed (<INS> tag)
- breakends removed ("MantaBND")
- the record at position 514459 removed
# copy the header
grep "^##" diploidSV_inversions.vcf > Tms_00_manta_diploidSV_corrected.vcf
# filter the rest
cat diploidSV_inversions.vcf | grep -v "^##" | awk '{if ( $2 > 150 ){ print $0 } }' | grep -v "<INS>" | grep -v "MantaBND" | grep -v "514459" >> Tms_00_manta_diploidSV_corrected.vcf
Here is the vcf file: Tms_00_manta_diploidSV_corrected.vcf.zip
<DUP:TANDEM> looks like a new symbolic allele in the Manta output; before it was <DUP>.
Will doing this work for your case:
- skip the record MantaINV:25051:0:1:0:0:0
- replace <DUP:TANDEM> with <DUP>
Sorry for the delayed answer, it took me a while to get everything sorted out.
I followed your advice. I adjusted the Manta script for converting BND to INV and added all the other adjustments you suggested; the script is available here. Furthermore, I added a flag to separate the SVs into INV/INS/DEL/DUP, to speed up the testing.
Then I ran all four SV types in parallel, but I ended up with the problem reported above (https://github.com/Illumina/paragraph/issues/42#issuecomment-602268327) for three of them (INS/DUP/DEL), while INV had another issue, again complaining about the formatting (conversion to JSON). I will comment now only on the segmentation fault of INS/DUP/DEL.
I found that although that's all there is in the log, some temporary files are created: a bunch of .txt files and .json files. The one that was mentioned as problematic in the INS log file had its first two lines like this
-r data/final_references/3_Tms_b3v08.fasta.gz -m data/genotyping/Tms_samples.txt -o data/genotyping/3_Tce_ind00_genotyping_INS/genotypes.json.gz -z -t 32 --log-level=warning --log-file data/genotyping/3_Tce_ind00_genotyping_INS/grmpy.log --log-async no -g
/scratch/kjaron/3_Tce_ind00_genotyping_INS/tmp2s380f0p.json
...
followed by plenty of other *.json lines. So I checked the first mentioned .json file (pasted in full below). When I look at the corresponding vcf entry, it looks quite sane, so I am not sure what is going on here.
My apologies for another long post. I am trying to provide as many details as possible.
The vcf entry:
3_Tms_b3v08_scaf000003 274919 MantaINS:186:0:0:0:0:0 G GCTTATAGCTCAAAAACTTTTGGTTCTATAAAATATATAGATATATATTTT 343 PASS END=274919;SVTYPE=INS;SVLEN=50;CIGAR=1M50I;CIPOS=0,35
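For reference, the entry follows the standard VCF insertion convention: REF is a single padding base and ALT is that base followed by the inserted sequence, so the record is internally consistent with SVLEN=50. A quick sanity check (plain string parsing, stdlib only):

```python
# Sketch: verify the Manta INS record is internally consistent, i.e.
# ALT = padding base + inserted sequence, and len(inserted) == SVLEN.
record = ("3_Tms_b3v08_scaf000003\t274919\tMantaINS:186:0:0:0:0:0\tG\t"
          "GCTTATAGCTCAAAAACTTTTGGTTCTATAAAATATATAGATATATATTTT\t343\tPASS\t"
          "END=274919;SVTYPE=INS;SVLEN=50;CIGAR=1M50I;CIPOS=0,35")

fields = record.split("\t")
ref, alt, info = fields[3], fields[4], fields[7]
info_map = dict(kv.split("=") for kv in info.split(";"))
svlen = int(info_map["SVLEN"])
inserted = alt[len(ref):]      # the sequence inserted after the padding base

assert alt.startswith(ref)     # ALT carries the REF padding base
assert len(inserted) == svlen  # matches SVLEN=50 and CIGAR 1M50I
```

The record passing this check supports the suspicion that the segfault is not caused by the variant itself.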
The json file:
"sequences": [
"MantaINS:186:0:0:0:0:0:0",
"REF"
],
"name": "ref-3_Tms_b3v08_scaf000003:274769-274919_ref-3_Tms_b3v08_scaf000003:274920-275069"
},
{
"from": "3_Tms_b3v08_scaf000003:274920-274919:CTTATAGCTCAAAAACTTTTGGTTCTATAAAATATATAGATATATATTTT",
"to": "ref-3_Tms_b3v08_scaf000003:274920-275069",
"sequences": [
"MantaINS:186:0:0:0:0:0:1"
],
"name": "3_Tms_b3v08_scaf000003:274920-274919:CTTATAGCTCAAAAACTTTTGGTTCTATAAAATATATAGATATATATTTT_ref-3_Tms_b3v08_scaf000003:274920-275069"
},
{
"from": "ref-3_Tms_b3v08_scaf000003:274920-275069",
"to": "sink",
"name": "ref-3_Tms_b3v08_scaf000003:274920-275069_sink"
}
],
"paths": [
{
"nodes": [
"ref-3_Tms_b3v08_scaf000003:274769-274919",
"ref-3_Tms_b3v08_scaf000003:274920-275069"
],
"path_id": "REF|1",
"sequence": "REF"
},
{
"nodes": [
"ref-3_Tms_b3v08_scaf000003:274769-274919",
"3_Tms_b3v08_scaf000003:274920-274919:CTTATAGCTCAAAAACTTTTGGTTCTATAAAATATATAGATATATATTTT",
"ref-3_Tms_b3v08_scaf000003:274920-275069"
],
"path_id": "ALT|1",
"sequence": "ALT"
}
],
"target_regions": [
"3_Tms_b3v08_scaf000003:274769-274919",
"3_Tms_b3v08_scaf000003:274920-275069"
],
"sequencenames": [
"ALT",
"MantaINS:186:0:0:0:0:0:0",
"MantaINS:186:0:0:0:0:0:1",
"REF"
],
"model_name": "Graph from /scratch/39854.1.main.q/tmpx1hr0e87.vcf.gz",
"ID": "diploidSV_corrected_decomposed_INS.vcf@ac6635a6efcae13bd11a33819c08f5e37ae5295ba456ff76662bb903ca0b36f0:1"
}
--- edit ---
I tried to dig a bit deeper. I managed to turn on the info log to zoom in on the place where the program fails, and I get:
grmpy -r data/final_references/3_Tms_b3v08.fasta.gz -m data/genotyping/Tms_samples.txt -o data/genotyping/3_Tms_ind00_genotyping_DUP/genotypes.json.gz -z -t 16 --log-level=info -g /scratch/kjaron/paragrapg_interactive_testing/temp/3_Tms_ind00_genotyping_DUP/tmpfqrnj4at.json
no --response-file
[2020-04-04 17:03:47.555] [Genotyping] [28390] [info] Reference path: data/final_references/3_Tms_b3v08.fasta.gz
[2020-04-04 17:03:47.555] [Genotyping] [28390] [info] Graph spec: /scratch/kjaron/paragrapg_interactive_testing/temp/3_Tms_ind00_genotyping_DUP/tmpfqrnj4at.json
[2020-04-04 17:03:47.555] [Genotyping] [28390] [info] Manifest path: data/genotyping/Tms_samples.txt
[2020-04-04 17:03:47.555] [Genotyping] [28390] [info] Output file path: data/genotyping/3_Tms_ind00_genotyping_DUP/genotypes.json.gz
[2020-04-04 17:03:47.555] [Genotyping] [28390] [info] Aligning for 1 graphs
[2020-04-04 17:03:47.556] [Genotyping] [28390] [critical] Starting alignment for sample Tms_00 (1/6)
[2020-04-04 17:03:47.982] [Genotyping] [28390] [info] Loading parameters for sample Tms_00 graph /scratch/kjaron/paragrapg_interactive_testing/temp/3_Tms_ind00_genotyping_DUP/tmpfqrnj4at.json
[2020-04-04 17:03:47.982] [Genotyping] [28405] [critical] Starting alignment for sample Tms_00 (1/6)
[2020-04-04 17:03:47.982] [Genotyping] [28390] [info] Done loading parameters
[2020-04-04 17:03:47.982] [Genotyping] [28390] [info] [Retrieving for region 3_Tms_b3v08_scaf000003:921740-921889.]
[2020-04-04 17:03:48.057] [Genotyping] [28390] [info] [Retrieved 24 + 0 additional reads]
[2020-04-04 17:03:48.057] [Genotyping] [28390] [info] [Retrieving for region 3_Tms_b3v08_scaf000003:921890-922039.]
[2020-04-04 17:03:48.059] [Genotyping] [28390] [info] [Retrieved 40 + 0 additional reads]
[2020-04-04 17:03:48.059] [Genotyping] [28390] [info] [Retrieving for region 3_Tms_b3v08_scaf000003:936837-936986.]
[2020-04-04 17:03:48.061] [Genotyping] [28390] [info] [Retrieved 37 + 0 additional reads]
[2020-04-04 17:03:48.061] [Genotyping] [28390] [info] [Retrieving for region 3_Tms_b3v08_scaf000003:936987-937136.]
[2020-04-04 17:03:48.062] [Genotyping] [28390] [info] [Retrieved 22 + 0 additional reads]
Segmentation fault (core dumped)
I think I figured it out. Paragraph cannot handle gzipped fasta files.
Sorry for the silence. I think I have now managed to get everything working.
This script takes the manta vcf output file (the diploidSV.vcf) and generates a paragraph-compatible file while reporting the filtering stats (not all SVs are "genotypable"). If you would be interested in having the script as part of this repo, let me know and I can do a PR.
I also found out what the non-INS SVs with the SVINSSEQ tag are. These are variants that are a combination of something else and an insertion (for instance, when there is an insertion on one side of an inversion). It's briefly discussed here: https://github.com/Illumina/manta/issues/158 Do I understand correctly that paragraph has no capacity to genotype such variants?
I am closing the issue as I believe I finally managed to make manta and paragraph compatible. Thanks for your help, it was truly essential.
Hello, thanks a lot for making paragraph!
I am just figuring out how it works now, so I started by taking one .vcf file generated by manta, and I used the exact same .bam file to genotype the variants I called (just to see the consistency of manta and paragraph), but I got back an error. I tried to dig a bit, but the lines of code were not very indicative of why manta has no trouble calling variants close to the scaffold edges while paragraph does. I removed all variants that start < 150 bases from the start of their scaffold and restarted genotyping, and now it seems to run.
So, I wanted to ask: what is the point? Why is it not possible to genotype SVs close to borders? And would it be worth making manta and paragraph compatible?
Thanks
background
I have a bunch of Illumina reseq data (1 reference, 5 reseq individuals) with reasonable coverage (~60x ref, ~15x reseq). I work on a non-model species, i.e. without a good library of SVs, but I still think that genotyping individuals is a far smarter idea than just merging SV calls. I am figuring out the best way to build a library of SVs out of the SV calls, which I will then feed to Paragraph to get the same data genotyped on the pooled variants found in the population.
I was thinking before about using SURVIVOR, but its merging does not explicitly resolve the sequences of the SVs (discussed here); now I am thinking about just pasting together the .vcf files of all 6 individuals while filtering out only the exact overlaps. Not sure what is the best approach here, any input welcome.
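The "paste and drop exact overlaps" idea can be sketched as keeping the first record per exact (CHROM, POS, REF, ALT) key across the per-individual VCFs. This is only a sketch of the idea, not a substitute for SURVIVOR; it merges calls only when breakpoints and alleles match exactly:

```python
# Sketch: concatenate per-individual VCF records, keeping only the first
# record for each exact (CHROM, POS, REF, ALT) combination. Records with
# near-identical but not exact breakpoints are NOT merged.
def merge_exact(vcf_record_lists):
    seen, merged = set(), []
    for records in vcf_record_lists:
        for line in records:
            fields = line.split("\t")
            key = (fields[0], fields[1], fields[3], fields[4])
            if key not in seen:
                seen.add(key)
                merged.append(line)
    return merged

# Illustrative records: the two DEL calls share exact breakpoints.
ind1 = ["scaf1\t100\tDEL1\tA\t<DEL>\t.\tPASS\tSVTYPE=DEL;END=400"]
ind2 = ["scaf1\t100\tDEL1b\tA\t<DEL>\t.\tPASS\tSVTYPE=DEL;END=400",
        "scaf2\t900\tINS1\tG\tGACGT\t.\tPASS\tSVTYPE=INS;SVLEN=4"]
merged = merge_exact([ind1, ind2])   # DEL kept once, INS kept
```

Keying on REF and ALT as well as the position means two different alleles at the same site both survive the merge.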