Illumina / paragraph

Graph realignment tools for structural variants
Other
150 stars 28 forks source link

Compatibility of paragraph and manta #42

Closed KamilSJaron closed 4 years ago

KamilSJaron commented 4 years ago

Hello, thanks a lot for making paragraph!

I am just figuring out how it works now, so I started with just taking one .vcf file generated by manta, and I used the exact same .bam file to genotype the variants I called (just to see the consistency of manta and paragraph), but I got back an error

Exception: Distance between vcf position and chrom start is smaller than read length.

tried to dig a bit, but the lines of code were not very indicative of why manta does not have troubles to call variants close to the scaffold edges and paragraph does. I removed all variants that started < 150 bases from the start of scaffolds and restarted genotyping and now it seems it runs.

So, I wanted to ask. What is the point? Why it is not possible to genotype SVs close borders? And would it be worth making manta and paragraph compatible?

Thanks

background

I have a bunch of Illumina reseq data (1 reference, 5 reseq individuals) with reasonable coverage (~60x ref, ~15x reseq). I have a non-model species, i.e. without a good library of SVs, but I still think that genotyping individuals is by far smarter idea than just merging SV calls. I am just figuring out what is the best way to create a library of SVs out of SV calls that I will feed to Paragraph to get the same data genotyped on the pool variants I found in the population.

I was thinking before about using SURVIVOR, but the merging does not explicitly resolve the sequences of SVs (discussed here), now I am thinking about just pasting .vcf files of all 6 individuals while filtering out only the exact overlaps. Not sure what is the best approach here, any input welcome.

KamilSJaron commented 4 years ago

Related, Manta calls inversions using two BND entries. Will paragraph genotype it as inversions?

I renamed the question to fit both queries

KamilSJaron commented 4 years ago

Hi there, sorry for the thread, got two more errors.

One about unresloved reported by manta (ALV values have a placeholder <INS> and left and right insertion sequences are pasted in tags) This I temporarily resolved by filtering out unresolved insertions.

The second is Illegal character in ALT allele: [scaf000202:631916[N in the breakpoint corresponding to an inversion, so I think it answers my question about compatibility with Manta's strange way of reporting inversions - nope, they are not going to be genotyped. Manta provides scripts to transform the .vcf file to more INV type of inversion notations, so I should be able to resolve this one by my own (though, again, direct non-tweaked compatibility of the two would be really nice)

traxexx commented 4 years ago

Hi Kamil,

You have quite a few questions and here I try to answer them in one comment. Let me know if I miss anything!

  1. Why it is not possible to genotype SVs close borders? Paragraph works by aligning reads to a local SV graph (see our paper main figure 1). That graph needs two flanking nodes for reads to start/end. At the chromosome borders, the flanking is not long enough to effectively align a read across the breakpoint. And that's the error message "Distance between vcf position and chrom start is smaller than read length". Manta uses a assembly-based method on linear reference. It is not limited by whether the variant is near the border or not. But I personally will treat border variant cautiously.

  2. Genotype approach for your data (1 ref, 5 reseq) I have very limited experience about non-model species. Maybe worth try the genotyping approach and check the result, if that doesn't take weeks. One thing to note is that as the reported breakpoint becomes more deviated from the true breakpoint, Paragraph recall will drop (see our paper main figure 3).

  3. BND entries BND is not supported for now. In human data, a significant number of BNDs are imprecisely called other variants (e.g. large SVs, CNVs). For now, it can be dangerous to simply graph-genotype a single breakpoint. The best approach should be to call these variants in their correct form and than genotype them as a whole.

  4. Unresolved insertions Yes as you have seen, Manta (and other SV callers) may report SVs with only sequences around breakpoints. It can be very difficult to assemble the full sequence of a long novel insertion. Paragraph genotypes SVs using reads around breakpoints. I can add a feature in v2.5 to enable genotyping of these insertions (with LEFTINSSEQ and RIGHTINSESEQ both present), if that is helpful.

  5. Illegal character in ALT allele: [scaf000202:631916[N This is a BND. As mentioned in Q2, I would suggest to skip them for now.

KamilSJaron commented 4 years ago

Cool, thanks a lot for a quick and clear response.

One thing to note is that as the reported breakpoint becomes more deviated from the true breakpoint, Paragraph recall will drop (see our paper main figure 3).

I had this worry before, but manually inspecting the calls showed that loads of the SV calls in different individuals have actually exactly the same borders, suggesting that at least some of SVs have well-resolved borders.

Anyway, it will take a bit of exploring on my side now. I will let you know how it went.

KamilSJaron commented 4 years ago

Hello again,

I am looking at this manta conversion script and I started to turn it into convertManta2Paragraph_compatible_vcf.py that would have following modifications

I am actually not sure if all this will yield in success, I am testing § in parallel.

So, my question is, is this script of any interest to § or manta developers? If yes, I am happy to do a PR.

More generally, are you interested to tune the tools for non-model datasets? Wonder how much effort I should make to share my progress/findings...

KamilSJaron commented 4 years ago

Hello again,

I really hope this spamming is not too annoying. I am just actively trying to make paragraph work on my data and every time I find it relevant to you, I get back here.

This time I got an error I don't actually understand

Traceback (most recent call last):
  File "/ceph/users/kjaron/bin/multigrmpy.py", line 353, in <module>
    main()
  File "/ceph/users/kjaron/bin/multigrmpy.py", line 349, in main
    run(args)
  File "/ceph/users/kjaron/bin/multigrmpy.py", line 315, in run
    subprocess.check_call(commandline, shell=True, stderr=subprocess.STDOUT)
  File "/ceph/users/kjaron/.conda/envs/default_genomics/lib/python3.7/subprocess.py", line 347, in check
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '/ceph/users/kjaron/bin/grmpy --response-file=/scratch/kjaron/3_T

Not even sure where to start to look for the cause...

KamilSJaron commented 4 years ago

The command:

multigrmpy.py -i $VARIANTS -m $SAMPLES -r $REF -o data/genotyping/3_Tce_ind00_genotyping --scratch-dir /scratch/kjaron/3_Tce_ind00_genotyping --threads 32

The error:

2020-03-13 14:04:26,641 ERROR    Traceback (most recent call last):
2020-03-13 14:04:26,720 ERROR      File "/ceph/users/kjaron/bin/multigrmpy.py", line 315, in run    subprocess.check_call(commandline, shell=True, stderr=subprocess.STDOUT)
2020-03-13 14:04:26,720 ERROR      File "/ceph/users/kjaron/.conda/envs/default_genomics/lib/python3.7/subprocess.py", line 347, in check_call    raise CalledProcessError(retcode, cmd)
2020-03-13 14:04:26,720 ERROR    subprocess.CalledProcessError: Command '/ceph/users/kjaron/bin/grmpy --response-file=/scratch/kjaron/3_Tce_ind00_genotyping/tmpe_7jmue4.txt' returned non-zero exit status 139.
Traceback (most recent call last):
  File "/ceph/users/kjaron/bin/multigrmpy.py", line 353, in <module>
    main()
  File "/ceph/users/kjaron/bin/multigrmpy.py", line 349, in main
    run(args)
  File "/ceph/users/kjaron/bin/multigrmpy.py", line 315, in run
    subprocess.check_call(commandline, shell=True, stderr=subprocess.STDOUT)
  File "/ceph/users/kjaron/.conda/envs/default_genomics/lib/python3.7/subprocess.py", line 347, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '/ceph/users/kjaron/bin/grmpy --response-file=/scratch/kjaron/3_Tce_ind00_genotyping/tmpe_7jmue4.txt' returned non-zero exit status 139.
(END)
traxexx commented 4 years ago

Interesting it's very rare to see an error message from the genotyper without any further details. Will it be ok to share your input VCF?

KamilSJaron commented 4 years ago

The .vcf file is _diploidSV.vcf output from manta with following modifications:

# copy the header
grep "^##" > Tms_00_manta_diploidSV_corrected.vcf
# filter the rest
cat diploidSV_inversions.vcf | grep -v "^##" | awk '{if ( $2 > 150 ){ print $0 } }' | grep -v "<INS>" | grep -v "MantaBND" | grep -v "514459" >> Tms_00_manta_diploidSV_corrected.vcf

Here is the vcf file: Tms_00_manta_diploidSV_corrected.vcf.zip

traxexx commented 4 years ago

<DUP:TANDEM> looks like a new symbolic allele from Manta output. Before it was <DUP>. Will doing this work for your case:

KamilSJaron commented 4 years ago

Sorry for a delayed answer, it took me a while to get everything sorted out.

I followed your advice. I adjusted the Manta script for converting BND to INV and added all the other adjustments you suggested, the script is available here. Furthermore, I added a flag to separate SVs to INV/INS/DEL/DUP, to speed up the testing.

Then, I run in parallel all four SV types, but I ended up with the problem reported abobe https://github.com/Illumina/paragraph/issues/42#issuecomment-602268327 for three of them (INS/DUP/DEL), while INV had another issue again complaining about the formating (conversion to JSON). I will comment now only on the Segmentation fault of INS/DUP/DEL.

I found that although that's all that is in log, there are some temporary files created. There are a bunch of .txt files and .json files. The one the was mentioned as problematic in INS log file had first two lines like this

 -r data/final_references/3_Tms_b3v08.fasta.gz -m data/genotyping/Tms_samples.txt -o data/genotyping/3_Tce_ind00_genotyping_INS/genotypes.json.gz -z -t 32 --log-level=warning --log-file data/genotyping/3_Tce_ind00_genotyping_INS/grmpy.log --log-async no -g
/scratch/kjaron/3_Tce_ind00_genotyping_INS/tmp2s380f0p.json
...

followed with plenty of other lines *.json lines. So I checked the first mentioned json file (pasted at full bellow). When I look at the corresponding vcf file entry, it looks quite sane, therefore I am not sure what is going on here.

My apologies for another long post. I am trying to provide as many details as possible.

The vcf entry:

3_Tms_b3v08_scaf000003  274919  MantaINS:186:0:0:0:0:0  G       GCTTATAGCTCAAAAACTTTTGGTTCTATAAAATATATAGATATATATTTT     343     PASS    END=274919;SVTYPE=INS;SVLEN=50;CIGAR=1M50I;CIPOS=0,35

The json file:

            "sequences": [
                "MantaINS:186:0:0:0:0:0:0",
                "REF"
            ],
            "name": "ref-3_Tms_b3v08_scaf000003:274769-274919_ref-3_Tms_b3v08_scaf000003:274920-275069"
        },
        {
            "from": "3_Tms_b3v08_scaf000003:274920-274919:CTTATAGCTCAAAAACTTTTGGTTCTATAAAATATATAGATATATATTTT",
            "to": "ref-3_Tms_b3v08_scaf000003:274920-275069",
            "sequences": [
                "MantaINS:186:0:0:0:0:0:1"
            ],
            "name": "3_Tms_b3v08_scaf000003:274920-274919:CTTATAGCTCAAAAACTTTTGGTTCTATAAAATATATAGATATATATTTT_ref-3_Tms_b3v08_scaf000003:274920-275069"
        },
        {
            "from": "ref-3_Tms_b3v08_scaf000003:274920-275069",
            "to": "sink",
            "name": "ref-3_Tms_b3v08_scaf000003:274920-275069_sink"
        }
    ],
    "paths": [
        {
            "nodes": [
                "ref-3_Tms_b3v08_scaf000003:274769-274919",
                "ref-3_Tms_b3v08_scaf000003:274920-275069"
            ],
            "path_id": "REF|1",
            "sequence": "REF"
        },
        {
            "nodes": [
                "ref-3_Tms_b3v08_scaf000003:274769-274919",
                "3_Tms_b3v08_scaf000003:274920-274919:CTTATAGCTCAAAAACTTTTGGTTCTATAAAATATATAGATATATATTTT",
                "ref-3_Tms_b3v08_scaf000003:274920-275069"
            ],
            "path_id": "ALT|1",
            "sequence": "ALT"
        }
    ],
    "target_regions": [
        "3_Tms_b3v08_scaf000003:274769-274919",
        "3_Tms_b3v08_scaf000003:274920-275069"
    ],
    "sequencenames": [
        "ALT",
        "MantaINS:186:0:0:0:0:0:0",
        "MantaINS:186:0:0:0:0:0:1",
        "REF"
    ],
    "model_name": "Graph from /scratch/39854.1.main.q/tmpx1hr0e87.vcf.gz",
    "ID": "diploidSV_corrected_decomposed_INS.vcf@ac6635a6efcae13bd11a33819c08f5e37ae5295ba456ff76662bb903ca0b36f0:1"
}

--- edit ---

I tried to dig a bit deeper. So I managed to turn on the info log, to zoom better to place where the program fails, and I get:

grmpy -r data/final_references/3_Tms_b3v08.fasta.gz -m data/genotyping/Tms_samples.txt -o data/genotyping/3_Tms_ind00_genotyping_DUP/genotypes.json.gz -z -t 16 --log-level=info -g /scratch/kjaron/paragrapg_interactive_testing/temp/3_Tms_ind00_genotyping_DUP/tmpfqrnj4at.json
no --response-file
[2020-04-04 17:03:47.555] [Genotyping] [28390] [info] Reference path: data/final_references/3_Tms_b3v08.fasta.gz
[2020-04-04 17:03:47.555] [Genotyping] [28390] [info] Graph spec: /scratch/kjaron/paragrapg_interactive_testing/temp/3_Tms_ind00_genotyping_DUP/tmpfqrnj4at.json
[2020-04-04 17:03:47.555] [Genotyping] [28390] [info] Manifest path: data/genotyping/Tms_samples.txt
[2020-04-04 17:03:47.555] [Genotyping] [28390] [info] Output file path: data/genotyping/3_Tms_ind00_genotyping_DUP/genotypes.json.gz
[2020-04-04 17:03:47.555] [Genotyping] [28390] [info] Aligning for 1 graphs
[2020-04-04 17:03:47.556] [Genotyping] [28390] [critical] Starting alignment for sample Tms_00 (1/6)
[2020-04-04 17:03:47.982] [Genotyping] [28390] [info] Loading parameters for sample Tms_00 graph /scratch/kjaron/paragrapg_interactive_testing/temp/3_Tms_ind00_genotyping_DUP/tmpfqrnj4at.json
[2020-04-04 17:03:47.982] [Genotyping] [28405] [critical] Starting alignment for sample Tms_00 (1/6)
[2020-04-04 17:03:47.982] [Genotyping] [28390] [info] Done loading parameters
[2020-04-04 17:03:47.982] [Genotyping] [28390] [info] [Retrieving for region 3_Tms_b3v08_scaf000003:921740-921889.]
[2020-04-04 17:03:48.057] [Genotyping] [28390] [info] [Retrieved 24 + 0 additional reads]
[2020-04-04 17:03:48.057] [Genotyping] [28390] [info] [Retrieving for region 3_Tms_b3v08_scaf000003:921890-922039.]
[2020-04-04 17:03:48.059] [Genotyping] [28390] [info] [Retrieved 40 + 0 additional reads]
[2020-04-04 17:03:48.059] [Genotyping] [28390] [info] [Retrieving for region 3_Tms_b3v08_scaf000003:936837-936986.]
[2020-04-04 17:03:48.061] [Genotyping] [28390] [info] [Retrieved 37 + 0 additional reads]
[2020-04-04 17:03:48.061] [Genotyping] [28390] [info] [Retrieving for region 3_Tms_b3v08_scaf000003:936987-937136.]
[2020-04-04 17:03:48.062] [Genotyping] [28390] [info] [Retrieved 22 + 0 additional reads]
Segmentation fault (core dumped)
KamilSJaron commented 4 years ago

I think figured it out. Paragraph can not handle zipped fasta files.

KamilSJaron commented 4 years ago

Sorry for the silence. I think I have now managed to get everything working.

This script takes the manta vcf output file (the diploidSV.vcf) and generates a paragraph compatible file while reporting the filtering stats (not all SVs are "genotypable"). If you would be interested in having the script as part of this repo, let me know I can do a PR.

I also found what non-INS SVs with the SVINSSEQ tag are. These are variants that are a combination of something and insertion (for instance when there is an insertion on a side of inversion): It's briefly discussed here: https://github.com/Illumina/manta/issues/158 Do I get right that paragraph has no capacity to genotype such variants?

I am closing the issue as I believe I finally managed to make manta and paragraph compatible. Thanks for your help, it was truly essential.