broadinstitute / picard

A set of command line tools (in Java) for manipulating high-throughput sequencing (HTS) data and formats such as SAM/BAM/CRAM and VCF.
https://broadinstitute.github.io/picard/
MIT License
984 stars 369 forks source link

LiftoverVCF does not liftover INFO/END for SVs #1552

Open cmnbroad opened 4 years ago

cmnbroad commented 4 years ago

Originally reported in the GATK repo:

Affected tool(s) or class(es) GATK LiftoverVcf

Affected version(s) gatk/4.1.7.0

Description The LiftoverVcf generates the following error. The error occurs with SVs where the INFO/END is not also lifted over. This results in INFO/END before the site start position which triggers the error.

Using GATK jar /share/pkg.7/gatk/4.1.7.0/install/bin/gatk-package-4.1.7.0-local.jar Running: java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /share/pkg.7/gatk/4.1.7.0/install/bin/gatk-package-4.1.7.0-local.jar LiftoverVcf -I b37/HG002_SVs_Tier1_v0.6.vcf.gz -O b38/HG002_SVs_Tier1_v0.6.hg38.vcf.gz -CHAIN grch37_to_grch38.over.chain.gz --REJECT b38/HG002_SVs_Tier1_v0.6.rejected.vcf.gz -R /restricted/projectnb/casa/ref/GRCh38_full_analysis_set_plus_decoy_hla.fa 10:20:35.165 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/share/pkg.7/gatk/4.1.7.0/install/bin/gatk-package-4.1.7.0-local.jar!/com/intel/gkl/native/libgkl_compression.so [Sun Jul 26 10:20:35 EDT 2020] LiftoverVcf --INPUT b37/HG002_SVs_Tier1_v0.6.vcf.gz --OUTPUT b38/HG002_SVs_Tier1_v0.6.hg38.vcf.gz --CHAIN grch37_to_grch38.over.chain.gz --REJECT b38/HG002_SVs_Tier1_v0.6.rejected.vcf.gz --REFERENCE_SEQUENCE /restricted/projectnb/casa/ref/GRCh38_full_analysis_set_plus_decoy_hla.fa --WARN_ON_MISSING_CONTIG false --LOG_FAILED_INTERVALS true --WRITE_ORIGINAL_POSITION false --WRITE_ORIGINAL_ALLELES false --LIFTOVER_MIN_MATCH 1.0 --ALLOW_MISSING_FIELDS_IN_HEADER false --RECOVER_SWAPPED_REF_ALT false --TAGS_TO_REVERSE AF --TAGS_TO_DROP MAX_AF --DISABLE_SORT false --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 2 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false Jul 26, 2020 10:20:35 AM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine INFO: Failed to detect whether we are running on Google Compute Engine. [Sun Jul 26 10:20:35 EDT 2020] Executing as farrell@scc-hadoop.bu.edu on Linux 3.10.0-1062.12.1.el7.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_121-b13; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:4.1.7.0 INFO 2020-07-26 10:20:35 LiftoverVcf Loading up the target reference genome. INFO 2020-07-26 10:20:56 LiftoverVcf Lifting variants over and sorting (not yet writing the output file.) [Sun Jul 26 10:20:56 EDT 2020] picard.vcf.LiftoverVcf done. Elapsed time: 0.36 minutes. Runtime.totalMemory()=5861015552 To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp htsjdk.tribble.TribbleException: Badly formed variant context at location chr1:596697; getEnd() was 596797 but this VariantContext contains an END key with value 532177 at htsjdk.variant.variantcontext.VariantContext.validateStop(VariantContext.java:1401) at htsjdk.variant.variantcontext.VariantContext.validate(VariantContext.java:1383) at htsjdk.variant.variantcontext.VariantContext.(VariantContext.java:489) at htsjdk.variant.variantcontext.VariantContextBuilder.make(VariantContextBuilder.java:647) at htsjdk.variant.variantcontext.VariantContextBuilder.make(VariantContextBuilder.java:638) at picard.util.LiftoverUtils.liftVariant(LiftoverUtils.java:92) at picard.vcf.LiftoverVcf.doWork(LiftoverVcf.java:426) at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:305) at org.broadinstitute.hellbender.cmdline.PicardCommandLineProgramExecutor.instanceMain(PicardCommandLineProgramExecutor.java:25) at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:163) at org.broadinstitute.hellbender.Main.mainEntry(Main.java:206) at org.broadinstitute.hellbender.Main.main(Main.java:292)

Steps to reproduce Download vcf from here:

ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/NIST_SVs_Integration_v0.6/HG002_SVs_Tier1_v0.6.vcf.gz

gatk LiftoverVcf -I b37/HG002_SVs_Tier1_v0.6.vcf.gz -O b38/HG002_SVs_Tier1_v0.6.hg38.vcf.gz -CHAIN grch37_to_grch38.over.chain.gz --REJECT b38/HG002_SVs_Tier1_v0.6.rejected.vcf.gz -R /restricted/projectnb/casa/ref/GRCh38_full_analysis_set_plus_decoy_hla.fa

Expected behavior The original b37 vcf has a deletion here:

1 532077 ACATTCATGCTCACTCATACACACCCAGATCATATATACACTCGTGCACACATTCACACTCATACACACCCAAATCATACTCACATTCATGCACACATGTT A SVLEN=-100;;SVTYPE=DEL;END=532177;sizecat=100to299;

The liftover to hg38 should look like this: chr1 596697 REF=ACATTCATGCTCACTCATACACACCCAGATCATATATACACTCGTGCACACATTCACACTCATACACACCCAAATCATACTCACATTCATGCACACATGTT ALT=A INFO Fields SVLEN=-100; SVTYPE=DEL;END=596797;sizecat=100to299;

The error message suggests LiftoverVcf is not updating the INFO/END field from 532177 to 596797 and an error is being triggered since the END is before the start. An incorrect INFO/END will cause problems with tabix and other programs.

Actual behavior It generates an error when the INFO/END is before the start and aborts..

Feature request Liftover INFO/END

Description The INFO/END position also needs to be updated-not just the site position.

yfarjoun commented 4 years ago

This will require a new tool...LiftOverSVs. They behave in a very different way.....I don't think that we know how to lift SVs....does anyone have a good idea?

Thinking about this now for the first time...here's a strawman idea for all to hack at:

Anything else?

jjfarrell commented 4 years ago

We have project level VCFs with a mixture of SNPs and SVs. For that type of VCF, we would prefer one tool rather than two for lifting over. Including both SNPS and SVs in a pVCF is useful for gene based tests for example.

I have not tested this but translocations I would guess are probably problematic with the present liftover since the ALT fields embed chr:postions. These would be more likely found in cancer related sequencing VCFs.

From the VCF spec..... 5.4.6 Event modifiers As mentioned previously, a single rearrangement event can be described as a set of novel adjacencies. For example, a reciprocal rearrangement such as in Figure 7: Figure 7: Rearrangements would be described as:

CHROM POS ID REF ALT QUAL FILTER INFO

2 321681 bnd W G G[13 : 123457[ 6 PASS SVTYPE=BND;MATEID=bnd X;EVENT=RR0 2 321682 bnd V T ]13 : 123456]T 6 PASS SVTYPE=BND;MATEID=bnd U;EVENT=RR0 13 123456 bnd U C C[2 : 321682[ 6 PASS SVTYPE=BND;MATEID=bnd V;EVENT=RR0 13 123457 bnd X A ]2 : 321681]A 6 PASS SVTYPE=BND;MATEID=bnd W;EVENT=RR0 5.4.7 Inversions Similarly an inversion such as in Figure 8: Figure 8: Inversion can be described equivalently in two ways. Either one uses the short hand notation described previously (recommended for simple cases):

CHROM POS ID REF ALT QUAL FILTER INFO

2 321682 INV0 T 6 PASS SVTYPE=INV;END=421681 or one describes the breakends:

CHROM POS ID REF ALT QUAL FILTER INFO

2 321681 bnd W G G]2 : 421681] 6 PASS SVTYPE=BND;MATEID=bnd U;EVENT=INV0 2 321682 bnd V T [2 : 421682[T 6 PASS SVTYPE=BND;MATEID=bnd X;EVENT=INV0 2 421681 bnd U A A]2 : 321681] 6 PASS SVTYPE=BND;MATEID=bnd W;EVENT=INV0 2 421682 bnd X C [2 : 321682[C 6 PASS SVTYPE=BND;MATEID=bnd V;EVENT=INV0

5.4.6 Event modifiers

As mentioned previously, a single rearrangement event can be described as a set of novel adjacencies. For example, a reciprocal rearrangement such as in Figure 7: Figure 7: Rearrangements would be described as:

CHROM POS ID REF ALT QUAL FILTER INFO

2 321681 bnd W G G[13 : 123457[ 6 PASS SVTYPE=BND;MATEID=bnd X;EVENT=RR0 2 321682 bnd V T ]13 : 123456]T 6 PASS SVTYPE=BND;MATEID=bnd U;EVENT=RR0 13 123456 bnd U C C[2 : 321682[ 6 PASS SVTYPE=BND;MATEID=bnd V;EVENT=RR0 13 123457 bnd X A ]2 : 321681]A 6 PASS SVTYPE=BND;MATEID=bnd W;EVENT=RR0 5.4.7 Inversions Similarly an inversion such as in Figure 8: Figure 8: Inversion can be described equivalently in two ways. Either one uses the short hand notation described previously (recommended for simple cases):

CHROM POS ID REF ALT QUAL FILTER INFO

2 321682 INV0 T 6 PASS SVTYPE=INV;END=421681 or one describes the breakends:

CHROM POS ID REF ALT QUAL FILTER INFO

2 321681 bnd W G G]2 : 421681] 6 PASS SVTYPE=BND;MATEID=bnd U;EVENT=INV0 2 321682 bnd V T [2 : 421682[T 6 PASS SVTYPE=BND;MATEID=bnd X;EVENT=INV0 2 421681 bnd U A A]2 : 321681] 6 PASS SVTYPE=BND;MATEID=bnd W;EVENT=INV0 2 421682 bnd X C [2 : 321682[C 6 PASS SVTYPE=BND;MATEID=bnd V;EVENT=INV0

ldgauthier commented 4 years ago

@yfarjoun do you think those SV change proposals would cover all the GVCF cases as well?

jjfarrell commented 3 years ago

@yfarjoun Any progress on the liftover of SVs?

yfarjoun commented 3 years ago

no. I'm not familiar enough with SVs in order to do this.

Sorry.

On Tue, May 4, 2021 at 5:42 PM jjfarrell @.***> wrote:

@yfarjoun https://github.com/yfarjoun Any progress on the liftover of SVs?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/broadinstitute/picard/issues/1552#issuecomment-832267080, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAU6JUTSED7XJISVUGXKQELTMBS5HANCNFSM4PJB2Q6A .

ldgauthier commented 3 years ago

@cwhelan I'm guessing the implementation itself isn't too hard if someone could write up a little bit of pseudocode. (We can skip the really tricky complex SVs for a first pass.) Want to nominate someone to draft a procedural outline?

jjfarrell commented 3 years ago

For the liftover of the INFO/END value, this is some pseudo code that should handle most INS/DELs.

If exists(INFO/END and INFO/SVTYPE in ["INS","DEL"]):
  ORIGINAL_END=INFO/END
  ORIGINAL_POS=POS
  LIFTOVER_END=LIFTOVER_POS+(INFO/END-ORIGINAL_POS)
  INFO/END=LIFTOVER_END

A liftover of the END position would another approach and handle edge cases where the reference added sequence within the deletion. The SVLEN in that case would also need to be updated.

lgmgeo commented 4 months ago

It would be so helpful to have a LiftOverSVs tool. Any news?

lgmgeo commented 4 months ago

SV VCF format: Documentation https://samtools.github.io/hts-specs/VCFv4.4.pdf, see section "3 INFO keys used for structural variants"

LiftOver SVs: feature requests

Would it be possible to achieve these requests in the Picard Liftover? I would be so grateful!

lgmgeo commented 4 months ago

This issue is quite old but the need is still present.

If it helps, I've just implemented a liftoverSV tool. If you have any problem or question, please, feel free to create a new issue on GitHub

lgmgeo commented 4 months ago

PS: I will improve this tool based on user demand

jjfarrell commented 4 months ago

Thanks Geoffory! I will test out your new tool.

John

On Fri, Jul 12, 2024 at 4:49 PM Geoffroy Véronique @.***> wrote:

PS: I will improve this tool based on user demand

— Reply to this email directly, view it on GitHub https://github.com/broadinstitute/picard/issues/1552#issuecomment-2226334466, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAO6WDM3TBPVRXXJNRWADTTZMA6OXAVCNFSM6AAAAABKNEVSTCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEMRWGMZTINBWGY . You are receiving this because you commented.Message ID: @.***>

-- John J. Farrell, Ph.D. Biomedical Genetics-Evans 218 Boston University Medical School 72 East Concord Street Boston, MA

ph: 617-358-3562 (New Number)