Bioconductor / VariantAnnotation

Annotation of Genetic Variants
https://bioconductor.org/packages/VariantAnnotation
23 stars 20 forks source link

Add a covariate to VRange object #47

Closed xiw588 closed 3 years ago

xiw588 commented 3 years ago

Hi, I want to add an additional column to my VRange object.

I read in a multi-vcf file first and then make it into a VRange object. I have sampleNames for all of my samples but want to further annotate them into different subsets. Is there any way that I could achieve this goal?

vjcitn commented 3 years ago

In the following I take the pre-existing VRanges vr (from example(VRanges)) and add a variable mycov. Is this adequate?

> vr
VRanges object with 2 ranges and 1 metadata column:
      seqnames    ranges strand         ref              alt     totalDepth
         <Rle> <IRanges>  <Rle> <character> <characterOrRle> <integerOrRle>
  [1]     chr1       1-5      *           T                C             12
  [2]     chr2     10-20      *           A                T             17
            refDepth       altDepth   sampleNames softFilterMatrix |
      <integerOrRle> <integerOrRle> <factorOrRle>         <matrix> |
  [1]              5              7             a             TRUE |
  [2]             10              6             b            FALSE |
      tumorSpecific
          <logical>
  [1]         FALSE
  [2]          TRUE
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths
  hardFilters(1): coverage
> vr$mycov = c("a", "b")
> vr
VRanges object with 2 ranges and 2 metadata columns:
      seqnames    ranges strand         ref              alt     totalDepth
         <Rle> <IRanges>  <Rle> <character> <characterOrRle> <integerOrRle>
  [1]     chr1       1-5      *           T                C             12
  [2]     chr2     10-20      *           A                T             17
            refDepth       altDepth   sampleNames softFilterMatrix |
      <integerOrRle> <integerOrRle> <factorOrRle>         <matrix> |
  [1]              5              7             a             TRUE |
  [2]             10              6             b            FALSE |
      tumorSpecific       mycov
          <logical> <character>
  [1]         FALSE           a
  [2]          TRUE           b
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths
  hardFilters(1): coverage
xiw588 commented 3 years ago

Thank you! I tried the same way and it worked.

xiw588 commented 3 years ago

Actually I have a followup question on reading in a large vcf. So I have a 20GB compressed vcf that I want to read in. Can you please take a look whether the command I use is correct and how long do you think it will take?

fl="all.vcf.gz" vcffile = open(VcfFile(fl, yieldSize=10000)) repeat { vcf = readVcf(vcffile, "hg19") if (nrow(vcf) == 0) break

do work on chunk

}

Thank you!

vjcitn commented 3 years ago

it looks right to me but support.bioconductor.org might be a better venue for a question like this. you could estimate the time needed by timing a fixed number of yields and then extrapolating to the full number required.