brentp / vcfanno

annotate a VCF with other VCFs/BEDs/tabixed files
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0973-5
MIT License
365 stars 56 forks source link

Write into the ID column #51

Closed vladsavelyev closed 7 years ago

vladsavelyev commented 7 years ago

Hi Brent,

Thank you for vcfanno, it's really great to be able to delegate nearly all VCF post-processing job to one fast tool.

I wonder if there a way to put an annotation value into the ID column in VCF. I used to add the rsIDs and COSMIC ids into the ID column, and I don't seem to find a way to configure vcfanno do that instead of appending to INFO.

As a quick solution, I could just utilise bcftools to remap the INFO field over to ID. But it would be much better to keep all annotation within on vcfanno call with a proper config file.

What do you think about this?

brentp commented 7 years ago

this seems like a good idea. I think if the conf has names=["ID"] then it could set the ID column. I'll try to get to it this week.

vladsavelyev commented 7 years ago

Sounds great!

A follow-up question (somewhat related to #19), I wonder what would be the behaviour in case of multiple ID annotations, like:

[[annotation]]
file="dbsnp.vcf.gz"
fields=["ID"]
names=["ID"]
ops=["self"]

[[annotation]]
file="cosmic.vcf.gz"
fields=["ID"]
names=["ID"]
ops=["self"]

With the implementation in your mind, would be it replace it or append? It's not uncommon to see VCF records with IDs like rs763287238,rs864622601;COSM1721152 after annotation in my experience. It's okay if it will require a temporary field and an extra [[postannotation]] section to concatenate and remove, but maybe the ID is that kind of field that should always be appended by default (dropping any duplicates thought)?

brentp commented 7 years ago

Hi @vladsaveliev I just pushed a fix for this. It only allow setting ID in [[postannotation]] by using name="ID". You can grab the IDs from e.g. dbsnp and cosmic as in your example above by the usual means in [[annotation]] blocks and then concat them or whatever in the [[postannotation]]

brentp commented 7 years ago

This is out in the latest release. Please give it a try and let me know any issues.

vladsavelyev commented 7 years ago

Thanks Brent,

I tried the following conf:

[[annotation]]
file="/home/vsaveliev/bcbio/genomes/Hsapiens/hg19/variation/dbsnp-147.vcf.gz"
fields=["ID"]
names=["rs_ids"]
ops=["concat"]

[[postannotation]]
name="ID"
fields=["rs_ids", "ID"]
op="lua:rs_ids .. ';' .. ID"
type="String"

And ran like this:

vcfanno ann.toml vars.vcf > ann.vcf

It complains about the missing -lua flag:

vcfanno.go:105: ERROR: requested lua op without specifying -lua flag

Not clear what should be provided inside the lua script? All ops look built-in.

brentp commented 7 years ago

yeah, it thinks you need a lua script since you're using lua. Just add -lua <(echo "") or an empty file.

vladsavelyev commented 7 years ago

Thanks!

When it appends to an empty ID (with empty content represented by .), the values ends up rs369203887;.. I guess some cool lua code can be set to erase dots, but I'm thinking it might be reasonable to make it a default behaviour? I might be wrong.

vladsavelyev commented 7 years ago

Also the postannotation section complains about the records that do not have rsids:

api.go:566: ERROR: in lua postannotation at chr21:10614769 for ID.
<string>:1: cannot perform concat operation between nil and string
stack traceback:
        <string>:1: in main chunk
        [G]: ?
empty values were: [rs_ids]
values were: [.]
code is: rs_ids .. ';' .. ID
vcfanno.go:166: Info Error: rs_ids not found in INFO >> this error/warning may occur many times. 

That's fine and it proceeds with annotation correctly when there is only dbSNP to annotate. However when I add COSMIC, it fails to process those record that are found in only one source our of 2:

[[annotation]]
file="/home/vsaveliev/bcbio/genomes/Hsapiens/hg19/variation/dbsnp-147.vcf.gz"
fields=["ID"]
names=["rs_ids"]
ops=["self"] 

[[annotation]]
file="/home/vsaveliev/bcbio/genomes/Hsapiens/hg19/variation/CosmicCodingMuts_v77.vcf.gz"
fields=["ID"]
names=["cosmic_ids"]
ops=["self"]

[[postannotation]]
name="ID"
fields=["rs_ids", "cosmic_ids", "ID"]
op="lua:rs_ids .. ';' .. cosmic_ids .. ';' .. ID"
type="String"

Only records that are found both in dbSNP and COSMIC are processed correctly:

chr21     15599144  .                                             T      C      142.0   REJECT;REJECT                                          STATUS=Germline;SAMPLE=syn3-tumor;TYPE=SNV;SHIFT3=0;MSI=8;MSILEN=1;SSF=0.58433;SOR=1;LSEQ=TAATACATGTAAACTGTTAA;RSEQ=TTTTTTTCCTTAACCTTTTG;rs_ids=rs2822444
chr21     15599466  rs7280643;COSM149211;.                        A      G      140.0   REJECT;REJECT                                          STATUS=Germline;SAMPLE=syn3-tumor;TYPE=SNV;SHIFT3=0;MSI=2;MSILEN=1;SSF=0.36514;SOR=1.24714;LSEQ=CAGCTCATATAAATGGACTC;RSEQ=CCAACAACCAAGTGACTCTG;rs_ids=rs7280643;cosmic_ids=COSM149211

I guess a more complex lua script will help?

brentp commented 7 years ago

with stuff like this, you have full control over what happens in lua. To test, I just write a lua function and test it using any lua interpreter. here's what I came up with:

function setid(...)
    local t = {...}
    local res = {}
    for i, v in ipairs(t) do
        if v ~= nil and v ~= "" then
            res[#res+1] = v
        end
    end
    return table.concat(res, ";")
end

that will work for any number of fields. You'll use it like:

[[postannotation]]
name="ID"
fields=["rs_ids", "cosmic_ids", "ID"]
op="lua:setid(rs_ids, cosmic_ids, ID)"
type="String"

then you'll put the lua code into some.lua and pass it to vcfanno as -lua some.lua

vladsavelyev commented 7 years ago

Great, thanks! Hoped to avoid diving into lua, but I guess I gotta - and that's cool, it looks powerful.

Thanks again for adding all new features Brent!

brentp commented 7 years ago

sure. let me know any other issues. If you are hitting them the others probably are as well. I'll add that function to the examples/custom.lua in the repo.

vladsavelyev commented 7 years ago

One small edit to handle the empty values (empty IDs are represented by dots . by standard):

function setid(...)
    local t = {...}
    local res = {}
    for i, v in ipairs(t) do
        if v ~= "." and v ~= nil and v ~= "" then
            res[#res+1] = v
        end
    end
    return table.concat(res, ";")
end

However it doesn't work because rs_ids and cosmic_ids are optional in INFO, so when vcfanno doesn't find them, it errors out with Info Error: cosmic_ids not found in INFO >> this error/warning may occur many times. reporting once here..., and it doesn't even call the lua script for such variants. I guess some extra step is needed, by I didn't yet figured out which. I'll dig into it some other day if don't have a quick solution in your mind already.

brentp commented 7 years ago

can you send an example VCF and conf that illustrates the problem? This could be a bug. Wne you see the error, it's actually just a wrning, but you should stall have the function-call if any of the arguments were present.

vladsavelyev commented 7 years ago

Sure, here is a small example: example.zip

vcfanno -lua some.lua small.toml small.vcf.gz

Only chr21:10910311 is found in COSMIC, and both chr21:9483406 and chr21:10910311 are found in dbSNP. However, vcfanno skips chr21:9483406 and doesn't set anything into ID for that variant.

brentp commented 7 years ago

Thanks very much for the nice test-case. I have made a new release with the fix here: https://github.com/brentp/vcfanno/releases/tag/v0.2.1

I also added your example to the test-suite. Thanks again.

vladsavelyev commented 7 years ago

Brilliant, thanks!

vladsavelyev commented 7 years ago

Brent, sorry to bother you again - but one more potential issue.

[[annotation]]
file="dbsnp.small.vcf.gz"
fields=["ID", "CAF"]
names=["rs_ids", "CAF"]
ops=["self", "self"]

If I add an optional INFO field to this config, vcfanno would refuse to run lua for the records missing this field (the same way it was with ID before your fix). A small example to reproduce: example.zip

brentp commented 7 years ago

Shoot. I only delayed the problem in the last "fix". I think it's properly resolved here: https://github.com/brentp/vcfanno/releases/tag/v0.2.2 thanks again for the test-case.

vladsavelyev commented 7 years ago

Great! And thanks for the delete operation, with that I have all I want

dridk commented 4 years ago

I just put the answer here, because you didn't give it explicitly

[[annotation]]
file="dbsnp_146.hg38.vcf.gz"
fields = ["ID"]
ops=["self"]
names=["rsid"]

[[postannotation]]
name="ID"
fields=["rsid"]
op="setid"
type="String"