freeseek / score

Tools to work with GWAS-VCF summary statistics files
MIT License
94 stars 6 forks source link

Feature request: Add INFO about cause of rejection in reject VCF #10

Open davmlaw opened 2 months ago

davmlaw commented 2 months ago

Liftover can fail for a variety of reasons...

Many of the liftover functions return -1 on any error, and the check is < 0

If you defined a range of negative constants, you could return specifically what went wrong, then look it up then add that to the INFO of rejected variants

freeseek commented 2 months ago

If a variant is dropped it means that neither of the two anchors could be mapped using the chain files or if the two anchors mapped to different chains, or if the two anchors mapped to locations too far from each other. Compared to other liftover tools the reasons for dropping a variant are more limited and basically it is always related to the chain breaking or missing at the locus. Would this information really be useful?

davmlaw commented 1 month ago

When a liftover of their data fails, the bio people pick half dozen variants then we do a deep dive to try and work out what happened. This is to verify that it's not due to a bug in the code

I think it reassures them that things are working correctly, and they appreciate much more seeing an error "two anchors mapped to locations too far from each other" vs "liftover failed"

I can try and write a pull request but haven't written C in close to 20 years...

freeseek commented 1 month ago

Maybe give me some minimal documentation for how you would like to see the INFO field engineered, given the three possible return codes, in a way that you deem usable by bio people, and I will write the code myself

davmlaw commented 1 month ago

For my particular use case, I'll be processing the VCF file then displaying it in a GUI for the BIO people

There are no enums in VCF types so I guess just make it a string and fill it in with constants, I'd expect the info to look a bit like:

##INFO=<ID=REJECTION_REASON,Number=A,Type=String,Description="Reason for rejection. Will be 1 of NO_MAP (neither of the two anchors could be mapped using the chain files), DIFFERENT_CHAINS (two anchors mapped to different chains) or DISTANCE_EXCEEDED (the two anchors mapped to locations too far from each other)">

But feel free to change to whatever you think best. Thanks a lot!

davmlaw commented 4 weeks ago

variants can also end up in the reject list for other reasons, eg the contig not being defined in the header - there should be a code for this (eg VCF record error) to distinguish it from ones to do with chains

freeseek commented 3 weeks ago

Try to see if the development version here fixes both problems

davmlaw commented 2 weeks ago

Hi, thanks looks good.

Ran dev version with --write-reject and VCF was written with FILTER containing the reason