freeseek / score

Tools to work with GWAS-VCF summary statistics files
MIT License

Liftover should probably sort output VCF #8

Open davmlaw opened 2 months ago

davmlaw commented 2 months ago

Thanks for the tool, it's extremely fast and seems to work well so far (I'm currently evaluating it).

The output VCF appears to be written variant by variant, in the order of the source file.

Because liftover can change the relative order of variants, the output is sometimes unsorted. For instance:

GRCh38 input (correct order):

1   13115599    11730   G   A   .   .   .
1   13259273    12448   G   A   .   .   .

GRCh37 output:

1   13183071    11730   G   A   .   .   .
1   13112676    12448   C   T   .   .   FLIP

This produces an unsorted file, which triggers the warning:

Warning: The file is not sorted, for example 1:13112676 comes after 1:13183071
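
To make the problem concrete, here is a minimal Python sketch that checks the two records above for coordinate order. The function name `first_unsorted` is made up for illustration and is not part of BCFtools or the liftover plugin; it only compares positions within the same chromosome, which is enough for this example.

```python
def first_unsorted(records):
    """Return the first adjacent (previous, current) pair that is out of
    order, or None if the records are sorted.

    records: iterable of (chrom, pos) tuples in file order.
    Only positions on the same chromosome are compared.
    """
    prev = None
    for rec in records:
        if prev is not None and rec[0] == prev[0] and rec[1] < prev[1]:
            return prev, rec
        prev = rec
    return None


# CHROM and POS copied from the example records in this issue
grch38_input = [("1", 13115599), ("1", 13259273)]
grch37_output = [("1", 13183071), ("1", 13112676)]

print(first_unsorted(grch38_input))   # None
print(first_unsorted(grch37_output))  # (('1', 13183071), ('1', 13112676))
```

The second call reports the same pair as the warning: 1:13112676 comes after 1:13183071.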

Workaround

Don't pass "-o" to liftover. Instead, pipe its output into bcftools sort and write the output file from there

davmlaw commented 2 months ago

I am not sure what the bcftools convention is: should each command output a valid VCF, or can you rely on users running sort? Sorting is in your instructions, but people will forget and leave it out (I did!), so I think it's best to output valid VCFs by default

Perhaps you could add an option to skip sorting, for efficiency when people already sort later in their pipelines

freeseek commented 2 months ago

BCFtools/liftover is designed as a BCFtools plugin that processes each VCF record independently, so this would require a large change in the code. One possibility would be to implement in BCFtools an option to sort the output of a plugin, but this would need to be a change within the BCFtools repository. I will change the examples shown when you run bcftools +liftover -h to reflect the need to sort
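
The trade-off described here can be illustrated with a toy Python model. This is only a sketch and not how BCFtools or the plugin is implemented: `TOY_MAP`, `lift_streaming` and `lift_sorted` are hypothetical names, and the coordinate map contains just the two positions from the example in this issue rather than a real chain file.

```python
# Toy GRCh38 -> GRCh37 position map built from the two example records
# in this issue (an assumption for illustration; a real liftover reads a chain file).
TOY_MAP = {13115599: 13183071, 13259273: 13112676}


def lift_streaming(records):
    """Lift and emit each record as soon as it is read.

    Memory use is constant, but the output keeps the input order,
    so it may no longer be sorted by the new coordinates.
    """
    for chrom, pos, vid in records:
        yield chrom, TOY_MAP[pos], vid


def lift_sorted(records):
    """Buffer every lifted record, then sort by (chrom, pos).

    Nothing can be written until the whole input has been read.
    Chromosomes are compared as strings here, which is a simplification.
    """
    return sorted(lift_streaming(records), key=lambda r: (r[0], r[1]))


records = [("1", 13115599, "11730"), ("1", 13259273, "12448")]

print(list(lift_streaming(records)))
# [('1', 13183071, '11730'), ('1', 13112676, '12448')]
print(lift_sorted(records))
# [('1', 13112676, '12448'), ('1', 13183071, '11730')]
```

The streaming version matches the per-record behaviour described above, while the sorted version shows why sorting needs a separate buffering step, such as piping into bcftools sort.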