BioPsyk / cleansumstats

Convert GWAS sumstat files into a common format with a common reference for positions, rsids and effect alleles.
https://biopsyk.github.io/metadata/#!/form/cleansumstats
14 stars 2 forks source link

investigate if .bed 's 0 position system affects our liftover #197

Closed pappewaio closed 3 years ago

pappewaio commented 3 years ago

And if so, correct it.

http://genome.ucsc.edu/blog/the-ucsc-genome-browser-coordinate-counting-systems/

pappewaio commented 3 years ago

Ok, it appears to affect the liftover. To simply take the position-1 on both START and END in the .bed file to go to 0 position system did sort of generate the right result, but the back conversion did give the right result, which indicated to me that something more was needed.

In an attempt to make life easier I tried the CrossMap software's liftover on vcf files. This was harder than it should have been, as it required an fasta file for the reference genome to work. I bite in the sour apple and made it part of the docker image, even though its size would classify as too big. I thought we could always work that out later. After getting it to work and CrossMap to find all relevant files, it was unable to map any of my example positions. I scanned the code in this github repo - which is a fork of the real CrossMap code - to find clues to the error message: 'Fail(KeyError)': https://github.com/david-a-parry/CrossMap/blob/master/bin/CrossMap.py

I could not find the error message or the clue I was searching for, but I started to feel a fear that it was an indication that something wasn't right about how they implemented this code, and that it had been left unattended and malfunctioning.

However there are some interesting parts of the code in how they do their 0 - 1 repositioning, worth summarising: https://github.com/david-a-parry/CrossMap/blob/master/bin/CrossMap.py#L426-L460

The absolutely most interesting piece on how to convert to bed was this, where they set the new start position to -1 and the end to the new start position + the length of field[3]. Assuming it is column 4 in the vcf would point to the ref column. But that would not make sense as it is only length 1 all the time. My python skills are not the best, but I think that is an error, and that it should be field[4], which is the alt column. If it is the alt column, then we get more problems for when the alt alleles are separated by comma ','. And I can't see that they are handling it here.

chrom = fields[0]           
start = int(fields[1])-1    # 0 based           
end = start + len(fields[3])
a = map_coordinates(mapping, chrom, start, end,'+')

The back-conversion from bed to vcf looks like this instead. I figured out that 'a' is a list with two elements a[0] and a[1], first with the positions before liftover and the second with positions after liftover. In the example below you can see how the chromosome is first set, and then the new start position + 1 to get back to vcf 1-position format. Ok so that is their suggested way on how to do a proper repositioning. It does make sense, except that they use ref instead of alt to determine the end position.

fields[0] = a[1][0]             
# update start coordinate               
fields[1] = a[1][1] + 1

Before implementing something similar myself I wanted to confirm with another source on how the do. There is a relatively recent release of a software called 'bedr', which does conversions using the functions 'vcf2bed' and 'bed2vcf'. https://github.com/cran/bedr

Looking at the interesting code for 'vcf2bed', it seems to start the same as CrossMap, but using a more sophisticated update of the end, or at least they picked the right column to do it. It still seems ambigous on how to do this properly for multiallelic indels. For us it doesn't really matter right now as we don't support indels anyways, but a solution for the future that I see would be to split these into their own rows before the liftover.

chr  <- x$CHROM;    
start <- x$POS-1;
end <- x$POS;

#Updating end
    for (i in 1:length(x$ALT)){

        if (grepl(',', x$ALT[i])) {
            #warn if ALT has a comma (ambiguous variant or heterozygous)
            warning("ALT contains a comma and the variant length was decided based on the first element of ALT.")

            ALT.list <- strsplit(x$ALT[i],',');
            end[i] <- x$POS[i]+nchar(ALT.list[[1]][1])-1;
            }
        else {
            end[i] <- x$POS[i]+nchar(x$ALT[i])-1;
            }
        }

Then looking at the interesting code for 'bed2vcf', they just as CrossMap to a simple +1 on start to get the new vcf position. I added here the for loop just to show how they fill all columns they initiated to NA. Perhaps not they most effective way of filling a data.frame 😸

    vcf <- data.frame( CHROM = x$chr, POS = x$start + 1, ID = NA, REF = NA, QUAL = NA, FILTER = NA, INFO = NA,  stringsAsFactors = FALSE);

    for (default in default.vcf.field.names[default.vcf.field.names %in% colnames(x)]) {        vcf[,default] &lt;- x[,default];        }

Even though it was tempting to install and just use their solution, it appears it relies on three external software installed and available in path, so not that straightforward after all. The only complicated thing anyways appears to be on how to update the new end, so reimplementing in awk should be straight forward in comparison to try to run their software. Alright, let us see how it goes.

pappewaio commented 3 years ago

The plan is to use the liftover using the bed format, as it has proved to work for most positions previously. Now the difference will be that I do these conversions of position formats from vcf to bed and then back to vcf again.

pappewaio commented 3 years ago

Seems like it works from the unit test created. See the branch issue-197 for how it was done, and please continue and do it for the remaining 2 liftovers. In case I am not able before tomorrow.

pappewaio commented 3 years ago

Ok, it seems I managed to finish this issue. See PR #198