DataBiosphere / analysis_pipeline_WDL

Collection of WDL workflows based off the University of Washington TOPMed DCC Best Practices for GWAS. The WDL structure was based upon CWLs written by the Seven Bridges development team.
6 stars 3 forks source link

[ld-prune] If only 1 file input, error will incorrectly claim you've duplicate sample/variant IDs #71

Closed aofarrel closed 2 years ago

aofarrel commented 2 years ago

If someone wants to double-check my work here, that'd be appreciated!

It appears this is due to how SeqArray works. Quite understandably, the merging function doesn't expect to be called on an array with only one file in it. It's possible we could skip the WDL task that calls the merging R script if there is only one file given to the ld pruning workflow, but I'm not certain if there's anything else going on in that task which could invalidate the results.

Rough overview:

[line 282]
# open all GDS files
flist <- vector("list", length(gds.fn))

[line 298]
# samples
samp.id <- samp2.id <- seqGetData(flist[[1L]], "sample.id") # get "sample.id" for first GDS file and set it to samp.id and samp2.id
for (f in flist[-1L]) # for file in list of files
{
    s <- seqGetData(f, "sample.id") # get sample.id of current file
    samp.id <- unique(c(samp.id, s)) # combine every sample id of current file into a list, then find what is unique in it, then overwrite samp.id with those unique values -- basically, with every iteration, this becomes an increasingly large list of unique sample ids
    samp2.id <- intersect(samp2.id, s) # with every iteration, this becomes an increasingly large list of non-unique sample ids
}

[line 358]
# common samples
is_merge_variant <- length(samp2.id)>0L || length(samp.id)==0L 
# if there is at least one intersecting sample ID (samp2.id) AND/OR there are no unique sample IDs (samp.id), then is_merge_variant is true

# if is_merge_variant is true
if (is_merge_variant)
{
    if (length(variant2.id) > 0L) # if there is at least one overlapping sample ID
    {
        stop("There are overlapping on both samples and variants, ",
            "please merge different samples and variants respectively.")
    }

You can verify your data does not have duplicated IDs using R.

library(gdsfmt)
library(SeqArray)

genofile <- seqOpen("[vcftogds pipeline output]")
write.csv(seqGetData(genofile, "variant.id"), "allvars.csv")
write.csv(seqGetData(genofile, "sample.id"), "allsamples.csv")
write.csv(duplicated(seqGetData(genofile, "variant.id")), "dupevars.csv")
write.csv(duplicated(seqGetData(genofile, "sample.id")), "dupesamples.csv")

genofilesubset <- seqOpen("[subset_gds step of ld prune pipeline output]")
write.csv(seqGetData(genofilesubset, "variant.id"), "subset_allvars.csv")
write.csv(seqGetData(genofilesubset, "sample.id"), "subset_allsamples.csv")
write.csv(duplicated(seqGetData(genofilesubset, "variant.id")), "subset_dupevars.csv")
write.csv(duplicated(seqGetData(genofilesubset, "sample.id")), "subset_dupesamples.csv")
aofarrel commented 2 years ago

This was sidestepped in association-aggregate by counting the number of chrs, so it can probably be solved here too.