Collection of WDL workflows based off the University of Washington TOPMed DCC Best Practices for GWAS. The WDL structure was based upon CWLs written by the Seven Bridges development team.
6
stars
3
forks
source link
[ld-prune] If only 1 file input, error will incorrectly claim you've duplicate sample/variant IDs #71
If someone wants to double-check my work here, that'd be appreciated!
It appears this is due to how SeqArray works. Quite understandably, the merging function doesn't expect to be called on an array with only one file in it. It's possible we could skip the WDL task that calls the merging R script if there is only one file given to the ld pruning workflow, but I'm not certain if there's anything else going on in that task which could invalidate the results.
Rough overview:
[line 282]
# open all GDS files
flist <- vector("list", length(gds.fn))
[line 298]
# samples
samp.id <- samp2.id <- seqGetData(flist[[1L]], "sample.id") # get "sample.id" for first GDS file and set it to samp.id and samp2.id
for (f in flist[-1L]) # for file in list of files
{
s <- seqGetData(f, "sample.id") # get sample.id of current file
samp.id <- unique(c(samp.id, s)) # combine every sample id of current file into a list, then find what is unique in it, then overwrite samp.id with those unique values -- basically, with every iteration, this becomes an increasingly large list of unique sample ids
samp2.id <- intersect(samp2.id, s) # with every iteration, this becomes an increasingly large list of non-unique sample ids
}
[line 358]
# common samples
is_merge_variant <- length(samp2.id)>0L || length(samp.id)==0L
# if there is at least one intersecting sample ID (samp2.id) AND/OR there are no unique sample IDs (samp.id), then is_merge_variant is true
# if is_merge_variant is true
if (is_merge_variant)
{
if (length(variant2.id) > 0L) # if there is at least one overlapping sample ID
{
stop("There are overlapping on both samples and variants, ",
"please merge different samples and variants respectively.")
}
You can verify your data does not have duplicated IDs using R.
If someone wants to double-check my work here, that'd be appreciated!
It appears this is due to how SeqArray works. Quite understandably, the merging function doesn't expect to be called on an array with only one file in it. It's possible we could skip the WDL task that calls the merging R script if there is only one file given to the ld pruning workflow, but I'm not certain if there's anything else going on in that task which could invalidate the results.
Rough overview:
You can verify your data does not have duplicated IDs using R.