DecodeGenetics / graphtyper

Population-scale genotyping using pangenome graphs
http://dx.doi.org/10.1038/ng.3964
MIT License

`graphtyper genotype` produces too many files for `graphtyper vcf_concatenate` to combine #156

Open zachary-foster opened 2 weeks ago

zachary-foster commented 2 weeks ago

graphtyper genotype outputs many small VCFs. For long references with many samples, it creates so many files that the graphtyper vcf_concatenate command line becomes too long for the shell to run:

.command.sh: line 2: /usr/local/bin/graphtyper: Argument list too long

For the dataset that caused this error, the graphtyper vcf_concatenate command was 3 million characters long. I know this is probably an unusual dataset and there are workarounds, like combining files in batches and then combining the results again, but we are using graphtyper in an automated pipeline that has to handle these cases, so it would be nice if this case were handled.
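For context, "Argument list too long" is reported when the combined size of a command's arguments and environment exceeds the system limit for starting a new process. The snippet below is a minimal sketch for checking that limit, assuming a POSIX system that provides `getconf`. The value 3000000 is the approximate command length reported above, and the variable names are illustrative only.

```shell
# Query the maximum combined size (in bytes) of arguments plus environment
# that the system accepts when executing a new process.
arg_max=$(getconf ARG_MAX)

# Approximate command length reported in this issue (characters).
cmd_len=3000000

if [ "$cmd_len" -ge "$arg_max" ]; then
    echo "a ${cmd_len}-character command exceeds ARG_MAX (${arg_max})"
else
    echo "a ${cmd_len}-character command fits within ARG_MAX (${arg_max})"
fi
```

The exact limit varies by operating system and configuration, so a pipeline that must run everywhere is safer passing file names through a file than on the command line.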

It would help if there were a way to make graphtyper genotype produce fewer but larger files, or to make graphtyper vcf_concatenate accept a file-of-filenames like graphtyper genotype does.

hannespetur commented 4 days ago

Hey, thanks for the suggestion. As a workaround, you could pass a file list to bcftools concat to concatenate the VCFs:

# Create vcf_filelist.txt with one VCF path per line, in the order they should be concatenated
bcftools concat --naive -Oz -o all.vcf.gz --file-list vcf_filelist.txt

Best, Hannes
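
The file-list step in the workaround above could be sketched as follows. This is only an illustration under stated assumptions: the function name `make_vcf_filelist` and the `results/chr1` path are hypothetical, and it assumes the per-region VCF names within one directory sort lexicographically into genomic order (for example, zero-padded coordinates). Because `find` writes the paths to a file, they are never passed as command-line arguments, which avoids the "Argument list too long" error.

```shell
# Write every *.vcf.gz path under directory $1 to file $2, one per line,
# sorted bytewise so the order is stable across locales.
# Index files such as *.vcf.gz.tbi do not match the pattern and are skipped.
make_vcf_filelist() {
    find "$1" -type f -name '*.vcf.gz' | LC_ALL=C sort > "$2"
}

# Example usage (hypothetical paths; run once per chromosome directory):
#   make_vcf_filelist results/chr1 vcf_filelist.txt
#   bcftools concat --naive -Oz -o all.vcf.gz --file-list vcf_filelist.txt
```

Building one list per chromosome keeps the ordering question simple, since a plain sort across several chromosome directories would put chr10 before chr2.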